Major Bump to Ingress Bandwidth

I’m usually looking at totals over 4 weeks; these don’t change much… unless there’s another GC bug (-:


Don’t make me look! Please don’t make me look back four weeks. :worried:

(because it’s the same number…)

I have a small question for everyone (or the Storj team):
Are the bloom filters sent to every node at the same time? And if so, wouldn’t it (or does it) impact network performance while every node is processing its bloom filter?

No idea whether this is actually sent at the same time, but the implementation suggests so—all bloom filters are generated in one go, with a single function call in code.

And, well, yes, I would suspect there might be a measurable network-wide impact from it. Not necessarily a large one; statistically significant is not the same as practically significant.


I think with hashstore it shouldn’t impact node performance. Moreover, nodes with hashstore should get more ingress while other nodes are busy working through bloom filters (in theory).

In general, probably yes, but I have my own reservations about the hashmap implementation.

The effects of bloom filters being sent out seem to take 8 hours or so. Perhaps each filter is sent as it’s completed… but the whole job takes hours to complete?
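
If you want to pin down that timing on your own node, the log is the easiest place to look. A rough sketch, assuming a Docker node named `storagenode` and that retain / gc-filewalker activity is logged (the exact message text varies by storagenode version):

```
# When did the bloom filter arrive, and how long did processing take?
docker logs storagenode 2>&1 | grep -iE "retain|gc-filewalker"
```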

(Edited to add the quote, so you can tell what I’m talking about)


Sorry, not sure what you are asking about?

Yes, but everyone says they notice when a bloom filter is being processed. So could a customer notice it too, in a “bad timing” scenario? Because during that time all hard drives are 100% utilized and have less headroom for writes.

My drives are not 100% utilized during either bloom processing or the piecestore filewalker. The bottleneck is a single thread on a CPU, not the disk subsystem.

I would expect this to be the case on most real setups where access to metadata is accelerated.

So I don’t think any difference would be felt by customers.
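
For anyone who wants to verify where the bottleneck sits on their own setup, a quick sketch (assuming Linux with sysstat and procps installed):

```
iostat -x 5    # %util near 100 on the node’s drive  => disk-bound
top -H         # one thread pinned at ~100% CPU      => CPU-bound
```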


Yo B Rabbit,

I don’t think I’ve ever asked - but are you running a pool of disks, or standalone disks?

I know you run with accelerated metadata, which raises the question - how do you do that with a single metadata device, if running multiple pools?

Cheers

With persistent L2ARC you can use a single SSD for all pools (each with its own partition). If it dies, you just insert a new one.
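
For illustration, a minimal sketch of that layout; pool and device names are made up, and persistent L2ARC assumes OpenZFS 2.0+ (governed by the `l2arc_rebuild_enabled` module parameter, which defaults to on):

```
zpool add tank1 cache /dev/disk/by-id/nvme-SSD-part1
zpool add tank2 cache /dev/disk/by-id/nvme-SSD-part2
cat /sys/module/zfs/parameters/l2arc_rebuild_enabled   # 1 = L2ARC survives export/import
```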

Yeah same deal with special-metadata (though you really need to have a mirror). Two SSDs that you just carve same-sized partitions off of… then start adding them as pairs to be special-metadata mirrors to your pools. Works great!
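
In command form that’s roughly the following (hypothetical pool and device names, one partition pair per pool):

```
zpool add pool1 special mirror /dev/disk/by-id/ata-SSD_A-part1 /dev/disk/by-id/ata-SSD_B-part1
zpool add pool2 special mirror /dev/disk/by-id/ata-SSD_A-part2 /dev/disk/by-id/ata-SSD_B-part2
```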

Yes, that means if an SSD fails… you have to carve the new one into the same set of partitions and “zpool replace” each of them individually. But that’s no great burden.
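
A sketch of that replacement procedure, again with made-up names: partition the new SSD identically (sgdisk/parted), then replace each mirror member and let it resilver:

```
zpool replace pool1 /dev/disk/by-id/ata-SSD_OLD-part1 /dev/disk/by-id/ata-SSD_NEW-part1
zpool replace pool2 /dev/disk/by-id/ata-SSD_OLD-part2 /dev/disk/by-id/ata-SSD_NEW-part2
zpool status -v    # watch the resilver progress
```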

Single pool for everything, single special device, no L2ARC.

Oh, I did not know that was possible.

Did you split the special devices into as many partitions as you have disks? What happens if you need to add/remove a disk - can you then add/remove space in the special metadata device?

What happens if your special metadevice dies?

You mean @Roxor’s suggestion?

Yes, it’s technically possible: you can create a special device using partitions on a single SSD mirror instead of whole disks. But it’s highly discouraged, both from a reliability and a performance perspective.

I would not recommend doing it.

Special device loss equals pool loss. That’s why it’s usually recommended to use the same fault tolerance as the rest of the vdevs. For example, my pool has three raidz1 vdevs, each consisting of 4 drives; the metadata device is a mirror of Intel P3600 PCIe SSDs. They have roughly the same MTBF as the disks, so the risk of losing the special device is roughly the same as the risk of losing a vdev. It’s balanced.
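
For reference, that layout would be created roughly like this (device names are placeholders):

```
zpool create tank \
  raidz1 sda sdb sdc sdd \
  raidz1 sde sdf sdg sdh \
  raidz1 sdi sdj sdk sdl \
  special mirror nvme0n1 nvme1n1
```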

You also want the special device to be large enough to fit all metadata and still have space left over, to minimize flash wear. Then, if you manage to snatch a really huge SSD on eBay, you can configure your datasets to send small files up to a specified threshold to the special device as well. This gives you the best of both worlds: small files, where latency determines performance, live on SSD and don’t occupy much space (there are a lot of them, but they are small), while large files, where sequential throughput matters, live on HDD. Each type of storage does what it does best.
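
That small-file redirection is set per dataset, e.g. (the dataset name and the 64K threshold are just examples; keep the threshold below the recordsize, otherwise every block lands on the special vdev):

```
zfs set special_small_blocks=64K tank/storj
zfs get recordsize,special_small_blocks tank/storj
```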

I’d quibble with the wording :wink: . ZFS recommends it controls the entire device, no matter what feature you’re using or workload you have. I wouldn’t say it’s “highly discouraged”. And certainly not for performance reasons: SSDs are monstrously fast… even more so at higher queue depths… and Storj workloads are glacial coming in from the Internet.

As for reliability: yes, if a pair of SSDs serving as metadata devices for multiple pools dies, those pools are toast. But SSDs have 1/10th the failure rate of the HDDs they’re caching for… and in configs where you have, say, 24 HDDs in a 4U chassis… it’s unrealistic to also have 48 mirrored SSDs to cover them all.

You definitely have to understand how things fit together. But I’d estimate that with something like 24x12TB HDDs for Storj nodes… a pair of 2TB M.2/U.2 SSDs could comfortably cover metadata for them all (and only reach about 75% full, worst case). Tons of unused space left for endurance and sustained write speeds.
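
If anyone wants to sanity-check such an estimate on a live pool, per-vdev allocation is visible with (pool name is a placeholder):

```
zpool list -v tank    # SIZE/ALLOC/FREE per vdev, including the special mirror
```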

Perhaps.

Correct. But to consider deviating from a recommendation one must have a very good reason. Doing so without a very good reason is “highly discouraged”, and I don’t see any reason here.
Any non-recommended step adds another hole to a layer of the Swiss cheese model. Why increase risk needlessly?

Think of the whole scenario: a collection of small independent pools as opposed to one large pool containing the same collection of drives. I don’t see a realistic use case where a collection of multiple single-drive pools would be used in production and still leave enough I/O headroom to also host Storj. If that performs, a single pool will also perform, with more headroom thanks to IOPS aggregation. The performance penalty comes from partitioning the IOPS budget rigidly between drives: one drive can get overwhelmed while others sit idle. Pooling all disks spreads IOPS evenly.

On the contrary, according to the data sheets, the Intel DC P3600, for example, has a 20% lower MTBF than, say, a Seagate Exos X20 (2M hours vs 2.5M hours). They are of the same order of magnitude: it would not have made sense to produce much more or much less reliable devices. Nobody wants regressions, and nobody wants to throw money down the drain.

Precisely my point. Fixing a bad idea with another bad idea is not a winning strategy. Having 20 pools, each consisting of 1 drive, and then solving the resulting performance issues by partitioning an SSD is just that: continuing the descent down the slippery slope of band-aids.

Instead, hard drives should be assembled into vdevs sized reasonably and appropriately for the workload, that collection of vdevs put into a single pool for load sharing, and accelerator devices added only if pool performance is still not satisfactory. This would yield a better system and avoid going against recommendations.

Setting aside the Storj recommendation of one disk per node: if I combined the same SSDs with the same HDDs in another way, none of the hardware becomes more or less reliable. If I’m running the same nodes, their load is the same. And…

…and pooling with no parity dramatically increases the impact of failures. Why increase risk needlessly? The workload rewards capacity. If one drive failure can take out only one of many small pools, why let it take out one large one instead?

I’d point to overviews like this. The reliability portion is around 13:40. My experience isn’t statistically significant, but I’ve lost way more HDDs than SSDs. I’ve got Intel X25-Ms from 2009 still humming along, but every HDD from that time is long dead. But yes, I know you could find someone else who says the exact opposite :slight_smile:

I don’t see it as a bad idea. You take the recommended one-disk-per-node setup and move just the metadata to a mirrored SSD, so all filewalker/GC housekeeping is effectively instant. The raw IOPS of the HDD have always been sufficient for regular uploads and downloads, and need no help.

Maybe hashstore is going to make using flash a waste? I don’t mind using a pair of SSDs now, but there will always be new projects that could use them instead…

This is not violated by having fifteen nodes on a pool that contains fifteen drives.

Yes and no: performance suffers due to rigid partitioning. Wear may increase for the same reason. Artificially imposed constraints cannot make things better, but can make things worse.

Nobody uses pooling without parity outside of some exotic scenarios, usually ones where every last ounce of performance is needed at any expense. Such pools are not suitable for running Storj.

Also, we already established that we don’t design for Storj; we design for what we need, and Storj gets to tag along. If you do design for Storj, get a Raspberry Pi type computer, tape an HDD and an SSD to it, and call it a day, to barely break even if your electricity is cheap.

Once you go into multi-drive-bay server territory, idle power consumption is too high to justify running it just for Storj. And if you are not running it just for Storj, most people would have a redundant pool, because that’s what the industry has moved to: clusters of crappy devices can achieve a much higher reliability-to-cost ratio than single devices.

Watching a 20-minute video, analyzing their methodology, pinpointing flaws, etc., for free just for the sake of argument is a time investment I cannot afford to make. I’m comfortable designing systems based on vendor datasheets.

Why not create one big pool then? That is only a good idea if you only use the server for Storj. Which in itself is not a good idea.

I’m going to make a public prediction: hashstore will be rolled back within the next 6 to 12 months as a failed experiment.