Massive reads on startup

This won’t matter; fetching data from an SSD does not benefit from caching in this scenario, it’s fast enough. See that 4k IOPS figure: that’s what happens without a cache. SSDs can sustain that forever.

Exactly. The solution is an SSD, not a new filesystem.

Very good point. Which means we should optimize for “random small IO on a massive datastore” in general, not specifically for current use patterns.

This should absolutely not be hard-coded; it should rather run whenever it is appropriate over the course of a day. As I have suggested, if the files that get moved into trash are recorded in a database together with their modification times, any future deletion day can be calculated from that. Then it is only one database query, once a day, to find which files to delete, and that list can be handed to a low-priority background task that performs the final deletion over the course of the day.
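
A minimal sketch of what I mean; the names and the in-memory index are made up for illustration, a real node would of course keep this in its database:

```go
package main

import (
	"fmt"
	"time"
)

// trashEntry records when a file was moved to trash; the stored
// modification time is all that is needed to compute its deletion day.
type trashEntry struct {
	path      string
	trashedAt time.Time
}

// dueForDeletion is the once-a-day "query": it returns every entry whose
// retention period has elapsed. With a real database this would be a
// single indexed SELECT instead of a scan.
func dueForDeletion(index []trashEntry, retention time.Duration, now time.Time) []trashEntry {
	var due []trashEntry
	for _, e := range index {
		if now.Sub(e.trashedAt) >= retention {
			due = append(due, e)
		}
	}
	return due
}

func main() {
	retention := 7 * 24 * time.Hour // ideally an operator-configurable setting, not hard-coded
	index := []trashEntry{
		{path: "trash/aa/old-piece", trashedAt: time.Now().Add(-8 * 24 * time.Hour)},
		{path: "trash/ab/new-piece", trashedAt: time.Now().Add(-1 * time.Hour)},
	}
	for _, e := range dueForDeletion(index, retention, time.Now()) {
		// hand these off to a low-priority background worker that spreads
		// the actual unlink calls over the course of the day
		fmt.Println("schedule for deletion:", e.path)
	}
}
```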

Hard-coding this also sounds weird. My gut feeling tells me this should be a setting controlled by the satellite operator, as they know what promises they have made to their customers about deletion; it might even be a regulatory issue (GDPR), and it also depends on the technical competence of the satellite operator (some might need a longer period to safeguard their data).


But scanning is not the use case that needs optimizing. If a metadata lookup requires HDD IO, it will be slow, regardless of how efficiently or how well it is packed.

Keeping metadata in RAM is not a goal. Providing fast access to metadata can be accomplished with an SSD (cache or special device).

As for why not, I’d say there is no point in running a node that does not win races.

For this specific use case, yes, maybe, at the huge cost of designing a new filesystem; or you could just add a larger SSD :slight_smile:

No, everything is default. I did not touch the config file beyond setting the storage path, the size, and the URL.

This is not the case with ARC (and most other cache solutions, including the one Synology was using): sequential IO would bypass it, and it won’t be a problem when reading from an SSD in the first place.

My whole point is that accelerating random IO with an SSD is the cheap solution here, not a filesystem redesign or tweaking for specific use cases. This can be done with a block-level cache on a variety of filesystems; ZFS just happens to provide finer controls in the form of the special device, is all.

What if you can’t use an SSD? My 2-bay NASes don’t provide this possibility. An optimised FS would be great if it is made to optimise the workload on an HDD with limited RAM and no additional SSDs.
When you build systems for a specific scenario, optimising the components, hardware and software, for that specific scenario could prove very beneficial in the end, but each variant has pros and cons. For example, the FS in a system built for Storj nodes:

  1. using an existing FS:
  • is already available, no material investment in development.
  • is more or less widely adopted, used and understood, with many specialists who know its specifics.
  • the Storj developers and node operators use the same thing, in testing and production, so bugs are quickly squashed.
  • is built for general use cases, so it is not optimised for storagenodes, or can only be tuned through a limited set of settings.
  • could hit limits that no one can surpass with tuning, when running bigger storagenodes.
  2. building a dedicated FS:
  • optimal for Storj, but not so much for general use.
  • needs time and effort from the system builders, OS devs, etc. to be adopted, if ever…
  • specialists would be fewer in the beginning.
  • material incentives could be close to 0; the ones most interested in it, the SNOs, would not pay for it, I’m pretty sure.

But it’s fine if it is slow for data that is not frequently accessed, like backups that are never retrieved. So why bother caching it?

With the recent payment proposals, providing fast downloads will not be as incentivised. Besides, there were a lot of voices stating that they can be profitable even from the storage revenue alone.

You probably missed it, but I wrote above:

so it’s not like I deny it. All I’m stating is that there is an opportunity to do even better than that…

I assume that if/when there are operators providing tens of petabytes of storage, the savings on RAM/SSD cost might make it viable to actually work towards a bespoke file system. Or if Storj Inc. sees the value in having thousands of non-professionally operated nodes with no caching solutions be faster. Or if there is a hobbyist willing to put in the effort just for recognition. For example, I think it could be a nice final-year project for a university student…

So, essentially, we mostly agree and differ only in semantics, we just use different words: what you call a solution, I call an acceptable workaround.

Right now, you have several choices. Replace your NAS with something that allows you to add an SSD, add a USB-based SSD, expand your RAM, use a file system that is light on metadata (like a carefully set up ext4; as I wrote above, it was fine for me to run 25 TB worth of nodes on a 16 GB RAM machine), put effort towards building a custom solution, or accept the fact that you will be losing races and settle for just the storage revenue.


I think these lost races are overrated. Some real-life data:
A. Machine 1, Synology, 1 GB RAM, 2 HDDs, ext4, both nodes still filling: total EOM 9.25 TB in March, added 1.58 TB, disk average 8.46 TB, egress download 1.009 TB, earnings $34.09.
B. Machine 2, Synology, 18 GB RAM, 1 HDD, ext4, node still filling: total EOM 9.97 TB in March, added 1.59 TB, disk average 9.37 TB, egress download 1.040 TB, earnings $36.32.
So, as far as I can tell, there is no significant difference between the 1 GB and 18 GB RAM machines, with no SSD and the ext4 FS.
They both lose and win races, and in the long run there is somehow a compensation between lost and won races, keeping performance equal for both.
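
Per terabyte stored, they come out almost the same; a quick back-of-the-envelope check with the numbers above (just a throwaway sketch):

```go
package main

import "fmt"

func main() {
	// March figures from the two machines above.
	machines := []struct {
		name     string
		avgTB    float64 // average TB stored over the month
		earnings float64 // USD
	}{
		{"machine 1 (1 GB RAM, 2 HDDs)", 8.46, 34.09},
		{"machine 2 (18 GB RAM, 1 HDD)", 9.37, 36.32},
	}
	for _, m := range machines {
		fmt.Printf("%s: %.2f $/TB\n", m.name, m.earnings/m.avgTB)
	}
	// prints roughly 4.03 $/TB vs 3.88 $/TB: essentially the same
}
```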

Well, they are special small-block devices… so yes, I write 512b to get more room… I don’t see much unexpected wear… I’m seeing anywhere from 10 MB/s to 100 MB/s, but my pool also hosts a lot of nodes, so nothing unexpected there.
Yes, my other device is also an Intel DC series, a DC P3600 1.6TB with 44 PBW of life… I switched to that because of the extra wear, and it’s a champ…

My other SSD is an old swap disk… it lost half its life in a bad week of server swapping before it became a special vdev, so I can’t really blame Storj for it… also, I think I made the mistake of configuring it to do atomic writes because it sounded good in concept… but it wasn’t, lol.

I got a replacement DC P4600 3.2TB that will replace my worn-out SSD in the near future.
It’s finally full, so writes have also been slowing down… lol, it is getting more balanced between the multiple special vdevs.

I try not to use it… it can lead to file loss / corruption on power loss.

Pretty much all NAND flash devices have 4K blocks. Some of them pretend to be 512e (e = emulation) and attempt, within reason, to batch requests so they can pretend to have sector sizes like old hard drives, while at the same time trying to avoid read-modify-write on the flash. However, this does not change physics, and unless data is streamed more or less sequentially, every 512-byte write ends up being a 4K read, a 512-byte replace, and a 4K write. This is called write amplification, and while all your OS tools show you the apparent write rate, you are using up the write endurance of your SSD 8x faster for no benefit in return.
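
To put a number on that worst case, a minimal sketch (assuming the firmware cannot merge neighbouring 512-byte writes into one page update):

```go
package main

import "fmt"

func main() {
	const (
		logicalWriteBytes = 512  // what the OS thinks it wrote
		flashPageBytes    = 4096 // what the NAND actually has to program
	)
	// An isolated 512-byte random write forces the drive to read a 4K page,
	// patch 512 bytes of it, and program the full 4K page again.
	amplification := float64(flashPageBytes) / float64(logicalWriteBytes)
	fmt.Printf("worst-case write amplification: %.0fx\n", amplification) // 8x
}
```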

Setting the correct ashift does not waste space; it just tells ZFS how to batch requests for best performance, and it also so happens that this results in less wear.

Yes, it can, but Storj does not care, and in return you get a massive performance boost. Also, power loss almost never happens, so this is hardly an issue to begin with…

I do run ashift=12 on my pool, mostly for HDD cross-compatibility between 512B, 512e and 4Kn.

The special vdev is there to take blocks below 4K in size, so running them at 4K would mean they would get filled much, much faster, since the majority of the files are like 1 KB in size until they grow and are moved to the HDDs instead.

Also, ZFS actually writes 8K; compression just makes it often fit in a 4K sector.
ashift just defines the minimum size of its sectors; it can also write 8K.

The special vdevs can be individually configured with an ashift different from the native pool ashift, for better capacity usage of the SSDs.

So I fully agree with what you say; I understand it and have it handled…

The disk that is dying is just not made for what I’m using it for; plus, I made some bad configuration choices for it, and it has been doing a lot of internal writes handling its data, which causes excessive wear.

The bad special vdev PCIe SSD is a 2012 design, so basically a first-generation PCIe-based SSD.
My DC P3600 is still only at 1% wear after nearly a year, so it’s all good.

I have no unexpected write amplification.
Pretty sure it’s mostly internal data being rewritten on the bad SSD due to me setting it to atomic writes… whoops :smiley:

And I haven’t had a replacement NVMe SSD to fix it.

