[Tech Preview] Hashstore backend for storage nodes

Do you remember that one big customer we simulated on SLC? If not, here is a short recap. That customer uploads a high amount of small and medium size files with a short TTL. It turns out storage nodes can handle the initial uploads, but things get expensive when the TTL kicks in. The storage nodes simply need too much time to delete such a high number of pieces.

The solution for that problem is the new hashstore backend. Instead of storing every piece as its own file on disk, the hashstore backend combines them into 1GB LOG files and keeps track of each piece's position with a hashtable. Pieces with a similar TTL will be stored in the same LOG file.

Before I tell you how to enable it, let's start with a fair warning. The hashstore backend comes with good test coverage, but that doesn't mean it is free of bugs. Last weekend several of my storage nodes got into a situation that would have gotten them disqualified. I was just lucky to notice it in time. For now please don't risk all of your nodes. Maybe start with a smaller test node that you can afford to lose. Let's build up some experience first before we start to migrate all nodes. Also feel free to ask any questions while we are at it.

Ok, now that you know the risks, here is how you can enable it. First you need the new v1.119 release. The first time that new storage node version gets a request from a satellite, it creates a hashstore folder for that satellite and a bunch of config files. By default all migrations are disabled and the new hashstore folders will stay empty while the storage node keeps writing new pieces to the old piecestore backend. So not a lot of hashstore activity unless you enable it. Here is my way to enable it (to prevent copy-paste mistakes this code snippet keeps all migrations disabled):

# Change this path to match your storage node location
cd /mnt/sn1/storagenode/storage/hashstore/meta/

# Passive migration flags (requires version v1.119). Kept at false here on purpose; set to true to enable.
# WriteToNew will send all incoming uploads to the new hashstore backend
# TTLToNew will send only uploads with a TTL to the new hashstore backend
# ReadNewFirst will migrate any piece that gets hit by a download request
# Not sure what PassiveMigrate does. By the time you think about this one you most likely have them all set to true anyway.
echo '{"PassiveMigrate":false,"WriteToNew":false,"ReadNewFirst":false,"TTLToNew":false}' > 121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6.migrate
echo '{"PassiveMigrate":false,"WriteToNew":false,"ReadNewFirst":false,"TTLToNew":false}' > 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S.migrate
echo '{"PassiveMigrate":false,"WriteToNew":false,"ReadNewFirst":false,"TTLToNew":false}' > 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs.migrate
echo '{"PassiveMigrate":false,"WriteToNew":false,"ReadNewFirst":false,"TTLToNew":false}' > 1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE.migrate

# Active migration chore (requires version v1.120). Kept at 'false' here on purpose; write 'true' to enable.
# The active migration will take multiple days with high CPU time.
# You can enable it for all satellites at the same time.
# The migration will run through them one by one.
echo -n 'false' > 121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6.migrate_chore
echo -n 'false' > 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S.migrate_chore
echo -n 'false' > 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs.migrate_chore
echo -n 'false' > 1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE.migrate_chore
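
For reference, actually turning things on just means writing the same files with the flags set to true. A minimal sketch for a single satellite, reusing the first satellite ID from the snippet above (which flags you flip, and on which node, is your own choice and your own risk):

# Passive migration on (v1.119+)
echo '{"PassiveMigrate":true,"WriteToNew":true,"ReadNewFirst":true,"TTLToNew":true}' > 121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6.migrate

# Active migration chore on (v1.120+)
echo -n 'true' > 121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6.migrate_chore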

Thank you so much for that.
Interesting changes afoot!

Much like the Badger Cache, I’m steering well clear of this until it’s ready for production as I wouldn’t be knowledgeable enough to debug any errors.

Good luck to the intrepid explorers! :wink:


But wait! Didn’t you hear that you can tune your node…

…and risk disqualification…

…to handle a load Storj has only experienced for about 2 months…

…out of years of operation…

…and isn’t happening now…

…and may never happen again?

Where’s your sense of adventure?!?!? :wink:


First hashstore node up and running. No risk no fun… :laughing:

  1. When the migration is on, are the migrated pieces deleted from the old piecestore at the same time? Or at the end, after the migration is finished for all the pieces?
  2. If the deletion happens at the same time, is there a recovery mechanism back to the old piecestore if something goes wrong?
  3. If the deletion happens at the end, you end up with 2 copies of each piece and could run out of storage space - I believe this is not the chosen way.
  4. Is there a test for a successful migration?
  5. How big can the hashtable, or whatever runs in RAM, get for a 20TB node? If there is not enough RAM, would running the hashtable from disk introduce big performance penalties?
  6. Does it need more IOPS on disk or less than the old (current) piecestore?

I’m thinking the steps would be: copy the piece from the piecestore to a log file, make the hash index thingy (I struggle to understand this technical stuff :slight_smile: ), test the hash and the success of the migration, delete the piece from the old piecestore, tell the Migration Manager that the migration succeeded and the deletion finished, and move to the next piece.

I would recommend limiting the migration somehow to 10% of storage nodes for at least a month, just to see the production problems.
For SNOs, maybe migrate one satellite at a time, and watch it for a month or so, before going to the next one. SL > AP1 > EU1 > US1 would be advisable.

yes

no, it’s a one-way ticket

Not tested. Perhaps there are bugs. If you find them, please report them, preferably as a GitHub issue with steps to reproduce. However, a forum report is ok too.

It should be there, but it’s unlikely that you can run it manually.

Not enough stats so far. From what I have seen, it uses less RAM and puts much less load on the disk (except maybe during compaction, but that’s not fully tested yet, so please report).

It needs significantly fewer IOPS, that’s the goal :smiley:

We do not plan to migrate nodes forcibly. So every SNO can decide for themselves. It’s not easy to enable for an average Joe.

:+1:


Yep, implemented like that. If you want to look it up, the first interesting line might be this: https://review.dev.storj.io/c/storj/storj/+/15388/20/storagenode/piecemigrate/chore.go#273

Oh yes, there are lots of unit tests. That doesn’t change the risk that there might be edge cases that aren’t covered by tests. So it’s better to assume there are no tests at all and just verify everything ourselves. Let’s start with a small test node and see how that goes.

Additional notes: hashstore has some write and storage amplification by design.

Compaction may not happen for log files which store less than 25% garbage. So you may store 0-25% unpaid data, in addition to the trashed data…
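
To make the 25% rule concrete, here is a toy shell calculation with made-up numbers (only the 1GB log size comes from the original post; the garbage figure is invented for illustration):

# Toy example: a 1GB log file currently holding 200MB of garbage
log_size=$((1024 * 1024 * 1024))
garbage=$((200 * 1024 * 1024))
pct=$((garbage * 100 / log_size))
echo "garbage: ${pct}%"
if [ "$pct" -ge 25 ]; then
    echo "eligible for compaction"
else
    echo "below the 25% threshold: the garbage stays on disk, unpaid, for now"
fi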

I guess these thresholds are configurable?

I may be using incorrect assumptions… but an average SNO seems to normally see 5-10% trash steady-state, and 0-10% unpaid-left-from-bloom-filters… so say up to 20% of a full HDD could be unpaid (which is fine: it has been like that for a long time).

If so, then a hashstore config on the same full HDD could mean an additional 0-25% lost to logs that haven’t hit their compaction limit yet? Which could mean in the worst case 20-45% unpaid?

Then it wouldn’t be unfair to say hashstore could mean up to double the unpaid space (compared to how we store data today)?

Who cares? A new 24 TB disk will die anyway long before it ever gets filled. :money_mouth_face:


This is bad news… but why is it 0-25%? Why not a well-known percentage? Isn’t it configurable?
Trading space for walker speed is not a preferable option for me.

It probably doesn’t run compaction on logs until they hit 25% junk. So every log is probably slowly climbing up to 25%… gets cleaned to 0%… then starts again. (just a guess)

@zeebo An idea. Could we introduce an additional parameter that triggers compaction when the total size of garbage across all log files exceeds a configurable percentage of the free space available on the node? When this condition is met, compaction could prioritize the log files with the highest proportion of garbage until the total garbage size drops below this threshold.

This would allow nodes with plenty of free space to avoid unnecessary compaction while enabling nodes nearing capacity to recover additional space more aggressively.
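
Roughly what I mean, as a toy sketch with made-up numbers (the names and the 10% value are just placeholders for a hypothetical configurable parameter; nothing here exists in the storagenode today):

# Hypothetical trigger: total garbage vs. free space on the node
total_garbage_gb=150      # garbage summed across all log files
free_space_gb=1000        # free space left on the node
trigger_pct=10            # hypothetical configurable percentage
if [ $((total_garbage_gb * 100 / free_space_gb)) -ge "$trigger_pct" ]; then
    echo "compact now, starting with the logs holding the highest share of garbage"
else
    echo "plenty of free space left, no need to compact yet"
fi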


Hah… ahem,
I mean:

who would have thought.

“Run a full filewalker”,
“restart a node”,
or
“if You can’t keep up, its not for You”

I don’t know if it’s relevant enough, but
I just want to say that after many months of experimenting I have come to a solution, in a Windows environment, for how to make a node FAST at managing small files under Storj’s service.
I could run a test node to show it off, if you need.
It’s simple, no need for any caches, but I haven’t had a chance to try it during the times when the big testing took place here, because I would have had to change my mothership machine’s OS. But I’m pretty sure I can make a Windows node that is able to count or delete small 2KB files fast on a 16TB 7200 rpm disk or an old 4TB 5200 drive, whatever. Like deleting 6TB in 1h max or something; I don’t know, a test would show exactly. It’s a big deal for me because you don’t need SpaceX CPUs or huge amounts of RAM to do so. Also an easy Windows GUI and no time wasted typing Linux commands over and over again… it just wasn’t my personal priority to continue the work. All it needs is a modern OS, no cache, no mumbo-jumbo expensive hardware, and modern virtualization software. Yes, it could be super fast on VMs. And I mean there was no software in 2019-2022 to enable this, only since now, I guess. I’m not even sure if Linux is superior to Windows in this matter; in fact I was reading and testing so much that I already forgot some of my conclusions about Linux (no documentation or notes, as it was a pure amateur project to speed up my nodes over 2022-2023).
If you need, I can run a special node, on some v1.115 or whatever v1.117 version, all on a 10-15 year old machine. You upload me like 10TB of small test files, then delete 5TB and we will see.

No, it’s instead, not on top. This threshold only applies to hashstore; if you still have piecestore storage, that part has its own overhead.
So it’s not a plus: it’s more like one half has 20% and the other half has 25%, so you need to take an average, i.e. on average it would be (20% + 25%) / 2 = 22.5% for a half-migrated node.

You may do so:

or

The same benchmark exists for the hashstore backend too (you need to run it with the -backend hashstore option).
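
For example (assuming the benchmark referred to here is the piecestore-benchmark tool from the storj repository; the binary name is an assumption, only the -backend option is certain from this post):

./piecestore-benchmark -backend hashstore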

How are the different types of pieces registered in this new approach?
I mean stored pieces, deleted pieces, trashed pieces?
Is there a trash log where the trashed ones are copied to, and then deleted from the main logs?
Or do you use a marker for each type, and the pieces stay in place in the main logs?

Another thing that pops into my mind… if you have an unsafe shutdown and the file system gets corrupted, you can lose the entire stored data, because instead of having millions of files from which you could lose a small percentage, you now have a few log files that can be lost instantly if the fs is damaged. Maybe some fs backup should be put in place…

You need to read a blueprint: https://review.dev.storj.io/c/storj/storj/+/14910/25

and perhaps all others… https://review.dev.storj.io/q/hashstore

So will it be possible to choose the store size? Because a different size would be logical for every HDD size.