This is not a typical use case: the intent is to share capacity that is already online, and virtually all online capacity today is on redundant arrays. There is no use case for standalone disks anymore, so you must have brought up that capacity just for Storj, in contradiction with the project's motivation.
My man, c’mon…
…not only is that not true in any general sense (“virtually all”, really?): it’s definitely not true for the SNOs participating in this project, who’ll be part of the migration.
I agree this is just a software upgrade, no big deal. No reason for flo82 to get Greta Thunberg on the phone and start causing trouble…
Not necessarily. I have a home “server” which was a media server, game server, and my lab. It didn’t have redundant storage; for that I use Storj. So it has several half-used disks for my own stuff, and all remaining space is shared with Storj. But they are single disks on NTFS (yes, I know, but this hardware is incompatible with Linux; I haven’t tried BSD though).
However, I should confirm that one of my nodes was built specifically for Storj: a Raspberry Pi 3B. But I got 6x ROI from it before it stopped booting (the SD card failed, and it’s in a remote location, so I can’t fix it at the moment).
For the migration, some kind of progress information would be nice. We know how many pieces a node is storing, and it should be possible to count the number of successfully migrated pieces. This could be displayed, or even a percentage could be calculated.
You can measure the blobs store and the hashstore; no code modification is needed.
You would also have information in the logs and on the debug port.
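To illustrate that kind of measurement, here is a small shell sketch. The directory names (`blobs/` for piecestore, `hashstore/` for the new backend, both under the node's storage directory) are assumptions based on the default storagenode layout, and the size ratio is only a rough proxy for migration progress:

```shell
#!/bin/sh
# Hypothetical progress estimate: compare the on-disk size of the old
# piecestore data (blobs/) with the new hashstore/ directory.
# Directory names are assumptions from the default storagenode layout.
migration_progress() {
    storage="$1"
    blobs_kb=$(du -sk "$storage/blobs" 2>/dev/null | cut -f1)
    hash_kb=$(du -sk "$storage/hashstore" 2>/dev/null | cut -f1)
    total_kb=$(( ${blobs_kb:-0} + ${hash_kb:-0} ))
    if [ "$total_kb" -gt 0 ]; then
        # integer percentage of bytes already living in the hashstore
        echo "migrated: $(( 100 * ${hash_kb:-0} / total_kb ))%"
    else
        echo "no data found under $storage"
    fi
}
```

Usage would be something like `migration_progress /mnt/storagenode/storage`. It only looks at byte totals, not piece counts, so it is a ballpark rather than the exact per-piece counter discussed above.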
But of course we would accept a PR from the community.
That tool to repair hashtables, and whatever else, should be included in the storagenode software and started automatically by the node whenever it is needed. The node should run as autonomously as possible, without operators having to keep pampering it. Imagine you are on holiday: you return and your node is gone because it doesn’t know how to heal itself.
I rent my space to Storj, Storj can do whatever it wants with it, including maintaining and repairing it.
The node should run as autonomously as possible, without operators having to keep pampering it.
There are autohealing methods implemented in hashstore (like truncating back to the last valid position).
Manual repair is required in case of hardware failures, when manual intervention is expected (e.g. replacing your hard disk).
You’re not a customer, though. You’re letting Storj have some of your resources for them to use in a way that advantages both you and them…
With respect, they know better what works and what doesn’t and this is the way Storj has decided to go. The reasoning has been very thoroughly explained in this announcement. We don’t have the knowledge they do and ultimately we don’t get to decide Storj’s technical roadmap.
Why are so many people afraid of the change? If this was tested on the Select network under high load, I think they know what they are doing. I don’t think they would roll this out if it caused trouble. This is something I know from old-timers: they don’t want anything to change “because it works”. But they don’t want to see that it could possibly improve things.
Everything has its ups and downs, and if Storj says that hashstore is better, I trust them. They are the ones giving me money for literally doing nothing. And I like the speed of hashstore, because the IO load is much better than before. At least for me.
I too welcome change if it’s for the better, and I’m not afraid this upgrade could break something. It’s been tested and it works… BUT… on the nodes I migrated to hashstore, data accumulates 3x slower than on the nodes with piecestore, so I want to delay it until I have time to do some more tests. I will migrate at my own pace, one node at a time, to see if it’s a general problem or just those old nodes.
We are afraid of the change because we are old and grumpy, not because a bad sector can corrupt a 1GB file instead of a 1KB file.
Now that we have assurances that careful design considerations were taken to prevent such corruption (within reason), and that the tables can be rebuilt instead of the node being lost because one of them was corrupted with no tool available to rebuild it, we’ve stopped being old and grumpy. We are now young and full of thirst for adventure.
While hashstore will potentially rewrite the direct piece data more than piecestore does, you have to understand that piecestore also causes many disk writes and rewrites every time a directory inode changes (pieces are added or removed). Filesystem updates are not free, and in some cases (especially with large directories) they can be quite costly. In the general case, I actually expect that hashstore will cause fewer disk writes overall once you account for filesystem inode writes and other filesystem bookkeeping. Over time, in hashstore, pieces get compacted into relatively stable logs that don’t change much and won’t be rewritten much. Depending on the customer workload, this may happen immediately. This is exactly why we have expiring data go into separate log files: we want to store long-lived data and short-lived data separately.
Also note that, because the format the data is stored in is much easier to query for reads or for usage statistics (and does not require regular walks over the filesystem), the wear and use of the drive will be substantially lower.
Good call-out @Toyoo! (Also, thank you for your earlier designs and plans; we definitely considered them!)
To be completely frank, the lack of timeline is because we just don’t know! Our Select rollout took a long time, largely because it was the first rollout and we had to stop and readjust multiple times. This rollout is starting from a much more robust starting point.
Certainly. Just to give a ballpark, we expect the first rollout, getting all new data written to hashstore, to take weeks to a month. We plan to do it roughly like a version rollout, which takes two weeks and uses a doubling number of nodes per rollout step, but per Satellite group. We’ll probably do saltlake first, then eu1 and ap1 together(?), then us1. But we might pause depending on forum feedback, telemetry feedback, etc. Realistically we’re expecting the full rollout to take a month or two, just for new data. Migrating all existing data is going to take a lot longer.
That said, I might be surprised and all our planning and prep might pay off and this will go so smoothly it won’t make sense to wait so much.
We will make forum posts from time to time to share the progress of the rollout and how far we are.
I can see satellite directories being created in the hashstore directory, but I haven’t yet got around to changing the hashtable location (to put them on SSD). Will there be any instructions on how to move the tables (and what actually needs to be moved) to SSD, in case someone missed the opportunity to declare the config variable for them?
(i.e. for databases: move *.db from storage/ to the SSD and change this config variable)
I think it’s safe enough to shut down the node, move the “meta” folders (there can be many within the hashstore, so make sure to find all of them) into the folder you want, keeping the same tree substructure, set the config to point to that folder, and restart. Once you restart, make sure the node didn’t wake up, decide the hashstore was empty, and put new empty tables somewhere else (just to double-check you didn’t misconfigure it). If it did, just shut down, delete the new empty folders, fix your config, and start again.
Also, just so you know, you don’t need to use the config options. You can also bind mount the hash table directories elsewhere just fine, so same deal there: shut down, copy/move the existing data to your new location, and bind mount the new location over the old. Note that there are actually many hash tables: each hashstore has two (s0 and s1), and there is a hashstore per Satellite.
Now that I’m writing this all out, perhaps we should make a tool.
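For anyone wanting to script those steps, here is a rough sketch of the copy part. The per-satellite `s0`/`s1`/`meta` layout is as described above, but all paths are examples; stop the node first, and verify the result before pointing the node at it:

```shell
#!/bin/sh
# Copy every per-satellite s0/s1 "meta" subtree from the hashstore
# directory ($1) to a new location ($2), preserving relative paths.
# Run only while the node is stopped; afterwards point
# hashstore.table-path (or a bind mount) at the destination.
copy_meta_dirs() {
    src="$1"; dst="$2"
    ( cd "$src" && find . -type d -path '*/s[01]/meta' ) |
    while read -r d; do
        mkdir -p "$dst/$d"
        cp -a "$src/$d/." "$dst/$d/"
    done
}
```

Usage would be along the lines of `copy_meta_dirs /mnt/storagenode/storage/hashstore /mnt/ssd/hashtables`, followed by updating the config and restarting, then checking the node didn't create fresh empty tables elsewhere.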
I like how you did it with the database location: a config param that names the active location:
storage2.database-dir: "dbs"
…then a mount with the same name:
volumes:
- /path/to/somewhere:/app/dbs
The same thing for the meta folder would be great!
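Something similar might already be possible with the hashstore.table-path option quoted further down in the thread. An untested sketch (the in-container mount target is an example, and using an absolute path sidesteps the question of what the relative default resolves against):

```yaml
# config.yaml — hashstore.table-path exists; the absolute path is an example
hashstore.table-path: "/app/tables"

# docker-compose.yml
volumes:
  - /mnt/ssd/node1/tables:/app/tables
```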
Does the “meta” include the top one that’s outside the satellite subdirectories? (the one that contains the files to enable hashstore migrations)
EDIT: answering my own question: no, the top one (hashstore/meta) does not need to be moved, only the hashstore/(satellite)/s*/meta ones.
# path to store tables in. Can be same as LogsPath, as subdirectories are used (by default, it's relative to the storage directory)
# hashstore.table-path: hashstore
hashstore.table-path: "/mnt/somewhere/node1/hashstore"
should do it
Appreciate the compliments. One of my oldest nodes is dated 3/31/20. I don’t think I was one of the early ones, but I jumped on V3 as soon as I could. It’s spanned at least 3 different CPUs, OS configurations, and software RAID setups, including migrations. I’m one of those who ended up with 128GB ECC RAM and ZFS with a SLOG, and it’s been running great. I’m very proud to say I’ve never had a single node fail, and Storj is the catalyst that propelled me to dive deep into Docker and container sysops.
Throughout my career I’ve had numerous performance issues with “many small files” vs. “fewer large files”, so it’s definitely great to hear of this shift from IOPS-bound to CPU-bound. I feel this is key to unlocking the network’s ability to scale exponentially.
I’ll also be back in the job market soon, if anyone is hiring an IT systems operations specialist who is willing to pivot. I’m learning n8n, GitOps, Ansible, and Terraform to make my infrastructure as resilient and automated as you all have made the Storj nodes!
Filesystems like ext4 can usually handle a power failure during a write; the directory structure remains intact. Let’s say I am copying a bunch of files from one drive to another. If the power fails or the system crashes, some of the files won’t have been copied and some may be partially copied, but any problems will be limited to the files being copied. No old file on the destination drive should disappear.
Maybe, but the new system would likely not be significantly faster on my setup (though I admit my setup is probably not very common).
However, the current filesystem walk is mostly read-only, so it can be cached. The new system will do periodic writes (copying from one big file to another). I am also not looking forward to copying 17TB of data from the old system to the new one.
Honestly, if it works, then great. It’s just that when I read about the new system I get flashbacks to Storj v2 and its storage method. It may be an unfair comparison, but I can’t shake it off.
I hope there won’t be problems with me being the last one to start using the new system.