My node was permanently DQ’d because Storj’s hashstore migration corrupted its own files.
The system punished me for Storj’s backend bug — not for downtime, not for hardware failure.
This reflects a wider pattern in how SNOs are treated.
I’m leaving the network.
Why I’m Leaving Storj as a SNO
After years of stable operation, perfect audits, and reliable hardware, my node was disqualified only after the hashstore migration corrupted Storj-generated files (broken log segments, invalid tables, unreadable checkpoints, etc.).
This is the same corruption many SNOs have reported since hashstore rollout.
Despite this, the response pattern was familiar:
“Your hardware.”
“Your filesystem.”
“Delete the corrupted file.”
“Not our bug.”
But the timeline is indisputable: Healthy node → hashstore migration → Storj-written corruption → audit failures → irreversible DQ.
Punishing SNOs for platform-created corruption is not decentralized, fair, or community-driven.
Combined with:
aggressive moderation of criticism
dismissive responses
declining rewards and rising risk
zero recourse for DQ even when caused by Storj’s own backend
a culture of deflecting issues back onto operators
…it’s clear that SNO trust is no longer valued proportionally to the responsibility we carry.
I’ve supported Storj for years, but after being DQ’d by their software and held responsible for their bug, I’m stepping away as a node operator.
Respect to the honest SNOs still trying to help each other.
The community deserves better.
This is your first post on the forum, and we haven’t had any support requests from you so far.
We analyze every case individually. There are no known software bugs in the hashstore backend, but bugs could certainly exist.
Would you like to try to figure out what went wrong in your situation, or perhaps help us find a bug (in which case there is a possibility of resetting the DQ)?
Known issues:
some pieces in the piecestore can be corrupted or have zero size before the migration; during migration to hashstore they become visible, and this may result in disqualification. This case is a time bomb and would likely have led to the same result on piecestore eventually (see the zero-byte scan sketch after this list);
the problem with free space calculation in the allocation when the dedicated-disk feature is not used; the free space is reported to the satellite on every check-in (hourly by default), so the node can fill up before the satellite knows it is full. This is mostly related to the recently increased volume of uploads;
hashstore hashtables can be corrupted by an abrupt shutdown, bad sectors, or bitrot, but they can be recreated with the tool.
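For the first case, it may be worth checking a node for zero-byte pieces before migrating. A minimal sketch, not an official Storj tool; the blobs path and the assumption that every regular file under it is a piece are mine, so adjust for your setup:

```go
// zeroscan: walk a piecestore blobs directory and list zero-byte piece
// files, the kind of pre-existing damage that only becomes visible (and
// auditable) after migration to hashstore.
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

func main() {
	root := "/mnt/storagenode/storage/blobs" // assumed location, adjust
	zero := 0
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			fmt.Fprintf(os.Stderr, "skipping %s: %v\n", path, err)
			return nil // keep walking past unreadable entries
		}
		if d.IsDir() {
			return nil
		}
		info, err := d.Info()
		if err != nil {
			return nil
		}
		if info.Size() == 0 {
			zero++
			fmt.Println(path)
		}
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
	fmt.Printf("found %d zero-byte piece files\n", zero)
}
```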
If you would like to help, please share the full story. We didn’t enable hashstore by default and didn’t even start a migration; only new uploads go to hashstore, and only for some nodes. I’m not sure it’s enabled for all nodes so far. Old pieces are still served from piecestore.
After successfully migrating 18 nodes, with 5-10 TB on each, I didn’t have your experience: no broken logs, no corrupted pieces. At least for me, it worked flawlessly. I don’t have overkill hardware: no SSDs, no RAID, just Synology NASes, Exos drives, the ext4 filesystem, and a UPS. And no USB connections. The drives were all new, the oldest ones from 2021.
The main failure factors, from what I’ve gathered on the forum, are USB connections, faulty SAS interfaces, network drives, and no UPS, leaving aside drives too old to be expected to last.
So the software is pretty reliable from what I can tell; hardware is the main problem. Of course there are occasional bugs, but they are dealt with very quickly. I find the dev and forum teams very responsive to our needs. That, in short, is my experience from 5 years as a SNO.
My experience with a node that has known corruption from a previous HDD failure was that the conversion process simply converted what it found, skipping the zero-byte and corrupted files. Audit stats did not drop during the conversion.
The recent increase in repair work has found many missing/corrupted files and caused a slight drop in audit score, but nowhere near disqualification. Yet.
We tested hashstore migration with PBs of data and on hundreds of nodes before we started using it in the public network. The first hashstore nodes were started more than a year ago.
Active migration hasn’t been forced/started by the satellite so far; we are waiting until all the small (annoying but not critical) bugs are fixed, like the space calculation on the web UI.
I understand that this process (replacing the storage layer) can cause some frustration and adds some complexity, but we believe it’s better for everybody, including SNOs.
In exchange for a small (configurable) write/storage amplification, it can serve far more traffic and puts less stress on the disks (no more walkers!), especially on the usual ext4 filesystem. We would like to make the public network ready for more load and storage, in case a new deal comes in (no promise).
But it’s a long process, and we need to adjust based on what we learn. And as long as it’s software, there can be bugs. We just try to avoid the fatal ones…
I am very grateful for all the feedback from this forum. I believe it’s not only the corruption bugs but also the annoying usability bugs that should be fixed. SNO operation shouldn’t be rocket science; it should be easy for everybody.
If you see any possible data corruption bug, please report it, either on GitHub or here (we have wonderful forum moderators; all problems reported here, plus all ideas, are shared with the devs quickly, and some devs also follow the forum).
The goal of the DQ system is to punish cheaters and reward reliable node operators. We can even talk about undoing a DQ (a hard manual process, but possible) if the root cause of the DQ is proven to be in the software, because the goal is to support SNOs and keep the network stable.
You read “patterns” of responses to hashstore issues. Then you still decided to manually force a migration yourself anyway (because Storj hasn’t pushed it yet). And then you had problems… that you kept silent about and asked for no help with.
And despite being a SNO for years, you then come here talking of dismissive responses you didn’t receive, moderation you didn’t feel, and a culture you never participated in… because this is your first post. And it’s to say you’re leaving?
If you have problems, please tell the community while they have the time to address them. Not after you bail.
Storj treats SNOs well: you can see the node count climbing for years, and this forum is full of helpful posts. I hope you still come check on Storj a couple of times per year: maybe we can win you back!
Yeah, when I read about hashstore and noticed a few posts about problems with it, I disabled it on my node. I’ll let others step on all the rakes, so that maybe I’ll be able to avoid them. At some point my node will have to be migrated to hashstore (as I understand it), but hopefully by then all the bugs will be fixed.
OTOH, I can kind-of sort-of understand the OP. Maybe he had performance problems, read that hashstore would be faster, decided to migrate, ran into problems, and got DQ’d.
But then he should have asked for help here or in support. Then again, most public posts of the type “my node got disqualified” end with the node staying disqualified and the OP being told that the problem was with his hardware or whatever (and that’s likely correct).
With Storj, you have parts of the HDD that were written years ago and rarely read since then. This can hide existing disk problems for a long time. With hashstore, we get a full rewrite on migration and frequent rewrites in normal operation.
I have seen some HDDs dying during or soon after migration as well, but I don’t think this is a Storj problem.
Thanks for the reply.
Just to clarify my situation:
During the hashstore migration, several hashstore segments became corrupted (hash mismatches and missing log segments). Following the recommendations shared on the forum, I removed the corrupted segments so the node could rebuild the structure.
While the hashstore was rebuilding, the node received audits for pieces that were no longer present, which resulted in “file not found” audit failures. That’s what ultimately pushed the audit score below threshold and caused the disqualification.
No hardware issues, no SMR disks, no offline periods — the DQ happened specifically during the rebuild process after deleting corrupted hashstore entries.
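For anyone wondering how fast that goes: below is a rough sketch of the beta-reputation style audit scoring Storj has described publicly. The lambda, weight, threshold, and starting state here are assumptions for illustration, not the satellite’s actual configuration:

```go
// auditsim: simulate how consecutive failed audits drag a node's audit
// score (alpha / (alpha + beta)) below the DQ threshold. All constants
// are assumed values, not the satellite's real configuration.
package main

import "fmt"

func main() {
	const (
		lambda    = 0.95 // assumed forgetting factor
		weight    = 1.0  // assumed audit weight
		threshold = 0.6  // assumed DQ threshold
	)
	// Assumed steady state of a long-healthy node: alpha -> weight/(1-lambda).
	alpha, beta := 20.0, 0.0
	score := func() float64 { return alpha / (alpha + beta) }

	for i := 1; i <= 20; i++ {
		v := -1.0 // a failed audit ("file not found") counts as -1
		alpha = lambda*alpha + weight*(1+v)/2
		beta = lambda*beta + weight*(1-v)/2
		fmt.Printf("failure %2d: score %.4f\n", i, score())
		if score() < threshold {
			fmt.Println("below threshold: node would be disqualified")
			break
		}
	}
}
```

Under those assumed values a healthy node crosses the threshold after roughly ten consecutive failed audits, which is why a rebuild window with many missing pieces is so dangerous.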
I just wanted to share the context so others know what can happen in similar situations.
Thanks for the help and for the time. I’ll be stepping away from running Storj nodes for now, but I appreciate the support from the community.
Unfortunately, I didn’t find such errors either in the forum error codes or in the support tickets. If it is still possible, could you please post the full error message?
Did you try to use a tool to fix metadata, like here:
or does it throw errors during node start?
Also, did you remove hashstore logs or corrupted pieces from blobs?
After what event did the corruption happen? Was there a reboot or shutdown, like a power loss or a system crash?
What was the filesystem?
@elek, do you know: if a hashstore log file is corrupted, can we recover the readable pieces from it?
Also, how could they get corrupted during migration?
This recommendation worked for blobs: Storj will not notice a couple of dozen deleted fragments.
In the case of hashstore, you should never delete files, because each one contains thousands of fragments; there you will definitely be noticed and will receive a well-deserved DQ.
We haven’t seen any corruption caused by the migration. We would need the corrupted log file and the hashtbl to have a look, but I see a higher chance of real disk corruption.
The only known risk is that we don’t use fsync, for performance reasons. The last few pieces in a log file can be zero if the OS doesn’t write them out of the cache before an ungraceful shutdown (which is supposed to be rare).
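To illustrate the trade-off, a minimal sketch (plain Go, not hashstore code):

```go
// fsynctail: data written through the OS page cache is acknowledged before
// it reaches the disk, so the tail of an append-only log can be lost or
// zero-filled after a power loss unless the file is fsync'ed.
package main

import (
	"log"
	"os"
)

func main() {
	f, err := os.OpenFile("pieces.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Fast path (the performance choice described above): the write lands
	// in the page cache and survives a process crash, but not a power loss.
	if _, err := f.Write([]byte("piece payload\n")); err != nil {
		log.Fatal(err)
	}

	// Durable path: Sync forces the data to stable storage, at the cost of
	// a disk round-trip per call.
	if err := f.Sync(); err != nil {
		log.Fatal(err)
	}
}
```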
Jeff is working on an automatic fsck to avoid these problems. The next release will generate “hint” files for each hashtable, including which log files should be checked (instead of checking all of them).
The piece hash is included in the piece headers, so technically it’s possible. We don’t have a tool, and it really depends on the corruption type. If only one or two pieces are affected, it might be easier to just keep the log file as is.
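If someone wanted to experiment, a salvage pass could look roughly like the sketch below. To be clear: the record layout here (4-byte length prefix, 32-byte SHA-256, payload) is invented for illustration and is not the real hashstore log format:

```go
// salvage: scan a log file record by record, keep pieces whose stored hash
// matches the payload, and skip the rest. Hypothetical format, see above.
package main

import (
	"bufio"
	"bytes"
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"io"
	"log"
	"os"
)

func main() {
	f, err := os.Open("corrupted.log")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	r := bufio.NewReader(f)
	kept, dropped := 0, 0
	for {
		var hdr [36]byte // assumed: 4-byte big-endian length + 32-byte hash
		if _, err := io.ReadFull(r, hdr[:]); err != nil {
			break // EOF or truncated tail: stop salvaging
		}
		payload := make([]byte, binary.BigEndian.Uint32(hdr[:4]))
		if _, err := io.ReadFull(r, payload); err != nil {
			break
		}
		sum := sha256.Sum256(payload)
		if bytes.Equal(sum[:], hdr[4:]) {
			kept++ // hash matches: this piece is readable and recoverable
		} else {
			dropped++ // corruption detected: skip this piece
		}
	}
	fmt.Printf("salvaged %d pieces, dropped %d corrupted\n", kept, dropped)
}
```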
I don’t know who maintains the tool to recreate the hashtable, but in my case it very often happened that in the counting step of the recreation script, even though I had deleted the zero-byte files, one or more of the last files were corrupted, so I had to delete them manually and re-run the script. The problem is that if the node is pretty large, the first counting step takes a very long time; in the worst case one of the last files is corrupted and has to be deleted, and then you have to re-run the whole thing. Sometimes counting gets through, but then the compiling step crashes on a corrupted file. It would be great if the script handled such cases by itself, or via a flag passed at start, so you can “fire and forget” on larger nodes. In my case it was early in hashtable creation, so only several GB had been migrated to the hashtable, but I can imagine that a large multi-TB node will take much longer and thus cause more hassle.
I had several cases where corrupted files appeared in the counting part and I had to delete them by hand, even in the second step, after counting was done. I think that if hashstore becomes the main filestore system, the recreation script will need to be more user-friendly and handle most exceptions by itself (something like the skip-and-continue sketch below), since more people will have to recreate things after hard shutdowns, mom unplugging the PC, etc. (not everybody has a UPS).
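What I mean is something like this skip-and-continue pattern (a hypothetical sketch, not the actual recreation tool; countPieces and the log path are stand-ins):

```go
// skipandcontinue: instead of aborting the whole counting step on the
// first corrupted file, record it and keep going, so a multi-TB node
// never needs a full re-run.
package main

import (
	"fmt"
	"path/filepath"
)

// countPieces stands in for the per-file counting work of the real tool.
func countPieces(path string) (int, error) {
	return 0, nil // real logic would parse the log file here
}

func main() {
	files, _ := filepath.Glob("/mnt/storagenode/hashstore/*.log") // assumed layout
	total := 0
	var corrupted []string
	for _, f := range files {
		n, err := countPieces(f)
		if err != nil {
			corrupted = append(corrupted, f) // remember it, don't abort
			continue
		}
		total += n
	}
	fmt.Printf("counted %d pieces; %d corrupted files skipped\n", total, len(corrupted))
	for _, f := range corrupted {
		fmt.Println("  ", f)
	}
}
```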
Personally, I think the hashstore should survive hard shutdowns and moms. The re-generation tool should be a last resort, not an everyday tool.
Hopefully the upcoming fsck tool will help (it records which files are closed and which are actively in use, and on the next startup checks all the active ones…).
I don’t think this is a sane requirement. Attempting to satisfy it would lead to inefficiency (such as flushing all writes immediately, making everything sync) and wouldn’t guarantee anything, because if the filesystem gets corrupted there is nothing Storj can do. Some filesystems tolerate power events better than others, but none is designed to preserve data, only internal consistency.