Ability to recover part of lost data / restore the node from backup

You might be right that checksumming is not really needed for a storage node and that ZFS is therefore unnecessary overhead for a node.
However, there are other advantages too, like how ZFS caches data and metadata in RAM (ARC) and on an L2ARC device, and how a SLOG handles synchronous writes. For me it makes a big difference both during normal operation and during the long filewalker runs. The caches are used extensively and improve the responsiveness of my drives (some people mention that their hard drives slow down a lot during the filewalker process; I have 3 nodes on the same array, no problem). The SLOG is also great for database access because it absorbs the databases' small, random synchronous writes, which would otherwise hurt HDD performance and cause fragmentation.
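To get a feel for how much of the filewalker cost is metadata access (and therefore how much ARC/L2ARC caching can help), here is a minimal sketch in Python; the blobs path is a placeholder for wherever your node stores its pieces. Run it twice: the second, cache-warm pass is usually dramatically faster on ZFS.

```python
import os
import time

# Placeholder path; point this at your node's piece storage.
BLOBS_DIR = "/mnt/storj/storage/blobs"

def walk_metadata(root: str):
    """Stat every piece file, roughly what the used-space filewalker does."""
    files = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total_bytes += os.stat(os.path.join(dirpath, name)).st_size
                files += 1
            except OSError:
                pass  # piece removed mid-walk; ignore
    return files, total_bytes

start = time.monotonic()
count, total = walk_metadata(BLOBS_DIR)
elapsed = time.monotonic() - start
print(f"{count} pieces, {total / 1e9:.1f} GB, walked in {elapsed:.0f} s")
```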

If you use ZFS anyway for your own files, then there is not much overhead. But if you install it only for storage nodes and have to be careful about resource consumption, then it certainly isn't for you.

Sidenote about zfs: I had 2 file corruptions that my scrub found, so I deleted those files :smiley:

Well, maybe they expect Storj hard drives to die after 5 years :smiley: then the time factor is kind of constant xD

I think that checksums are needed. I mean, I would not want the data in my node to get corrupted over time without me knowing. ZFS (as packaged on most distros) runs a scrub once a month by default, so if there are problems, they would hopefully be found sooner and the data could be restored from the other drives (RAID6).

However, that only solves part of the problem: silent corruption on the hard drives. There is another potential problem where I would want local checksums for the Storj files. Say, after a crash or some change, I would want to make sure that all files my node is supposed to have are still there. The underlying ZFS can be OK, but maybe the node reported that it had saved a bunch of files that it actually had not saved before the crash. So something like a "zfs scrub" at the node level would be useful in this case: read all files, check that they are present and not corrupt. Even just knowing the presence and size of each file would probably be enough, as corruption of the contents is less likely with ZFS.
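As a rough sketch of such a presence/size check (Python; the blobs path and snapshot file are placeholders, and keep in mind pieces are also deleted legitimately by garbage collection, so a missing piece is only a hint to investigate, not proof of loss):

```python
import json
import os
import sys

# Placeholder paths; adjust for your node layout.
BLOBS_DIR = "/mnt/storj/storage/blobs"
SNAPSHOT = "/var/lib/storj-check/pieces.json"

def scan(root: str) -> dict:
    """Map relative piece path -> size in bytes."""
    sizes = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                sizes[os.path.relpath(path, root)] = os.path.getsize(path)
            except OSError:
                pass  # piece deleted while scanning
    return sizes

current = scan(BLOBS_DIR)

if os.path.exists(SNAPSHOT):
    with open(SNAPSHOT) as f:
        previous = json.load(f)
    missing = [p for p in previous if p not in current]
    # Pieces are written once and never modified, so a shrunk file is suspicious.
    shrunk = [p for p in previous if p in current and current[p] < previous[p]]
    print(f"{len(missing)} pieces gone, {len(shrunk)} pieces shrunk since last snapshot")
    if shrunk:
        sys.exit(1)

os.makedirs(os.path.dirname(SNAPSHOT), exist_ok=True)
with open(SNAPSHOT, "w") as f:
    json.dump(current, f)
```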

Another point is data stability: the configuration and the Storj program itself stay stable and thus don't suffer bit flips or errors that cause instability and such.

I really like that I can start the program and then 6 months later it's still running without a single problem…

Granted, Storj doesn't really support that anymore because of the continual updates, but still… it's nice to not have to think about it too often.

I've had a few systems run for years; usually that's pretty tricky and requires some maintenance, but with ZFS and Debian it's basically just turning the system on whenever, lol, and it will most likely run for years and years…

I don't think you understand the problem. It doesn't matter what makes the node less reliable or how those issues arise. When it hits a threshold it gets disqualified. This ensures that problems DON'T get worse over time, because problem nodes get ejected.

I was mentioning the risk of repair being unsuccessful. Since my new nodes didn't see repair in the first 2 months and barely any repair for 2 months more, you can consider this a risk per quarter if I'm being conservative. Really doesn't matter if the numbers are this low though.

No, you don't. You have to audit and verify that the node is good enough. The system is built around the assumption that nodes can never be fully trusted. They don't need to be 100% perfect.

Audits are mostly random, with the exception of a system that audits new nodes with some priority to get them vetted faster.

Which is why literally every storage provider cites durability as a number of 9s. This is the case for every solution you can think of. File durability is ALWAYS a matter of probability.

Btw, downloading and checking all data on the network in any human timeframe is not possible. The costs would just be too high.

I've read some of the other responses and it basically all comes down to wanting your node to be absolutely perfect. That's noble, but the simple truth is that Storj doesn't require you to do that. And since it can never know which nodes are going to be that perfect, it has to assume all nodes are flawed. I run one of my nodes on an HDD that was kicked out of an array because of read errors. And I can assure you it's performing just as well as your checksummed, super-reliable node. It has never failed a single audit. Good enough is just that: good enough.

Good summary.
Fair enough, and actually with all your explanations I think it makes good sense.
It's just annoying to have a node with audit scores constantly bouncing between 80% and 99% ^^

In case of a crash you only lose the pieces not yet written to disk and/or database. That should typically not be more than 10-60 seconds of data, even if your drive is really busy. So even at 20 Mbps ingress for a full 60 seconds, you would have received only 1,200 Mbit of data, around 150 MB. If your node has 4 TB stored, that's about 0.004% of your total storage. That will be completely irrelevant in the long run.
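To make that back-of-the-envelope math explicit, here is the same estimate as a few lines of Python (the numbers are the same assumptions as above, not measurements):

```python
# Worst-case loss from a crash, under the assumptions above.
ingress_mbps = 20         # assumed ingress rate, megabits per second
unflushed_seconds = 60    # worst-case window of data not yet synced to disk
node_size_tb = 4          # total data stored on the node

lost_mb = ingress_mbps * unflushed_seconds / 8        # megabits -> megabytes
lost_fraction = lost_mb / (node_size_tb * 1_000_000)  # 1 TB = 1,000,000 MB

print(f"~{lost_mb:.0f} MB lost, {lost_fraction:.6%} of the node")
# ~150 MB lost, 0.003750% of the node
```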
So it also comes down to BrightSilence's summary:

It comes very close, what with no backups and stuff. At least the uptime requirement is not as harsh now, AFAIK, especially with the software not being very "failover/load-balancing-friendly".

It's nowhere near the level where you would have to worry about file corruption on an otherwise healthy HDD though. Or about corruption from sudden shutdowns. If the HDD hardware survives without permanent harm, your node will too.

There are other possible problems as well, like the HBA, a driver, or ZFS freezing, which would effectively write data straight to /dev/null until the node software tries to sync() and then freezes as well; but until then it could happily report "yep, saved that file" to the satellite.

Yet more examples of issues that won't add up to enough damage for disqualification, or even enough to be noticed. The only way it could lead to a problem is if that situation persists for a very long time. And if it does, that's a great sign of an unreliable node, and it actually should be disqualified then.

I'd say these examples are also a great argument for less complicated setups.

Or, if there was a way for me to notice them before they become big enough to be noticed by the satellite, maybe I would be able to do something to prevent the problem from becoming too bad.

Because, you know, less complicated setups do not lose data. HDDs used without RAID would never develop bad sectors or just fail completely, right :slight_smile:?
Complicated setups have their own problems, but I'd say those problems are not as bad as the alternative.
Those examples, by the way, could also be used as arguments for more complicated setups, like Ceph or some other cluster with replication every minute or so (assuming the satellite can tolerate the loss of one minute of incoming data on the node).

I don't think there is anything stopping you from monitoring data integrity on your end if you want to do this. I'm not sure what you are arguing for here.
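For example, a minimal local spot-check in Python (the directory and hash-database paths are placeholders, and this is not an official Storj tool): it re-hashes a random sample of pieces each run and flags any whose content changed, which should never happen because stored pieces are immutable.

```python
import hashlib
import json
import os
import random

# Placeholder paths; adjust for your node.
BLOBS_DIR = "/mnt/storj/storage/blobs"
HASH_DB = "/var/lib/storj-check/hashes.json"
SAMPLE_SIZE = 500  # pieces to (re)hash per run, kept small to go easy on the disk

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

known = {}
if os.path.exists(HASH_DB):
    with open(HASH_DB) as f:
        known = json.load(f)

all_pieces = [os.path.join(d, n) for d, _, names in os.walk(BLOBS_DIR) for n in names]
sample = random.sample(all_pieces, min(SAMPLE_SIZE, len(all_pieces)))

changed = 0
for path in sample:
    try:
        digest = sha256_of(path)
    except OSError:
        continue  # deleted by garbage collection mid-run; not corruption
    if known.get(path) not in (None, digest):
        changed += 1
        print(f"content changed since last check: {path}")
    known[path] = digest

os.makedirs(os.path.dirname(HASH_DB), exist_ok=True)
with open(HASH_DB, "w") as f:
    json.dump(known, f)
print(f"checked {len(sample)} pieces, {changed} changed")
```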

I never said this

Nor this

Those were examples of extra layers of complexity introducing new points of failure. You just mentioned a bunch more that would do the same.

Furthermore, they trade simple issues like bad sectors (which, unless the HDD is failing completely, would not lead to disqualification) or whole-disk failure (which, if you're running multiple nodes on multiple HDDs, isn't that big a deal) for more complicated issues, like any of these complexity layers failing and taking the whole thing down at once.

Those complex layers work well in data center settings, where the implementation is finely tuned, the hardware is data center quality, and there is redundancy across the entire chain to back up those finely tuned implementations. At that point you are creating an insanely overkill setup for a storage node, especially considering that everyone running their node on an RPi4 is getting the exact same income.

Yeah, these partitions are only for Storj. For my own stuff I've got better file systems.

I don't spend hardware on these things for a storage node; I still don't think it's worth the time/effort/material.

No need for more hardware if your OS runs on an SSD, for example. But if you need additional hardware, then it obviously loses its benefits. We only want to get the best out of the hardware we already have.

Maybe it would be good to be able to pay STORJ tokens to repair the lost information of your node, or to use the held amount (if there is something left) to repair what is missing.

Another way could be for Storj to keep your node payments for a while, or to put your balance in the negative. If repairing the node costs me 120 dollars but I make 40 dollars of profit monthly, I wouldn't mind going 3 months without earning that money.

It's really not worth it for Storj to repair data on a node that has already proven to be unreliable. No reasonable amount of money makes it worth risking data loss on a proven unreliable node.

But the errors are not always because the node is fundamentally unreliable; sometimes there are hardware errors, whether from a damaged sector on the hard disk or a problem with RAM. Some problems can be avoided and others cannot, and bad RAM should not be what kills an 8-12 TB node that has taken months or even years to fill.

For example, my node was suspended because it was moving data from the node's hard drive to another one. That generated IO delay, and since the node took so long to serve the pieces, it was treated as if the data were damaged; when the transfer finished, everything returned to normal. Now imagine a RAIDZ that has to resilver while the node is running: exactly the same thing would happen, greatly increasing the load on the hard disk and its resources.

The information will be repaired to other nodes anyway, but why not give the node operator the opportunity to recover it, even if it means a negative balance or paying to repair the damage caused to the network by an external error?

Thank you

From the perspective of the satellite, the only thing that matters is that the node didn't correctly respond to a request for data. Nodes that did that often enough to get disqualified will likely do it again if kept alive. Since there are plenty of nodes on the network that have never failed a single audit, it's just not worth it to keep those question-mark nodes around, especially since it's impossible for the satellite to differentiate between a node with a one-time issue and one that is starting to fail completely.

They've promised their customers 11 9's of durability. Taking a bet on risky nodes would not get them there. Ensuring all nodes are reliable at all times does.

With that said, I don't think your example is a very good excuse either. I have hosted nodes since the start of V3 and have never failed a single audit on any of them (excluding known Storj issues that have since been fixed). I perform a RAID resync every month without issues and have moved nodes to different storage locations without issues, as have many others.

You should set a different standard for yourself as to what is acceptable. Monitor node behavior during heavy operations and take action before things go wrong. If you don't, then realize you are competing against node operators who can provide that kind of stability to the network, and the network may not be as interested in your nodes in comparison.
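One simple way to do that kind of monitoring (a sketch assuming Linux and a placeholder device name; it derives average I/O latency from /proc/diskstats, roughly the way iostat computes "await") is to alert yourself when the node's disk starts to saturate during a resync or migration:

```python
import time

DEVICE = "sda"       # placeholder: the disk your node's data lives on
THRESHOLD_MS = 100   # warn if average I/O latency exceeds this
INTERVAL_S = 10

def read_stats(device: str):
    """Return (completed I/Os, total ms spent on I/O) for a block device."""
    with open("/proc/diskstats") as f:
        for line in f:
            parts = line.split()
            if parts[2] == device:
                ios = int(parts[3]) + int(parts[7])       # reads + writes completed
                busy_ms = int(parts[6]) + int(parts[10])  # ms reading + ms writing
                return ios, busy_ms
    raise ValueError(f"device {device} not found in /proc/diskstats")

prev_ios, prev_ms = read_stats(DEVICE)
while True:
    time.sleep(INTERVAL_S)
    ios, ms = read_stats(DEVICE)
    delta = ios - prev_ios
    if delta > 0:
        await_ms = (ms - prev_ms) / delta  # average latency per completed I/O
        if await_ms > THRESHOLD_MS:
            print(f"warning: average I/O latency {await_ms:.0f} ms on {DEVICE}")
    prev_ios, prev_ms = ios, ms
```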

Now, I don't want to discourage anyone from running a node. You take the risks you find acceptable. But you will then also have to accept the consequences should they pop up.

Any news on this thread? Does Storj plan to provide a solution to this problem that happens often to operators?

Hardware will always cause problems whether you are a novice or an experienced user, but losing a complete node because of a small part of the data does not look good.

Being able to pay to repair one of your nodes would be a good option; if we have money in the held amount it could be taken from there, and otherwise deducted from the next payment. I don't think anyone is opposed to being able to save their node if they can, and many of us are discouraged by losing a large node of more than 2 TB.