Ability to recover part of lost data / restore the node from backup

Probably a stupid thought, but… here it is:

I get that checking all the data of a node is not even remotely doable, as it would require a massive amount of data to be exchanged between the satellite and the node.

However, I was thinking: could there be a way to check a hash representing a subset of the data the node is supposed to hold? That would make it possible to search dichotomically (like a binary search) for all files that are lost or corrupted.
It would obviously require quite a bit of work from the node, but all of that work takes place locally, so it should take the node only a few minutes to compute the hash for the queried subset.

I’m not sure I’m being clear, so here is an example (with a rough code sketch after the list):

  • The node requests a full check from a satellite.
  • The satellite (which keeps track of all pieces and their hashes) splits the list of pieces the node is supposed to have into, let’s say, 1 subset per 100 GB of data, and sends the corresponding aggregate hashes to the node. If the node has 1 TB of data, that would be 10 hashes.
  • The node receives these 10 hashes and computes the same thing on its side to see whether they match or not (or it could possibly have them cached, I guess, as long as its databases are not corrupted).
  • If one of the subsets does not match, it means files are corrupted or lost within that 100 GB subset. So the node reports back to the satellite that something’s wrong within this subset and asks for 10 more hashes for it.
  • The satellite returns a hash per 10 GB of this subset to the node.
  • The node checks those, to find out which subparts are corrupted.
  • … and so on, until the node knows precisely which files are missing or corrupted.
  • The node can now ask for a repair job for the few files that are missing/wrong.
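
Very roughly, in code the narrowing could look something like this. This is only a hypothetical sketch: `subsetHashFromSatellite` and `localSubsetHash` are made-up helpers standing in for satellite/node APIs that don’t exist today, and it uses a binary split instead of the 10-way split above for simplicity:

```go
package main

import "fmt"

// pieceRange identifies a contiguous slice of the (sorted) list of pieces the
// node is supposed to hold, i.e. pieces [start, end).
type pieceRange struct{ start, end int }

// Hypothetical helpers: the first would ask the satellite for the aggregate
// hash of a range of pieces, the second would compute the same hash from the
// pieces on the node's own disk. Neither exists in Storj today.
func subsetHashFromSatellite(r pieceRange) string { return "" /* ... */ }
func localSubsetHash(r pieceRange) string         { return "" /* ... */ }

// findDamaged narrows mismatching ranges down until they cover a single
// piece, which must then be missing or corrupted locally.
func findDamaged(r pieceRange, damaged *[]pieceRange) {
	if r.start >= r.end {
		return
	}
	if subsetHashFromSatellite(r) == localSubsetHash(r) {
		return // whole subset matches, nothing to repair in here
	}
	if r.end-r.start == 1 {
		*damaged = append(*damaged, r) // narrowed down to one piece
		return
	}
	mid := (r.start + r.end) / 2
	findDamaged(pieceRange{r.start, mid}, damaged)
	findDamaged(pieceRange{mid, r.end}, damaged)
}

func main() {
	var damaged []pieceRange
	findDamaged(pieceRange{0, 1_000_000}, &damaged) // say the node holds ~1M pieces
	fmt.Println("pieces to request repair for:", damaged)
}
```

The point of doing it with range hashes is that only on the order of log2(N) small hashes per damaged piece would cross the wire, instead of any actual piece data.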

I guess one of the issues to overcome is that files keep being received and deleted while the node is busy computing hashes for the check. But as long as the check is timestamped, the node could ignore files added after that timestamp, and check files deleted after that timestamp from the trash.
Also, it would probably require satellites to keep refreshing hashes whenever a file gets added/deleted. I’m not sure that would be sustainable from a CPU point of view, unless there were a clever/mathematical way to update a hash with another one instead of re-reading all the files it represents…
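
For the “update a hash with another one” part, there are order-independent set hashes that can be updated incrementally. A minimal sketch, using an XOR combiner purely to illustrate the O(1) update (a real deployment would want something stronger, such as an additive/lattice hash, since XOR-of-hashes is easy to forge):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// setHash is an order-independent aggregate over per-piece hashes: adding or
// removing one piece only touches that piece's hash, instead of re-reading
// every file the subset represents.
type setHash [sha256.Size]byte

// toggle XORs a piece's hash in or out; the same call adds a new piece or
// removes a deleted one.
func (s *setHash) toggle(pieceID string) {
	h := sha256.Sum256([]byte(pieceID))
	for i := range s {
		s[i] ^= h[i]
	}
}

func main() {
	var subset setHash
	subset.toggle("piece-a") // piece uploaded
	subset.toggle("piece-b") // piece uploaded
	subset.toggle("piece-a") // piece deleted again: O(1) update, no re-scan
	fmt.Printf("subset hash: %x\n", subset)
}
```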

I’m a bit confused by my own explanation, I must say, but… the rough idea is basically to run checks based on hashes instead of actual data, to cut down on the amount of data exchanged between satellites and nodes when the latter ask for a repair job.

Feel free to point out why this is probably not doable either :slight_smile:

This is a wide attack surface; even without @littleskunk’s superpowers I can figure that out.
As a malicious operator I would lie on every single request and claim that I still have all those pieces (I would simply return the same list of hashes to the satellite), then wipe all the data to receive more profitable data that gets downloaded more often.
And I would repeat the same scenario if the new data is not downloaded as frequently as I want.
Lying about timestamps is even simpler.

I would have difficulty lying if I didn’t know the correct answer without having the actual piece in place.
The satellite doesn’t just ask for hashes, it downloads part of the audited piece!
So I must have the piece to be able to pass the audit.
To confirm that I hold all pieces, I would have to pass an audit for every single piece.
Every single piece would have to be downloaded to the satellite for audit, at the same cost as a repair. But in the case of a repair job, the satellite downloads only the minimum required pieces, not all of them as in a full audit. And the repair job is triggered only when the number of healthy pieces falls below a threshold.

Oh but I’m not saying satellites should rely on that at all!
Satellites should keep the current system to make sure a node is reliable, that’s for sure.

I’m suggesting a system that could be triggered on a node’s initiative, in case some files were lost, to make sure the node gets repaired and back to a healthy state with a 100% score everywhere!

We could adjust the proposal made by @Pac:

  • If the satellite doesn’t send its own list of hashes but instead asks the node to send the list of hashes for its data at a specific time (based on a timestamp), would that be better in terms of security?

There is still a problem, though, with data deleted after the satellite computed its hashes but before the node computes its own.

It’s too expensive to build such a list on request; it’s a time- and CPU-consuming operation. This is one of the reasons why the satellite sends a Bloom filter to remove garbage instead of a list of known pieces.
It’s too much work to solve the problem of “which pieces are missing” for a result that’s useless anyway.
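
For context, a Bloom filter lets the node answer “is this piece definitely garbage?” against a compact bit array instead of a huge list of piece IDs. A minimal sketch of the idea (this is not Storj’s actual garbage-collection code; the sizing and hashing here are made up):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloomFilter is a compact, probabilistic "set" of the piece IDs the
// satellite still knows about. False positives are possible (some garbage is
// kept a little longer), false negatives are not (no live piece is deleted).
type bloomFilter struct {
	bits []byte
	k    int // number of hash probes per piece ID
}

func (b *bloomFilter) probe(pieceID string, i int) uint {
	h := fnv.New64a()
	fmt.Fprintf(h, "%d:%s", i, pieceID) // derive k different probe positions
	return uint(h.Sum64()) % uint(len(b.bits)*8)
}

// Add is what the satellite side would do for every piece the node should keep.
func (b *bloomFilter) Add(pieceID string) {
	for i := 0; i < b.k; i++ {
		p := b.probe(pieceID, i)
		b.bits[p/8] |= 1 << (p % 8)
	}
}

// MayContain is what the node checks during garbage collection.
func (b *bloomFilter) MayContain(pieceID string) bool {
	for i := 0; i < b.k; i++ {
		p := b.probe(pieceID, i)
		if b.bits[p/8]&(1<<(p%8)) == 0 {
			return false // definitely not in the satellite's "keep" set
		}
	}
	return true
}

func main() {
	f := &bloomFilter{bits: make([]byte, 1<<20), k: 4} // ~1 MiB filter, hypothetical sizing
	f.Add("piece-a")                                   // satellite side: piece-a is still live

	for _, id := range []string{"piece-a", "piece-b"} { // node side: pieces found on disk
		if !f.MayContain(id) {
			fmt.Println("move to trash:", id) // piece-b gets collected
		}
	}
}
```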

The same pieces would not be recovered anyway. The repair job works in a different way: it downloads any 29 remaining pieces, recalculates the missing pieces, and uploads them to the network (there is no mechanism for pinning the destination of those pieces; they are placed on random nodes, and specifically not on the same nodes). Then the pointers are updated to the new piece locations, and the garbage collector removes the pieces belonging to deleted pointers from the storage nodes with the help of a Bloom filter.
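
To illustrate that reconstruct-from-any-29 idea, here is a minimal sketch using the github.com/klauspost/reedsolomon library. This is not Storj’s repair worker, just the same Reed-Solomon principle; the 29-of-80 numbers follow what’s mentioned in this thread:

```go
package main

import (
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	const dataShards, parityShards = 29, 51 // 80 pieces per segment, any 29 are enough

	enc, err := reedsolomon.New(dataShards, parityShards)
	if err != nil {
		log.Fatal(err)
	}

	// Encode a dummy segment into 80 shards ("pieces").
	shards := make([][]byte, dataShards+parityShards)
	for i := range shards {
		shards[i] = make([]byte, 1024)
		for j := range shards[i] {
			shards[i][j] = byte(i) // placeholder data
		}
	}
	if err := enc.Encode(shards); err != nil {
		log.Fatal(err)
	}

	// Simulate lost/corrupted pieces: nil means "not retrievable from that node".
	for i := 0; i < 40; i++ {
		shards[i] = nil
	}

	// The repair job fetches the surviving pieces, recomputes the missing ones,
	// and uploads the recomputed pieces to *new* nodes, never back to the old ones.
	if err := enc.Reconstruct(shards); err != nil {
		log.Fatal(err)
	}
	log.Println("all 80 pieces reconstructed from the 40 that survived")
}
```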

Additionally, the only outcome of significant damage detected with such proofs would have to be the satellite disqualifying the node, because it obviously can’t be trusted. It’s a lose-lose situation.

Alright, alright…
It’s just so weird to me that a node can survive after losing some data because it “is covered by statistics” (as in, it’s not broken enough), but would get disqualified if it actually tried to make things right…

Yeah, this seems strange to me too. Trying to do the right thing and admit some data loss would result in worse consequences than trying to cover it up and hoping those files do not get audited one after another.

Well, it would result in the same consequences. You’d just be accelerating the process.

I would add: by the time the threshold is reached, the customer may have removed their files and we wouldn’t need to run a repair at all :slight_smile:

So, hiding the fact that I have lost data and hoping not to get the “wrong” audits one after another is better than admitting the problem.
This kind of feels weird, but maybe there is no better way.

Isn’t it also a bit like when one loses an HDD in a RAID… one gets a step closer to data corruption? Sure, with erasure coding there are many more steps, but if there is some amount of corrupted data in the chain, then one essentially doesn’t know how close one is to failure…

Then the longer a node can survive with corrupted data, the larger the odds that others in the chain fail and it will be required or requested to help restore/repair the data.

Just like restoring a lost drive in a RAID has very different odds of failure depending on whether it takes a couple of hours or two weeks to resilver the lost drive…

Yes, erasure coding has many more pieces, and repair kicks in mathematically much sooner than strictly needed…

But if one doesn’t know how many pieces in a chain are corrupt, can it really be counted on? Over time corruption will sneak into nodes, and a damaged file here and a damaged file there won’t DQ a node… so eventually, at least in theory, it will all align and an erasure-coded data chain will fail…

Of course one could include a warning mark… so one sort of has a trigger that tells a data engineer that this chain has reached its repair point, and that during repair it was discovered that an additional 5 pieces of the chain were corrupted, or whatever threshold one would want…

Of course 5 corrupted pieces won’t mean data loss, but over time the numbers will go up, because there will continually be more and more silent corruption in the data.

This way, if the problem isn’t solved in the future, a warning would be issued and the repair threshold could simply be raised to deal with the issue… if the network isn’t too large to raise the threshold in the time allotted.

It’s most certainly not the best approach, at least for the network… but as long as one can ensure the data doesn’t get corrupted, I don’t suppose it really matters… and of course you might get punished for data that went corrupt 10 months ago, when your node was 4 months old: all of a sudden it reaches the repair threshold and a lot of your corrupt files get requested at once, because it was all uploaded at the same time and now the customer is back, or whatever…

And BOOM, your node is dead… 10 months after the initial damage, seemingly at random, but really it just died 10 months ago and the network hadn’t noticed yet…

There must be a better way… maybe some local checksum tracking… like ZFS :smiley:
Of course, if one could report it… let’s say you report 2/3 of the data that was corrupted, and let’s say that’s the maximum amount of corrupted data you’re allowed to report in, say, a year… then you just decreased the odds of your storage node dying by 2/3.

I suppose this essentially comes down to the “everybody lies” viewpoint… the satellites don’t trust a single node; they can’t, because the erasure coding chain would essentially break if it relied on a single data point.

So any data we storage nodes provide cannot be trusted and has to be verified against multiple other data points.

There must be a better way though… not sure I can see it right now… this all gets a bit abstract and long…

I know one guy who has a corrupt file he knows about because of ZFS… it seems odd that he can’t report it as broken… it would benefit both him and the network…

And I don’t see why losing a single file out of, I think, the 4 million files or so he has should matter… 1 in 4 million isn’t bad… that’s like a 99.9999% success rate, pretty good when running without ECC memory on consumer-grade hardware.

And yet the standing official Storj Labs view is that if you have corrupted data, your node should be DQed…
That doesn’t seem fair if it’s actually at the limit of what the hardware itself can do… 99.9999% is maybe kinda close to the theoretical limit of HDD data integrity.

But I’m sure it will be sketched out better when Storj Labs has more time to spend on such minor, long-term, annoying, and almost theoretical problems.

You can know exactly what the risk is though. The disqualification threshold is tuned to that. I don’t have the exact numbers, but say a node can fail 1% of audits without being disqualified. Worst case, there would be a 1 in 100 chance of a given piece not being on an otherwise healthy node. Currently the repair threshold is at 52 pieces. At that point you can lose another 17 pieces before repair is no longer possible. So at that point the chance of losing all of those 17 pieces is (1/100)^17 = 1e-34. That’s 34 nines of durability.
And this assumes that all nodes are right at that threshold; most nodes are of course much healthier than that.
I picked 1% kind of at random, but even if you allowed nodes to fail up to 10% of audits, the current repair settings would still give 17 nines of durability.
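
Just to make that arithmetic concrete, a quick back-of-the-envelope check under the same assumptions as above (a per-node piece-loss probability p, 17 pieces to spare, independence between nodes):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	const sparePieces = 17 // pieces that can still be lost at the repair threshold

	// p is the assumed worst-case chance that a given piece is missing
	// from an otherwise healthy node (1% and 10%, as in the post above).
	for _, p := range []float64{0.01, 0.10} {
		segmentLoss := math.Pow(p, sparePieces) // all 17 spare pieces gone at once
		nines := -math.Log10(segmentLoss)       // "number of nines" of durability
		fmt.Printf("p = %.2f -> segment loss ~ %.0e -> ~%.0f nines of durability\n",
			p, segmentLoss, nines)
	}
}
```

This prints ~34 nines for p = 1% and ~17 nines for p = 10%, matching the figures above.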

You know the worst case, and through monitoring you can know how close you get to it. I’m sure Storj Labs collects data on repair processes and how often pieces can’t be returned during repair, and they will tune the repair threshold accordingly. It’s not a matter of exact knowledge, but a matter of statistics. This is exactly why it’s so important to disqualify nodes that have lost data: the chances of further data loss are much higher on nodes that have had previous failures, and that throws the entire balance off.

It won’t; the network will reach an equilibrium, because nodes with too much bad data get disqualified. Only at first will it get worse over time, until that equilibrium is reached. But the network has been online long enough, with enough nodes, that we’re probably in that equilibrium state already.

This claim requires evidence that isn’t provided. It’s a balancing game between fostering a reliable SNO base and taking a tiny risk of corruption flying under the radar for a short time. If this system prevents even a few unreliable SNOs from constantly setting up and removing nodes, it’s definitely going to be a net positive. Allowing less reliable behavior would have a LOT more impact on durability than letting nodes with a few corrupted files exist for a little longer. Especially since partial repair also allows unreliable storage to stay on the network for longer. I don’t see that trade-off working out very well, but I don’t have the stats to confirm it. Storj Labs does, though. They also have a data science team whose job it is to interpret that data and make the right decisions. So, please provide your evidence.

That’s simply not how it works. Only audits are used to judge your node’s reliability. You would start failing audits right after the corruption happens. It might take a while to get disqualified if the corruption level is low, but it won’t suddenly happen months later. If it takes months, you’ll see your reputation dropping during those months.

Sigh. You should get a limit on how often you’re allowed to mention ZFS. ZFS is great for specific things, but its mechanisms aren’t designed for a distributed network with untrusted nodes. Local checks are meaningless to satellites because the node can’t be trusted. So why trust the outcome of local checksums?

I would just adjust my nodes to tell satellites that I’ve lost the pieces that are never downloaded, to make some space for more profitable data. It’s way too easy to cheat a system like that. It would also lead to a massive increase in repair costs, so who’s going to pay for that?

Where does it say you’re not allowed to have a single corrupt piece? You can have tons of corrupted pieces before you are disqualified. If you have only one corrupt piece, it’ll likely never even be audited.

You’re thinking about this in a completely different way than how Storj has been designed. It’s impossible to know exactly which pieces are and aren’t there. That would require way too much data exchange. The best you can do is ensure a minimum level of reliability that nodes are allowed to have and design the redundancy and repair systems around that. Disqualification ensures that nodes with a lower reliability are kicked out and RS and repair ensures that with those reliability limits file durability is ensured.

Here’s the bottom line: any system that allows repairing specific pieces is a trade-off between node reliability and piece reliability. You allow less reliable nodes to stay online and trade that for specific knowledge about a specific piece or list of pieces. But piece reliability is meaningless for satellites, because they still have no idea whether the corruption reports are complete and can’t rely on the node’s data to begin with. All that remains is that you’re allowing nodes that lost data to stay online, which means you’re allowing unreliable nodes to recover from their damage, and that just lowers the average reliability of the network.

You really need to stop thinking about individual pieces. The satellites never deal (and can’t deal) with reputation on that level. The only question the satellite is interested in is: “how reliable is your node?”

A local check would be useful for me to run on my own node, to see whether I should continue running it or just delete the data because the node will be DQed anyway.

Then I suggest you use a filesystem that supports checksumming.

Any recommendation?

I think you should ask @SGC :rofl:
But I guess you’re gonna make me say it. ZFS is probably the most used filesystem with checksumming at its core. I just don’t feel like this is a Storj problem to solve.

My opinions… below.
If you don’t mind tinkering a bit and learning something that’s very different, and you plan to work with large storage solutions, then ZFS is the way to go…

But if you want stuff to be simple and easy to understand… ehhh… well, ZFS is that, once you’ve learned it… but it’s very different from everything else.

There are a few other filesystems out there with checksums, some more modern than ZFS… ZFS is getting old in some respects… but it’s most likely still the best solution you can get… it’s just amazing what it can do, but all those advantages also come with some major drawbacks…

Like, if we’re talking ZoL (ZFS on Linux), now being rebranded as… OpenZFS, I think… and merged with the FreeBSD version… when they finally release OpenZFS 2.0… which I’ve been waiting for… most of 2020 :smiley:

In ZoL there isn’t any way to remove drives from a raidz array… you can only go up… and really, if you do something like raidz1 (RAID5) on 4 drives, then you really should add 4 more drives when you upgrade…

So you end up going 4, 8, 12 drives, or 8 drives for a RAID6-style setup and then +8 more drives for the upgrade…
It’s really built to scale… to unreal scales… but sadly, I kinda doubt most of us need that level of expansion.

So ZFS may be the most mature, most reliable, most expandable, cheapest, easiest-to-manage storage system you can set up… but it can be a whole lot of trouble at times…

It also works fine for a single HDD… I mean, I would never go back to filesystems without checksums, but really I’m beyond that now… I will never store my data without redundancy again… it’s simply too much trouble to deal with damaged data, and it pisses me off.

I can highly recommend ZFS, just as much as it makes me want to punch a wall at times. Hopefully that will be fixed in the near future… but it won’t… it’s basically an impossible math problem. I kinda decided to start running mirror pools… because they are super fast, with high IOPS, and ZFS lets me add and remove mirrors as I want… so that’s my stop-gap measure until I find something better, or get pissed off enough to go find something better… or maybe OpenZFS 2.0 will give me the ability to remove drives from a ZFS raid array/pool, but I doubt that… I saw a lecture by the guy who made the FreeBSD version of that feature… it worked, but it didn’t sound like something one would want to use…

But having the option is nice… maybe… it might just make one use it and then end up with an even more messed-up pool long term…

So yeah, not sure if that was a recommendation or a warning…
Like Wendell of the YouTube channel Level1Techs says (linked above):
ZFS is the filesystem the Starship Enterprise would use…

For kinda this reason I started my nodes on btrfs, and it was fine. In theory I can run a scrub, and if I notice that a file went bad, I can just manually remove it and claim it was never there :wink: freeing space for other files. But I doubt it’s worth the operator’s time…

I’m now slowly migrating them to ext4, as I just don’t think there’s any real value in btrfs for storage nodes, and ext4 has lower I/O and space overhead.

I don’t think you completely understand the problem with that… silent data corruption is just that: silent… so over time the data will get corrupted across nodes, and the longer the data is stored, the higher the odds that it has become corrupted…

But yeah, I don’t completely understand the details of the… whatever the thing we are running is called… :smiley: the Tardigrade backbone… maybe… I’ll call it that… at least that’s a name that makes sense.

Maybe Storj Labs did take that into account and tuned the audits so that more data is audited per node than is needed to counter the average silent HDD corruption rate, if there is such a thing… I’m sure people can pull out a lot of numbers, but who knows… it’s detailed stuff, and easy answers… well, those mostly just vaporize when exposed to detailed study.

It only takes a few bytes being wrong to invalidate an entire piece… so with 17 to spare… you think that makes it highly unlikely, and yet your math doesn’t involve any time factor… and yeah, sure, from a timeless, immediate perspective it looks very unlikely to happen… if not impossible, since 1e-34 is an astronomically small number…

Alas, I digress… corruption is a function of time, but like you say… that’s what DQ is for… but really… checking all the data on a storage node is a tall order, so essentially one trusts that the node is okay… but with normal hard drives that’s rarely the case… you can rarely read a full HDD without a couple of errors… even the manufacturers state that in their documentation.

That’s the problem with silent corruption… it might be something like 0.1% of a storage node that is corrupt… and it may not be read until it’s time to repair the network…

The only real way I can see to protect against it actually becoming a danger would be a sort of fuse that gets triggered and runs an earlier repair cycle to restore the erasure-coded chain… of course that may be built into the erasure code math for all I know…

Could very well be, now that I think about it, because long term it would never be able to store data without something like that… and let’s assume it’s pretty damn resilient, since governments have been trying to kill certain erasure-coded data sets for what… decades now… not related to Storj, but stuff like data piracy and Usenet and such.

I assume the 52 pieces are on different nodes, and thus the 17 spare pieces are on 17 different nodes…

So what you are trying to tell me is that it’s unlikely that 17 nodes in a row have corruption, even though any storage node larger than a few TB will almost certainly have some level of corruption… sure, you have 17 to spare… which is about 1/3 of the total pieces…

Of course, I did assume that audits are random… but I don’t suppose they are either… which might also solve the issue, if a full traversal of audits over the erasure-coded data is faster than the average inherent corruption rate of HDDs.

So yeah, I’m sure it’s well in hand even though I don’t understand it… xD

Just sort of saying it’s most often the stuff we didn’t expect to kill us that will… lol

It was meant to be able to handle corruption rather than having to read all the data over the network… I dunno how much data that is… but let’s assume the network had to do the same maintenance as is recommended for… ZFS :smiley: once a month… to be sure the HDD data is reliable, you would have to read the entire network’s data… or let’s say it’s 17 times better, so once every 17 months… still, reading the entire network every year and a half seems like a lot of work… of course one could kinda merge it with regular usage, which would greatly reduce the data required… but that would again leave dead spots…

You make it sound so easy to ensure data integrity over time… but yeah, maybe the magic is in the erasure coding, or the network keeps track of what’s audited and will run through everything eventually…

Actually I think littleskunk said that, but the system needs nodes… so they have to accept some amount of corruption, of course… but in a perfect world they wouldn’t want it at all.
But yeah, that came out wrong.

But that’s kinda the problem… the data is stored on a degradable medium; there are no reliable nodes long term, only nodes that haven’t yet shown signs of their data corruption.

But maybe there is secret sauce in the erasure coding that solves that issue…
I just don’t see how one gets past it, with how I understand the Tardigrade backbone works.

You cannot just magically verify… the… data… unless one used CHECKSUMS xD
I mean, one should be able to do checksums without the computer doing the checksums actually needing to be trusted. It’s just sums… it would of course become a… load of data long term… but I suppose one could most likely do some sort of layered checksums to fix that.
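
The “layered checksums” idea is basically a Merkle tree: hash the per-piece checksums pairwise until one small root remains, so whoever verifies only needs to keep the root rather than a checksum per piece. A minimal, purely illustrative sketch (nothing like this is implied to exist in Storj):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// merkleRoot folds a list of leaf hashes pairwise until a single root is left.
// Keeping just this root is enough to later detect that *something* in the
// set changed; keeping the intermediate layers lets you narrow down where.
func merkleRoot(level [][]byte) []byte {
	if len(level) == 1 {
		return level[0]
	}
	var next [][]byte
	for i := 0; i < len(level); i += 2 {
		if i+1 == len(level) {
			next = append(next, level[i]) // odd leaf carried up unchanged
			continue
		}
		h := sha256.New()
		h.Write(level[i])
		h.Write(level[i+1])
		next = append(next, h.Sum(nil))
	}
	return merkleRoot(next)
}

func main() {
	pieces := [][]byte{[]byte("piece-a"), []byte("piece-b"), []byte("piece-c")} // placeholder piece contents
	var leaves [][]byte
	for _, p := range pieces {
		h := sha256.Sum256(p)
		leaves = append(leaves, h[:])
	}
	fmt.Printf("merkle root: %x\n", merkleRoot(leaves))
}
```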

Maybe I need to start storing my stuff with erasure codes lol… Ceph can actually do that, can’t it… :smiley:

And if erasure coding / the Storj network is probability-based, doesn’t that mean one can essentially never be 100% sure the data is actually there… but I suppose one can never really have 100% certainty when stuff directly interacts with the fabric of reality anyway… still, in theory… it will always just be a lot of 99.9999…% sure it’s there.