Storagenode Recovery Mode

Correct. Maybe backups are pointless; somebody has to bear the cost. For a backup, a SNO needs space they could otherwise rent out. For a local backup to be trusted, the satellite would have to compute and verify the pieces. And even then, a local backup can still fail and produce garbage.

Well, but that is what you are already doing today with the escrowed amount. I agree that it is a lot of money, but of course it must be high enough to prevent abuse, like frequent recovery operations from unreliable nodes, because somebody has to bear the cost.

Maybe the only way to trust a local backup would be if the node application itself created, verified, and encrypted the backups with a key, so that the SNO cannot fiddle with them. Any other local way of handling backups is not trustworthy for the satellites.
Basically everything is just a matter of probability, though not in a strict mathematical sense. If you have passed 100k audits, it is very likely that your node is reliable, but it is no proof. As you said, you can fiddle with the data, or your hardware blows up; even new drives can fail at any second.

I think that is a good point, and I think it is valid. But I am wondering whether the databases are really the problem. Compared to the actual data stored, the databases are fairly small, right? They could be backed up hourly to different media. But I have not thought through whether that would solve anything.

True. But if you pay for it, why would you not keep it?

If it were possible, I would run the database in a cluster while keeping logs, so I could restore to any point before the corruption.
Or maybe access to the data should not depend on the database? Even if I delete all of the database files, the actual data is still there and should be accessible to the customer.

To make the recovery faster, I can run the download script on a server with faster connection and only keep the files I do not have already. If my node has 10TB of data and loses 100GB, I can download the 10TB on a server with a 10G connection (and a 200GB drive) while only keeping the 100GB I do not have. Then I could move the 100GB to my actual node using my internet connection that is slower than 10G.
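That filter step could be sketched roughly like this in Python. The directory layout, the `.sj1` extension, and the function name are all made-up illustrations, not the node's real blob structure:

```python
import shutil
from pathlib import Path

def keep_missing(scratch: Path, node_blobs: Path, out: Path) -> int:
    """Keep only downloaded pieces the node does not already have.

    scratch:    where the fast server downloaded everything to
    node_blobs: a listing (or mount) of the pieces the node still holds
    out:        the small set of pieces to ship back to the real node
    """
    kept = 0
    for piece in sorted(scratch.rglob("*.sj1")):
        rel = piece.relative_to(scratch)
        if not (node_blobs / rel).exists():
            dest = out / rel
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(piece, dest)
            kept += 1
        piece.unlink()  # free the small scratch drive as we go
    return kept
```

This way the 200GB scratch drive never has to hold more than the download in flight plus the kept pieces.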

  1. A zfs snapshot consumes very little space and can protect from quite a few problems. It will not protect from three drives failing in the raidz2 pool, but it would protect from database corruption and from a wrong rm -rf command.
  2. Because the data on the node does not change a lot, I could make a full backup to tape and keep the (much smaller) differentials on hard drives. I cannot run a node on a tape drive.
  3. I could use network storage or SMR drives for the backups.

Backups protect from mistakes that anyone can make. Just having more space does not.

I have outlined a way for it in the other thread.
Other than that, local backups are neither more nor less trustworthy than my current node, and especially not less than a brand-new node.

One use case I could see for me personally in something like this: I have my storagenode on a ZFS pool created by Proxmox. Sadly, Proxmox (or rather ZoL/OpenZFS) is using the Solaris base code, or parts of it, for ZFS.

It runs 8K minimum block sizes. Because of this, my system gets a 16x I/O amplification, since my SSDs have 512-byte sectors; my old SATA SSDs, though reasonably fast and more than I really need, cap out on 4K I/O.

To solve this I need to remake my pool, and the only way to do that is if I could actually manage to move the data. Currently I don't have any space of significance outside the pool, so in theory the data could be uploaded and downloaded, even if the upload would take a while, maybe a week or two.
I'm sure I'll find a better way, because that is a pretty long time, but it would be pretty cool to have the option: to tell the network "I have to shut down for 14 days while I move my datacenter", have it take the data back into the cloud, allow me to move, and then give my node higher priority than others when I reconnect. Getting the data back is kind of unrealistic in most cases, though.

@jammerdan local backups could be snapshots. Snapshots are a feature of some storage solutions, e.g. ZFS, which can take an instant full image of a filesystem; from that point on, only the changes take up new space, and the snapshot can be merged back in afterwards.

So say I take a snapshot at noon yesterday, and today the network started uploading corrupt data because of some kind of error. Then, by satellite request or whatever, I could roll back my snapshot and thus remove any data changed since noon yesterday, which could be useful.
There are lots of ways to utilize it. And yes, it does of course use a bit of space, but it uses the existing data as a reference map, so only the changes take up space.

But paying $4.50 to restore the 100GB out of 9.9TB…
Though to be fair, I don't care about the 100GB lost; I care that the node doesn't die and all the data is wasted. Of course there should be punishment, but if one has a snapshot that is 12 hours old and all the data in it is good, the only problem is that it's an old DB and only 99% of the data is there.
And I accept that this is not acceptable as-is, but it would be a nice feature which could even save a great deal on repair jobs, which, from what I understand from Alexey, are very expensive.

lots of interesting stuff about this…

Ok, I was thinking more along the lines that it would require setting up a new node with a "new" identity and then downloading everything.
But maybe your answer is why nuking a node once it has failed is the best approach for Storj. If a SNO is tempted to fiddle with the data in such a case, like returning to the network in an unknown state and downloading only what is missing for personal gain, maybe it is in Storj's best interest that such a SNO has to start all over again. Maybe?

There is at least one distinction though: with a requirement for data recovery, your node has already proven to be unreliable, and I don't know how to get around that.

Do not forget that for big nodes (>10-20TB), which have demonstrated reliability for many months or years, it will be harder to start from scratch, and some of them will leave the project.


Lots of mixed discussion going on here. Let's try to separate a few things out, starting with the database corruption that has been mentioned a couple of times.

As far as I'm aware, the databases hold only non-vital information; that data is not necessary for your node to function. Currently, the only way to fail audits because of the databases is when they are corrupt and can't be accessed at all. These errors fall under the unknown-audit-failure category and don't lead to disqualification, but to suspension instead. Suspension already gives the SNO time to recover from this error by repairing the db, restoring a working db backup, or starting over with an empty db with the right schema. So there is already a way out of db corruption that doesn't involve any lost data, and your node can be recovered. If there still is some db that is currently vital for audits (which I don't believe is the case to begin with), then THAT should be fixed. We don't need an additional method to recover from that.

Two scenarios are left: either the node has lost part of its data, or it has lost everything.

If the node has lost part of the data, say 1%, it would be nice if it could continue on with the 99% that is still there. But I really don't understand why everyone is so bent on restoring the missing data. The only way to restore that data is by repair, and most of those pieces would not have hit the repair threshold. So that means wasting money on downloading 29 pieces to restore each 1 piece on your node. It's incredibly wasteful, and the obvious option is much simpler: the node reports to the satellite what data is lost, the satellite marks the pieces as lost, and everyone moves on. The node would be punished by missing data that it used to have, and the related income from that data. But the tradeoff is that it will no longer get audited for that data and gets to live on. Restoring the lost data shouldn't even be considered in this scenario, as it is of literally no benefit to the network. Repair will trigger automatically for pieces that need repair, through the existing systems.
Now this leads to a few challenges:

  • How does the node determine which pieces were lost, without super expensive checks from the satellite?
  • How do we prevent nodes from using this system to cover actual weaknesses in the hardware? If an HDD starts failing, the node can scan the HDD and keep reporting new pieces as lost, while in that scenario the entire node should no longer be trusted.
  • How do you prevent cheating with this system by the node simply reporting all files that are audited as lost?
  • How do we pay for the repair that eventually needs to happen because of this? Held amount? Do we build up part of that held amount again?
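The first bullet could, in the simplest case, be a node-side diff between a satellite-provided manifest and what is still readable on disk. A rough sketch, assuming a hypothetical manifest exchange and made-up `.sj1` naming (neither is an existing API):

```python
from pathlib import Path

def find_lost_pieces(expected_ids: set[str], blob_dir: Path) -> set[str]:
    """Return piece IDs the satellite expects this node to hold but that are gone.

    expected_ids: hypothetical manifest of piece IDs sent by the satellite
    blob_dir:     the node's local piece storage
    """
    present = {p.stem for p in blob_dir.rglob("*.sj1")}
    return expected_ids - present
```

This only answers "which pieces are missing"; the anti-cheating bullets above would still need some proof that the remaining pieces are intact.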

The last scenario is that the node lost all files. Restoring this from the network runs into the same problem again: it's highly costly. It means downloading 29 pieces to recover 1, so the previous $450 per TB wouldn't even get close to covering the cost. And sure, you can use that repair to restore more pieces, but a lot of these segments would still likely have more than enough pieces on the network to not be even close to needing repair. So there is no remotely affordable way to restore the data. This node has also shown itself to be unreliable; it can claim that the problem was fixed, but we can't simply trust that, so it needs to be vetted again. The data was also lost, so we need to use the held amount for repairs. So what would we need to do to keep this node in the network? Empty it, vet it again, use the held amount, and build up new held amount. So… basically start a new node. The choice would be: use the same identity and have that failure hanging over your head, or just start over. I'd say SNOs are better off just starting over.


But then shouldn’t my email, wallet and IP address be banned too? If I lose 100GB of data and delete the rest, I have proven that I cannot run a node correctly.

But what about the alternatives I wrote about in the other thread (I think we should choose one of the threads and concentrate the discussion there)?

I guess GET_AUDIT requests the whole piece; that would be the only way to reliably prove that the node is storing the data correctly and not just storing hashes.

Alexey was probably referring to something theoretical about requesting only parts of a piece (as a response to SGC) and I mistook this information for the general principle of audits.

Hehe, hindsight is a wonderful thing, but yeah, I fully agree with most of your assessments.
Well, my ZFS could of course tell me, or provide me a list of, which files have been affected in case it is unable to restore them.
Sadly I haven't managed to test this feature yet; I haven't hit it hard enough. Kind of expected it to die before now… xD

Yes, shall we stay here?

I am not really familiar with GE. It is my understanding that the data gets downloaded from the node and redistributed to other nodes on the network? Does it create costs for Storj?
I don't think data downloaded from the initiating node is paid, and ingress to other nodes would be free. But I believe the data must be verified before uploading it to some other host, wouldn't it?

AFAIK, during GE, the exiting node uploads its data directly to other nodes or something. The idea is to avoid paying other nodes for GET_REPAIR. The exiting node does not get paid for egress, but gets the escrow money back.

No, not really. The satellite sends an algorithm to three or more nodes that all hold a certain set of pieces, and the nodes are asked by the satellite to apply it to the pieces in question.

here is my idea thus far…
This gives back a checksum created from reading all the data and doing some basic math, taking up a tiny fraction (1/100000000000 or whatever) of the original data. After some hours (job-size dependent, of course) have passed, the satellite gets back the checksums and compares them. Then it stores the created checksums, or in some way utilizes the processing the nodes did, to build a sort of checksum puzzle that the nodes cannot figure out, because:

1. They don't know the algorithm they need to apply, and they don't know which pieces will be selected. Because of the many variations, one cannot prepare the answers without having the actual data, which works as a focal lens to verify the integrity of the whole.

That way the satellites just need to keep a map of sorts.

There is no real usage of bandwidth,

and all the data on the nodes could be verified. If one nests and scales it in the right ways, it is very likely that one could also identify the exact pieces that are broken. But for right now that is outside the scope of what I can do on the back of a napkin.

Can you explain this further? How can this be more expensive than a customer downloading data from Tardigrade?

Because a customer trying to download a file needs all the data of that file: they download 29 pieces to retrieve basically 100% of that size. But one node only holds one piece, and to recreate that single piece you still need to download 29 pieces to rebuild it.

Think of it as parity. If you have a RAID5 array with 5 disks and you read a 4MB file, you just read 4 MB. But if you want to rebuild the parity information on that file, that parity information is only 1MB, but you still need to read all 4MB to create it. (I’m guessing this is one of the few places where this analogy is helpful, let me know if it’s not. I can try better)
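Putting rough numbers on that amplification (29 is the reconstruction threshold discussed in this thread; the 100GB loss is the example used earlier):

```python
k = 29           # pieces needed to reconstruct any one segment
lost_gb = 100    # data the single node lost (example figure from this thread)

# Rebuilding each lost piece on its own means downloading k pieces per piece:
traffic_gb = lost_gb * k
print(traffic_gb)  # 2900 GB of repair traffic to restore 100 GB on one node
```

So restoring one node's data costs roughly 29x that data's size in downloads, which is why it is so much more expensive than a customer download.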


Ok, but why do we need to recreate the piece and not just download it?
This is my understanding:
We need 29 pieces to recreate a file.
So we are storing 29 pieces on nodes (x-times for redundancy).
It is 29 different pieces, but each particular piece is always the same.
So if one node reports that it is missing piece 3 from a particular file, it cannot download this particular piece from another node? This is what you are saying?

This is the part where you're wrong. Pieces aren't duplicated anywhere; every piece is unique. If your node had piece 3, it was the only node that had piece 3. To recreate it, you need at least 29 pieces again.

Edit: This is also why repair doesn’t take place with every piece lost, but when the number of pieces falls below a threshold of 35. At that point it can download 29 pieces and recreate all 55 missing pieces at once, overcoming this inefficiency problem. The difference is that this single node recovery by definition is only aimed at repairing the single piece on that single node, which is really inefficient.
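A quick back-of-the-envelope comparison, taking the numbers quoted in this post at face value (29 pieces downloaded per reconstruction, 55 pieces recreated in one batch repair):

```python
k = 29      # pieces downloaded for one reconstruction
batch = 55  # pieces recreated in a single batch repair (figure from this post)

single_node_cost = k / 1      # 29 pieces downloaded per piece restored
batch_repair_cost = k / batch # ~0.53 pieces downloaded per piece restored

print(round(single_node_cost / batch_repair_cost))  # batch repair is ~55x cheaper per piece
```

The one download of 29 pieces is amortized over every recreated piece, which is exactly the inefficiency single-node recovery cannot avoid.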


Right. This way it makes sense at least in a technical way.

You are right, the current system will download a piece to check its hash.
I tried to think how it could be expanded to the block of data :)


What if the satellite, instead of storing a hash for each data block, stored a list of salted hashes for every data block, keeping the salts secret?

When a node needs to reconfirm a piece of data, the satellite reveals one of the salts to the node. The node then must use the salt with the data block, compute a hash out of that, and return it to the satellite, which then can compare the salted hash from the node with its own salted hash.

That way, the node cannot just pre-compute a hash and delete the data, it will need to have the actual data in order to combine it with the hitherto unknown salt, and create a valid hash.
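A minimal sketch of that challenge-response idea, assuming SHA-256 and invented class names; this is not Storj's actual audit protocol:

```python
import hashlib
import os

class Satellite:
    """Precomputes salted hashes while the piece is known to be intact."""

    def __init__(self, piece: bytes, challenges: int = 10):
        self._salts = [os.urandom(16) for _ in range(challenges)]
        self._expected = [hashlib.sha256(s + piece).digest() for s in self._salts]
        self._next = 0

    def audit(self, node) -> bool:
        salt = self._salts[self._next]   # reveal one previously secret salt
        ok = node.respond(salt) == self._expected[self._next]
        self._next += 1                  # each salt is used only once
        return ok

class Node:
    def __init__(self, piece: bytes):
        self._piece = piece

    def respond(self, salt: bytes) -> bytes:
        # Answering requires the full piece on disk: the salt was unknown in
        # advance, so a pre-computed hash of the piece alone is useless.
        return hashlib.sha256(salt + self._piece).digest()

piece = os.urandom(4096)
sat = Satellite(piece)
print(sat.audit(Node(piece)))             # True  - node still holds the data
print(sat.audit(Node(os.urandom(4096))))  # False - node lost or replaced the data
```

The trade-off is that the satellite can only issue as many audits per block as salts it precomputed while the data was still trusted.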

You can read details regarding audits here: