Storagenode Recovery Mode

How about a specific allowance for backups? That is, I restore from a backup to timestamp xyz, which means that all data after xyz is lost, but all data before it should remain. So, run a bunch of audits on the older data; if they succeed, recover the newer data and put it on some other nodes. My node could go back to vetting or whatever for 1k-10k audits, etc.

There could be a maximum backup age as well (so I do not restore a 6 month old backup).

Stuff can fail; the database can get corrupted, for example. It should be possible to recover from that. That's what backups are for.

This is interesting as well.

Sorry, this question doesn't really fit here:
If the satellite asks for the hash of a piece, what prevents a storagenode from only storing the hashes and throwing away the data? Then it could succeed in all audits.
(From an earnings perspective that would still be stupid for a long time, because egress pays a lot more than storage, but once you get to a "fake" 50TB of storage on a 1TB drive, that would be something.)


This is very interesting to read. So I want to add my thoughts:

  1. From Storj's perspective, a node that has proven to be unreliable can no longer be trusted. But even if the SNO fires up a new node, the environment can be totally different and reliable in the future. Also, starting over is a form of long-term commitment. This makes it a self-selection process: a SNO who chooses to get into this commitment again is rather trustworthy. It is the node that is not trusted, not the SNO.

  2. From the perspective of other nodes, especially new nodes, I can understand if they favor a drop-out of unreliable nodes. If a node loses 10 TB of data, this data will get redistributed, so they might get a piece of this cake. I also think of the case when the satellite has already started to redistribute data to other nodes and the failed node comes back online with "his" data pieces. As I understand it, suddenly there are more pieces of a file in the network than needed. So what does that mean? It means a lower chance for other (reliable) SNOs to get a download. And maybe (but that I don't know) the satellite even deletes the excess pieces. So I don't know if this would be a fair procedure.

  3. From the perspective of the SNO, I can fully understand the desire to recover data. But I am not sure if local backups are the solution. First of all, they require space that could be used for the node. Second, they require maintenance and resources to verify data integrity in case data gets recovered from a local backup. Third, it sounds really silly: Tardigrade is advertised as a secure, redundant online data store, and the nodes keep offline backups. They don't trust Tardigrade?

That said, I believe the only way to recover a node would be to look at the node as a customer: a node that wants to recover data must download it from the Tardigrade network like a regular customer. And for data integrity, it must download its entire data. So if it wants to restore a 10 TB node, it must download "its" complete 10 TB and of course pay for it like a regular customer. (Consider it the opposite of a graceful exit: a graceful entry.)
This solves many problems:

  1. Recovery does not create costs for Storj.
  2. Other SNOs will be happy about every recovery because they get paid for the downloads.
  3. The SNO can recover a node and get their data back instead of waiting a long time.
  4. The high cost prevents SNOs from abusing the system with frequent cheap recoveries, so it is still in their interest to run a reliable node.
  5. Node data can be trusted.

Maybe some computations must be made by the satellites to mitigate the loss of pieces that have already been redistributed to other reliable nodes. (Meaning that pieces that have already been redistributed might not get restored onto the node being recovered.)

The same concept could be used for moving nodes. I always thought it is a bit silly that when you move a node to a different place, you have to move the data as well. With a recovery mode, you would simply move the identity and metadata and download the pieces from the network. At least you could choose to do so.

There is also a final thought: I am aware that data loss can happen at any time, so every SNO can be affected at any time. It could happen right now that my HD crashes and dies. If this is true, I am just wondering whether repair traffic must have a price at all. As a SNO, one day I profit from being paid for repair traffic, but another day I might profit from being able to restore my node for free or at low cost. So maybe recovery downloads should be free for Storj and seen as a shared risk among all SNOs? Basically this is like every other everyday offline insurance and could be treated as such.

The correct way would be for the satellite to ask for the piece, compute the hash itself, and compare it to the one stored in its database. If they match, the data is valid.
If the node managed to create invalid data with the correct hash, and to do it more cheaply than just storing the data, well, I think there would be much bigger problems than just being able to cheat on audits.
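To make that concrete, here is a minimal sketch of the comparison step, assuming the satellite already holds the expected hash and receives the piece bytes from the node; the function name and the use of SHA-256 are illustrative, not the actual satellite code:

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// verifyPiece hashes the bytes the node returned and compares the result
// against the hash the satellite has on record. A node that kept only the
// hash and threw the data away cannot produce bytes that hash to this value.
func verifyPiece(returned, expectedHash []byte) bool {
	sum := sha256.Sum256(returned)
	return bytes.Equal(sum[:], expectedHash)
}

func main() {
	piece := []byte("piece data as originally uploaded")
	stored := sha256.Sum256(piece) // what the satellite has on record

	fmt.Println(verifyPiece(piece, stored[:]))                   // true
	fmt.Println(verifyPiece([]byte("tampered data"), stored[:])) // false
}
```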

Here's the problem: 1) that, again, makes backups pointless, and while I would rather pay $450 to download 10 TB than start over, that seems a bit excessive. Why should I pay for Storj's profit? Would the nodes I download from get $20/TB, or $10/TB because it is repair?

But then how can you trust that, if I download everything from the network, I will keep that data? What if I download the data and only keep what I do not already have in my backup?

But my backups cannot be trusted. OK. Then my node cannot be trusted either. Even if it passed 100k audits, that doesn't mean my node has any of the other data.

In addition, what if the node fails because of a bug in the software, or something else causes db corruption? I do not have control over that. Why should I be held responsible for it then? If I modified the node and then it got corrupted, etc., OK, but what if I am running the unmodified version?

Correct. Maybe backups are pointless. Somebody has to bear the cost. For a backup, a SNO needs space, space he could otherwise rent out. For a local backup to be trusted, the satellite must verify the pieces. And still, a local backup can also fail and produce garbage.

Well, but that is what you are doing today already with the escrowed amount. I agree that this is a lot of money; however, of course it must be high enough to prevent abuse like frequent recovery operations by unreliable nodes, because somebody has to bear the cost.

Maybe the only way to trust a local backup would be if the node application itself ran and verified the backups and encrypted them with a key, so that the SNO cannot fiddle with them. Any other local way to handle backups is not trustworthy for the satellites.
Basically everything is just a matter of probability, though not in a strict mathematical sense. If you passed 100k audits, it is very likely that your node is reliable. But it is no proof. As you said, you can fiddle with the data or your hardware can blow up; even new drives can fail any second.

I think that is a good point and I think it is valid. But I am wondering whether the databases are really the problem. Compared to the real data stored, the databases are fairly small, right? I think these could be backed up hourly to some different media. But I have not thought through whether this would solve anything.

True. But if you pay for it, why would you not keep it?

If it were possible, I would run the database in a cluster while keeping logs, so I could restore to any point before the corruption.
Or maybe access to the data should not depend on the database? Even if I delete all of the database files, the actual data is still there and should be accessible to the customer.

To make the recovery faster, I could run the download script on a server with a faster connection and only keep the files I do not already have. If my node has 10 TB of data and loses 100 GB, I could download the 10 TB on a server with a 10G connection (and a 200 GB drive) while only keeping the 100 GB I do not have. Then I could move the 100 GB to my actual node using my internet connection, which is slower than 10G.
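A rough sketch of the filtering step described here, with a hypothetical blob directory layout and a placeholder download function (this is not storagenode code, just an illustration of the "keep only what is missing" idea):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// keepOnlyMissing downloads a piece only if no local blob file for it exists
// yet. The paths and the download callback are hypothetical placeholders.
func keepOnlyMissing(blobDir string, pieceIDs []string, download func(id string) ([]byte, error)) error {
	for _, id := range pieceIDs {
		path := filepath.Join(blobDir, id)
		if _, err := os.Stat(path); err == nil {
			continue // piece is already present locally, skip the download
		}
		data, err := download(id)
		if err != nil {
			return fmt.Errorf("downloading %s: %w", id, err)
		}
		if err := os.WriteFile(path, data, 0o644); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// Toy usage: the "download" here just fabricates bytes for the demo.
	_ = keepOnlyMissing(os.TempDir(), []string{"piece-a", "piece-b"},
		func(id string) ([]byte, error) { return []byte("data for " + id), nil })
	fmt.Println("done")
}
```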

  1. A zfs snapshot consumes very little space and can protect from quite a few problems. It will not protect from three drives failing in the raidz2 pool, but it would protect from database corruption and from a wrong rm -rf command.
  2. Because the data on the node does not change a lot, I could make a full backup to tape and keep the (much smaller) differentials on hard drives. I cannot run a node on a tape drive.
  3. I could use network storage or SMR drives for the backups.

Backups protect from mistakes that anyone can make. Just having more space does not.

I have outlined a way for it in the other thread.
Other than that, local backups are neither more nor less trustworthy than my current node, and especially a new node.

One use case I could see for me personally in something like this: I have my storagenode located on a ZFS pool created by Proxmox. Sadly, Proxmox, or ZoL, is using the Solaris base code, or parts of it, for ZFS. ZoL, OpenZFS, whatever, I dunno…

It runs 8k minimum block sizes; because of this, my system gets a 16x I/O amplification, because my SSDs use 512-byte sector sizes. So my old SATA SSDs, though reasonably fast and more than what I really need, cap out at 4k I/O.

To solve this I need to remake my pool. The only way to do that is if I could actually manage to move it, but currently I don't have any space of significance outside the pool. So in theory the data could be uploaded and downloaded again, even if it would take a while to upload it… maybe a week or two.
I'm sure I'll find a better way, because that is a pretty long time, but it would be pretty cool to have the option: to tell the network "I have to shut down for like 14 days while I move my datacenter" or whatever, and then it would take the data back into the cloud, allow you to move, and then when you reconnect you might have higher priority than other nodes or whatever. Getting the data back is kinda unrealistic in most cases, though.

@jammerdan local backups could be snapshots. Snapshots are a feature of some storage solutions, e.g. ZFS, which has the ability to take an instant full image of a drive; from that point on, only the changes take up new space, and the snapshots can be integrated into each other afterwards.

So say I take a snapshot at noon yesterday, and today the network started uploading corrupt data because of some kind of error. Then, by satellite request or whatever, I could roll back my snapshot and thus remove any data changed since noon yesterday, which could be useful.
There are lots of ways to utilize it. And yes, it does of course use a bit of space, but it uses the existing data as a reference map, thus only the changes take up space.

But paying $4.50 to restore the 100 GB out of 9.9 TB…
Though to be fair, I don't care about the 100 GB lost; I care that the node doesn't die and the data is wasted. And of course there should be punishment, but what if one has a snapshot that's 12 hours old and all the data in it is good, and the only problem is that it's an old db and only 99% of the data is there?
I accept that it's not something that is acceptable, but it would be a nice feature which could even save a great deal on repair jobs, which from what I understand from Alexey are very expensive.

Lots of interesting stuff about this…

OK, I was thinking more that it would require setting up a new node with a "new" identity and then downloading everything.
But maybe your answer is why nuking a node once it has failed is the best approach for Storj. If a SNO is tempted to fiddle with the data in such a case, like returning to the network in an unknown state and downloading only what is missing for personal gain, maybe it is in Storj's best interest that such a SNO has to start all over again. Maybe?

There is at least one distinction though: with a requirement for data recovery, your node has already proven to be unreliable, and I don't know how to get around that.

Do not forget that for big nodes (>10-20 TB), which have demonstrated reliability for many months or years, it will be harder to start from scratch, and some of those SNOs will leave the project.


Lots of mixed discussion going on here. Let's try to separate a few things out, starting with database corruption, which has been mentioned a couple of times.

As far as I'm aware, the databases hold only non-vital information. That data is not necessary for your node to function. Currently the only way to fail audits because of the databases is when the databases are corrupt and can't be accessed at all. These errors would fall under the unknown audit failure category and wouldn't lead to disqualification, but to suspension instead. Suspension already gives the SNO time to recover from this error by either repairing the db, restoring a working db backup, or starting over with an empty db with the right schema. So there is already a way out of db corruption that doesn't involve any lost data, and your node can be recovered. If there still is some db that is currently vital for audits (which I don't believe is the case to begin with), then THAT should be fixed. We don't need an additional method to recover from that.

That leaves two scenarios: either the node has lost part of its data, or it has lost everything.

If the node has lost part of the data, say 1%, it would be nice if it could continue on with the 99% that is still there. But I really don't understand why everyone is so bent on restoring the missing data. The only way to restore that data is by repair, and most of those pieces would not have hit the repair threshold. So that means wasting money on downloading 29 pieces to restore each single piece on your node. It's incredibly wasteful, and the obvious option is much simpler: the node reports to the satellite what data is lost, the satellite marks the pieces as lost, and everyone moves on. The node would be punished by missing data that it used to have and the related income from that data. But the tradeoff is that it will no longer get audited for that data and gets to live on. Restoring the lost data shouldn't even be considered in this scenario, as it is of literally no benefit to the network. Repair will trigger automatically for pieces that need repair, through the existing systems.
Now this leads to a few challenges:

  • How does the node determine which pieces were lost, without super expensive checks from the satellite?
  • How do we prevent nodes from using this system to cover actual weaknesses in the hardware? If an HDD starts failing, the node can scan the HDD and keep reporting new pieces as lost, while in that scenario the entire node should no longer be trusted.
  • How do you prevent cheating with this system by the node simply reporting all files that are audited as lost?
  • How do we pay for the repair that eventually needs to happen because of this? Held amount? Do we build up part of that held amount again?

The last scenario is that the node lost all files. Restoring this from the network runs into the same problem again: it's highly costly. It means downloading 29 pieces to recover 1. So the previous $450 for 10 TB wouldn't even get close to covering the cost. And sure, you can use that repair to restore more pieces, but a lot of these segments would still likely have more than enough pieces on the network to not be even close to needing repair. So there is no remotely affordable way to restore the data. This node has also shown not to be reliable. It can claim that the problem was fixed, but we can't simply trust that, so it needs to be vetted again. The data was also lost, so we need to use the held amount for repairs. So what would we need to do to keep this node in the network? Empty it, vet it again, use the held amount and build up a new held amount. So… basically start a new node. The choice would be: use the same identity and have that failure hanging over your head, or just start over. I'd say SNOs are better off just starting over.
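To put rough numbers on that (the $10/TB repair payout and the 29-pieces-per-rebuild figure are just the ones quoted earlier in this thread; this is a back-of-the-envelope sketch, not an official cost model):

```go
package main

import "fmt"

func main() {
	const (
		lostTB       = 10.0 // a node that lost all 10 TB of its pieces
		piecesPerFix = 29.0 // pieces downloaded to recreate one lost piece
		repairRate   = 10.0 // assumed $/TB paid to nodes for repair egress (figure from this thread)
	)
	// Rebuilding each lost piece individually means downloading ~29x its size.
	downloadTB := lostTB * piecesPerFix
	fmt.Printf("download volume: %.0f TB, node payouts alone: $%.0f\n",
		downloadTB, downloadTB*repairRate)
	// Prints: download volume: 290 TB, node payouts alone: $2900
}
```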


But then shouldn't my email, wallet and IP address be banned too? If I lose 100GB of data and delete the rest, I have proven that I cannot run a node correctly.

But what about the alternatives I wrote about in the other thread (I think we should choose one of the threads and concentrate the discussion there)?

I guess GET_AUDIT requests the whole piece; that would be the only way to reliably prove that the node is storing data correctly and not just storing hashes.

Alexey was probably referring to something theoretical about requesting only parts of a piece (as a response to SGC), and I mistook this information for the general principle of audits.

Hehe, hindsight is a wonderful thing, but yeah, I fully agree with most of your assessments.
And well, my ZFS could of course tell me, or provide me a list of, what files have been affected in case it is unable to restore them.
Sadly I haven't managed to test this feature yet… haven't hit it hard enough yet… kinda expected it to die before now. xD

Yes, shall we stay here?

I am not really familiar with GE (graceful exit). It is my understanding that the data gets downloaded from the node and redistributed to other nodes on the network? Does it create costs for Storj?
I don't think the data downloaded from the initiating node gets paid for, and ingress to other nodes would be free. But I believe the data must be verified before being uploaded to some other host, wouldn't it?

AFAIK, during GE, the exiting node uploads its data directly to other nodes or something. The idea is to avoid paying other nodes for GET_REPAIR. The exiting node does not get paid for egress, but gets the escrow money back.

@kevink
No, not really… The satellite sends an algorithm to three or more nodes, with a certain set of pieces held by all of them, and the nodes are asked by the sat to apply it to the data pieces in question.

Here is my idea thus far…
It gives back a checksum created from reading all the data and doing some basic math, taking up 1/100000000000 or whatever of the original data. The satellite then, after some hours or whatever (job-size dependent, of course) have passed, gets back the checksums and compares them. Then it maybe stores the created checksum, or in some way uses the processing the nodes did to make the checksums to build a sort of checksum puzzle that the nodes cannot figure out, because:

1) they don't know the algorithm they need to apply, and they don't know the pieces that will be selected; and because of the many variations, one cannot prepare for this in advance without having the actual data, which works as a focal lens to verify the integrity of the whole.

That way the satellites just need to keep a map of sorts…

There is no real usage of bandwidth,

and all data on the nodes could be verified. If one nests and scales it in the right ways, it is very likely that one would also be able to identify the exact pieces that are broken… but for right now that is outside the scope of what I can do on the back of a napkin of conceptual storjnerding.
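Just to illustrate the shape of that idea (this is not how Storj audits actually work; the nonce, the choice of an HMAC, and the function names are all assumptions made for the sketch): the satellite could send a random challenge, the node would have to fold every byte of the selected pieces into a keyed hash, and only a tiny digest travels back.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// challengeResponse sketches the node-side computation: without actually
// holding the piece bytes, a node cannot produce the right digest for a
// nonce it has never seen before, yet the answer is only 32 bytes.
// The open question (as noted above) is how the satellite knows which
// digest to expect without holding the data itself.
func challengeResponse(nonce []byte, pieces [][]byte) string {
	mac := hmac.New(sha256.New, nonce)
	for _, piece := range pieces {
		mac.Write(piece)
	}
	return hex.EncodeToString(mac.Sum(nil))
}

func main() {
	pieces := [][]byte{[]byte("piece-1 data"), []byte("piece-2 data")}
	fmt.Println(challengeResponse([]byte("random-nonce-from-satellite"), pieces))
}
```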

Can you explain this further? How can this be more expensive than a customer downloading data from Tardigrade?

Because a customer trying to download a file needs all the data from that file: they download 29 pieces to retrieve basically 100% of that size. But one node only holds one piece; to recreate that single piece, you still need to download 29 pieces to rebuild it.

Think of it as parity. If you have a RAID5 array with 5 disks and you read a 4 MB file, you just read 4 MB. But if you want to rebuild the parity information for that file, that parity information is only 1 MB, yet you still need to read all 4 MB to create it. (I'm guessing this is one of the few places where this analogy is helpful; let me know if it's not, and I can try to do better.)


OK, but why do we need to recreate the piece and not just download it?
This is my understanding:
We need 29 pieces to recreate a file.
So we are storing 29 pieces on nodes (x times, for redundancy).
They are 29 different pieces, but each individual piece is always the same.
So if one node reports that it is missing piece 3 of a particular file, it cannot download this particular piece from another node? Is that what you are saying?

This is the part where you're wrong. Pieces aren't duplicated anywhere; every piece is unique. If your node had piece 3, it was the only node that had piece 3. To recreate it, you need at least 29 pieces again.

Edit: This is also why repair doesn't take place with every piece lost, but when the number of pieces falls below a threshold of 35. At that point it can download 29 pieces and recreate all 55 missing pieces at once, overcoming this inefficiency problem. The difference is that this single-node recovery by definition is only aimed at repairing the single piece on that single node, which is really inefficient.
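A quick comparison of the two cases, using only the numbers quoted above (29 pieces needed for a rebuild, 55 pieces recreated in one threshold repair); these figures come from this thread and are not meant as authoritative Reed-Solomon parameters:

```go
package main

import "fmt"

func main() {
	const piecesNeeded = 29.0 // pieces downloaded for any rebuild

	// Case A: one node wants its single lost piece back.
	fmt.Printf("single-piece restore: %.2f downloads per recovered piece\n",
		piecesNeeded/1)

	// Case B: repair triggers at the threshold and recreates the whole
	// batch of missing pieces (55 in the example above) in one pass.
	const recreated = 55.0
	fmt.Printf("threshold repair:     %.2f downloads per recovered piece\n",
		piecesNeeded/recreated)
}
```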
