Storagenode Recovery Mode

How about a specific allowance for backups? That is, I restore from a backup to timestamp xyz, which means that all data after xyz is lost, but all data before it should remain. So, run a bunch of audits on the older data; if they succeed, recover the newer data and put it on some other nodes. My node could go back to vetting or whatever for 1k-10k audits, etc.

There could be a maximum backup age as well (so I do not restore a 6 month old backup).

Stuff can fail; the database can get corrupted, for example. It should be possible to recover from that. That's what backups are for.

This is interesting as well.

Sorry, this question doesn't really fit here:
If the satellite asks for the hash of a piece, what prevents a storagenode from only storing the hashes and throwing away the data? Then it could succeed in all audits.
(From an earnings perspective that would still be stupid for a long time, because egress pays a lot more than storage, but once you get to a "fake" 50TB of storage on a 1TB drive, that would be something.)


This is very interesting to read. So I want to add my thoughts:

  1. From Storj's perspective, a node that has proven to be unreliable can no longer be trusted. But even if the SNO fires up a new node, the environment can be totally different and reliable in the future. Also, starting over is a form of long-term commitment. This makes it a self-selection process: a SNO who chooses to get into this commitment again is rather trustworthy. It is the node that is not trusted, not the SNO.

  2. From the perspective of other nodes, especially new nodes, I can understand if they favor a drop-out of unreliable nodes. If a node loses 10 TB of data, this data will get redistributed, so they might get a piece of this cake. I also think of the case when the satellite has already started to redistribute data to other nodes and the failed node comes back online with "his" data pieces. As I understand it, suddenly there are more pieces of a file in the network than needed. So what does that mean? It means a lower chance for other (reliable) SNOs to get a download. And maybe (but that I don't know) the satellite even deletes the excess pieces. So I don't know if this would be a fair procedure.

  3. From the perspective of the SNO, I can fully understand the desire to recover data. But I am not sure if local backups are the solution. First of all, they require space that could be used for the node. Second, they require maintenance and resources to verify data integrity in case data gets recovered from a local backup. Third, it sounds really silly: Tardigrade is advertised as a secure, redundant online data store, and the nodes keep offline backups. They don't trust Tardigrade?

That said, I believe the only way to recover a node would be to look at the node as a customer: a node that wants to recover data must download it from the Tardigrade network like a regular customer. And for data integrity, it must download its entire data. So if it wants to restore a 10 TB node, it must download "its" complete 10 TB and of course pay for it like a regular customer. (Consider it the opposite of a graceful exit: a graceful entry.)
This solves many problems:

  1. Recovery does not create costs for Storj.
  2. Other SNOs will be happy about every recovery because they get paid for the downloads.
  3. The SNO can recover a node and get their data back instead of waiting a long time.
  4. The high cost prevents SNOs from abusing the system with frequent cheap recoveries, so it is still in their interest to run a reliable node.
  5. Node data can be trusted.

Maybe some computations must be made by the satellites to mitigate the loss of pieces that have already been redistributed to other reliable nodes. (Meaning that pieces that have already been redistributed might not get restored onto the node being recovered.)

The same concept could be used for moving nodes. I always thought it is a bit silly that when you move a node to a different place, you have to move the data as well. With a recovery mode, you would simply move the identity and metadata and download the pieces from the network. At least you could choose to do so.

There is also a final thought: I am aware that data loss can happen at any time, so every SNO can be affected at any time. It could happen right now that my HD crashes and dies. If this is true, I am just wondering whether repair traffic must have a price at all. As a SNO, one day I profit from being paid for repair traffic, but another day I might profit from being able to restore my node for free or at low cost. So maybe recovery downloads should be free for Storj and seen as a shared risk among all SNOs? Basically this is like every other everyday offline insurance and could be treated as such.

The correct way would be for the satellite to ask for the piece, compute the hash itself, and compare it to the one stored in its database. If they match, the data is valid.
If the node managed to create invalid data with the correct hash, and to do it more cheaply than just storing the data, well, I think there would be much bigger problems than just being able to cheat on audits.
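To make that concrete, here is a minimal sketch of the comparison step, assuming the satellite already holds the expected hash and receives the piece bytes from the node; the function name and the use of SHA-256 are illustrative, not the actual satellite code:

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// verifyPiece hashes the bytes the node returned and compares the result
// against the hash the satellite has on record. A node that kept only the
// hash and threw the data away cannot produce bytes that hash to this value.
func verifyPiece(returned, expectedHash []byte) bool {
	sum := sha256.Sum256(returned)
	return bytes.Equal(sum[:], expectedHash)
}

func main() {
	piece := []byte("piece data as originally uploaded")
	stored := sha256.Sum256(piece) // what the satellite has on record

	fmt.Println(verifyPiece(piece, stored[:]))                   // true
	fmt.Println(verifyPiece([]byte("tampered data"), stored[:])) // false
}
```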

Here's the problem: 1) that, again, makes backups pointless, and while I would rather pay $450 to download 10 TB than start over, that seems a bit excessive. Why should I pay for Storj's profit? Would the nodes I download from get $20/TB, or $10/TB because it is repair?

But then how can you trust that, if I download everything from the network, I will keep that data? What if I download the data and only keep what I do not already have in my backup?

But my backups cannot be trusted. OK. Then my node cannot be trusted either. Even if it passed 100k audits, that doesn't mean my node has any of the other data.

In addition, what if the node fails because of a bug in the software, or something else causes db corruption? I do not have control over that. Why should I be held responsible for it then? If I modified the node and then it got corrupted, etc., OK, but what if I am running the unmodified version?

Correct. Maybe backups are pointless. Somebody has to bear the cost. For a backup, a SNO needs space, space he could otherwise rent out. For a local backup to be trusted, the satellite must verify the pieces. And still, a local backup can also fail and produce garbage.

Well, but that is what you are doing today already with the escrowed amount. I agree that this is a lot of money; however, of course it must be high enough to prevent abuse like frequent recovery operations by unreliable nodes, because somebody has to bear the cost.

Maybe the only way to trust a local backup would be if the node application itself ran and verified the backups and encrypted them with a key, so that the SNO cannot fiddle with them. Any other local way to handle backups is not trustworthy for the satellites.
Basically everything is just a matter of probability, though not in a strict mathematical sense. If you passed 100k audits, it is very likely that your node is reliable. But it is no proof. As you said, you can fiddle with the data or your hardware can blow up; even new drives can fail any second.

I think that is a good point and I think it is valid. But I am wondering whether the databases are really the problem. Compared to the real data stored, the databases are fairly small, right? I think these could be backed up hourly to some different media. But I have not thought through whether this would solve anything.

True. But if you pay for it, why would you not keep it?

If it were possible, I would run the database in a cluster while keeping logs, so I could restore to any point before the corruption.
Or maybe access to the data should not depend on the database? Even if I delete all of the database files, the actual data is still there and should be accessible to the customer.

To make the recovery faster, I could run the download script on a server with a faster connection and only keep the files I do not already have. If my node has 10 TB of data and loses 100 GB, I could download the 10 TB on a server with a 10G connection (and a 200 GB drive) while only keeping the 100 GB I do not have. Then I could move the 100 GB to my actual node using my internet connection, which is slower than 10G.
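A rough sketch of the filtering step described here, with a hypothetical blob directory layout and a placeholder download function (this is not storagenode code, just an illustration of the "keep only what is missing" idea):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// keepOnlyMissing downloads a piece only if no local blob file for it exists
// yet. The paths and the download callback are hypothetical placeholders.
func keepOnlyMissing(blobDir string, pieceIDs []string, download func(id string) ([]byte, error)) error {
	for _, id := range pieceIDs {
		path := filepath.Join(blobDir, id)
		if _, err := os.Stat(path); err == nil {
			continue // piece is already present locally, skip the download
		}
		data, err := download(id)
		if err != nil {
			return fmt.Errorf("downloading %s: %w", id, err)
		}
		if err := os.WriteFile(path, data, 0o644); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// Toy usage: the "download" here just fabricates bytes for the demo.
	_ = keepOnlyMissing(os.TempDir(), []string{"piece-a", "piece-b"},
		func(id string) ([]byte, error) { return []byte("data for " + id), nil })
	fmt.Println("done")
}
```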

  1. A zfs snapshot consumes very little space and can protect from quite a few problems. It will not protect from three drives failing in the raidz2 pool, but it would protect from database corruption and from a wrong rm -rf command.
  2. Because the data on the node does not change a lot, I could make a full backup to tape and keep the (much smaller) differentials on hard drives. I cannot run a node on a tape drive.
  3. I could use network storage or SMR drives for the backups.

Backups protect from mistakes that anyone can make. Just having more space does not.

I have outlined a way for it in the other thread.
Other than that, local backups are neither more nor less trustworthy than my current node, and especially a new node.

One use case I could see for me personally in something like this: I have my storagenode located on a ZFS pool created by Proxmox. Sadly, Proxmox, or ZoL, is using the Solaris base code, or parts of it, for ZFS. ZoL, OpenZFS, whatever, I dunno…

It runs 8k minimum block sizes; because of this, my system gets a 16x I/O amplification, because my SSDs use 512-byte sector sizes. So my old SATA SSDs, though reasonably fast and more than what I really need, cap out at 4k I/O.

To solve this I need to remake my pool. The only way to do that is if I could actually manage to move it, but currently I don't have any space of significance outside the pool. So in theory the data could be uploaded and downloaded again, even if it would take a while to upload it… maybe a week or two.
I'm sure I'll find a better way, because that is a pretty long time, but it would be pretty cool to have the option: to tell the network "I have to shut down for like 14 days while I move my datacenter" or whatever, and then it would take the data back into the cloud, allow you to move, and then when you reconnect you might have higher priority than other nodes or whatever. Getting the data back is kinda unrealistic in most cases, though.

@jammerdan local backups could be snapshots. Snapshots are a feature of some storage solutions, e.g. ZFS, which has the ability to take an instant full image of a drive; from that point on, only the changes take up new space, and the snapshots can be integrated into each other afterwards.

So say I take a snapshot at noon yesterday, and today the network started uploading corrupt data because of some kind of error. Then, by satellite request or whatever, I could roll back my snapshot and thus remove any data changed since noon yesterday, which could be useful.
There are lots of ways to utilize it. And yes, it does of course use a bit of space, but it uses the existing data as a reference map, thus only the changes take up space.

But paying $4.50 to restore the 100 GB out of 9.9 TB…
Though to be fair, I don't care about the 100 GB lost; I care that the node doesn't die and the data is wasted. And of course there should be punishment, but what if one has a snapshot that's 12 hours old and all the data in it is good, and the only problem is that it's an old db and only 99% of the data is there?
I accept that it's not something that is acceptable, but it would be a nice feature which could even save a great deal on repair jobs, which from what I understand from Alexey are very expensive.

Lots of interesting stuff about this…

OK, I was thinking more that it would require setting up a new node with a "new" identity and then downloading everything.
But maybe your answer is why nuking a node once it has failed is the best approach for Storj. If a SNO is tempted to fiddle with the data in such a case, like returning to the network in an unknown state and downloading only what is missing for personal gain, maybe it is in Storj's best interest that such a SNO has to start all over again. Maybe?

There is at least one distinction though: with a requirement for data recovery, your node has already proven to be unreliable, and I don't know how to get around that.

Do not forget that for big nodes (>10-20 TB), which have demonstrated reliability for many months or years, it will be harder to start from scratch, and some of those SNOs will leave the project.


Lots of mixed discussion going on here. Let's try to separate a few things out, starting with database corruption, which has been mentioned a couple of times.

As far as I'm aware, the databases hold only non-vital information. That data is not necessary for your node to function. Currently the only way to fail audits because of the databases is when the databases are corrupt and can't be accessed at all. These errors would fall under the unknown audit failure category and wouldn't lead to disqualification, but to suspension instead. Suspension already gives the SNO time to recover from this error by either repairing the db, restoring a working db backup, or starting over with an empty db with the right schema. So there is already a way out of db corruption that doesn't involve any lost data, and your node can be recovered. If there still is some db that is currently vital for audits (which I don't believe is the case to begin with), then THAT should be fixed. We don't need an additional method to recover from that.

That leaves two scenarios: either the node has lost part of its data, or it has lost everything.

If the node has lost part of the data, say 1%, it would be nice if it could continue on with the 99% that is still there. But I really don't understand why everyone is so bent on restoring the missing data. The only way to restore that data is by repair, and most of those pieces would not have hit the repair threshold. So that means wasting money on downloading 29 pieces to restore each single piece on your node. It's incredibly wasteful, and the obvious option is much simpler: the node reports to the satellite what data is lost, the satellite marks the pieces as lost, and everyone moves on. The node would be punished by missing data that it used to have and the related income from that data. But the tradeoff is that it will no longer get audited for that data and gets to live on. Restoring the lost data shouldn't even be considered in this scenario, as it is of literally no benefit to the network. Repair will trigger automatically for pieces that need repair, through the existing systems.
Now this leads to a few challenges:

  • How does the node determine which pieces were lost, without super expensive checks from the satellite?
  • How do we prevent nodes from using this system to cover actual weaknesses in the hardware? If an HDD starts failing, the node can scan the HDD and keep reporting new pieces as lost, while in that scenario the entire node should no longer be trusted.
  • How do you prevent cheating with this system by the node simply reporting all files that are audited as lost?
  • How do we pay for the repair that eventually needs to happen because of this? Held amount? Do we build up part of that held amount again?

The last scenario is that the node lost all files. Restoring this from the network runs into the same problem again: it's highly costly. It means downloading 29 pieces to recover 1. So the previous $450 for 10 TB wouldn't even get close to covering the cost. And sure, you can use that repair to restore more pieces, but a lot of these segments would still likely have more than enough pieces on the network to not be even close to needing repair. So there is no remotely affordable way to restore the data. This node has also shown not to be reliable. It can claim that the problem was fixed, but we can't simply trust that, so it needs to be vetted again. The data was also lost, so we need to use the held amount for repairs. So what would we need to do to keep this node in the network? Empty it, vet it again, use the held amount and build up a new held amount. So… basically start a new node. The choice would be: use the same identity and have that failure hanging over your head, or just start over. I'd say SNOs are better off just starting over.
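To put rough numbers on that (the $10/TB repair payout and the 29-pieces-per-rebuild figure are just the ones quoted earlier in this thread; this is a back-of-the-envelope sketch, not an official cost model):

```go
package main

import "fmt"

func main() {
	const (
		lostTB       = 10.0 // a node that lost all 10 TB of its pieces
		piecesPerFix = 29.0 // pieces downloaded to recreate one lost piece
		repairRate   = 10.0 // assumed $/TB paid to nodes for repair egress (figure from this thread)
	)
	// Rebuilding each lost piece individually means downloading ~29x its size.
	downloadTB := lostTB * piecesPerFix
	fmt.Printf("download volume: %.0f TB, node payouts alone: $%.0f\n",
		downloadTB, downloadTB*repairRate)
	// Prints: download volume: 290 TB, node payouts alone: $2900
}
```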


But then shouldn't my email, wallet and IP address be banned too? If I lose 100GB of data and delete the rest, I have proven that I cannot run a node correctly.

But what about the alternatives I wrote about in the other thread (I think we should choose one of the threads and concentrate the discussion there)?

I guess GET_AUDIT requests the whole piece; that would be the only way to reliably prove that the node is storing data correctly and not just storing hashes.

Alexey was probably referring to something theoretical about requesting only parts of a piece (as a response to SGC), and I mistook this information for the general principle of audits.

Hehe, hindsight is a wonderful thing, but yeah, I fully agree with most of your assessments.
And well, my ZFS could of course tell me, or provide me a list of, what files have been affected in case it is unable to restore them.
Sadly I haven't managed to test this feature yet… haven't hit it hard enough yet… kinda expected it to die before now. xD

Yes, shall we stay here?

I am not really familiar with GE (graceful exit). It is my understanding that the data gets downloaded from the node and redistributed to other nodes on the network? Does it create costs for Storj?
I don't think the data downloaded from the initiating node gets paid for, and ingress to other nodes would be free. But I believe the data must be verified before being uploaded to some other host, wouldn't it?

AFAIK, during GE, the exiting node uploads its data directly to other nodes or something. The idea is to avoid paying other nodes for GET_REPAIR. The exiting node does not get paid for egress, but gets the escrow money back.

@kevink
No, not really… The satellite sends an algorithm to three or more nodes, with a certain set of pieces held by all of them, and the nodes are asked by the sat to apply it to the data pieces in question.

Here is my idea thus far…
It gives back a checksum created from reading all the data and doing some basic math, taking up 1/100000000000 or whatever of the original data. The satellite then, after some hours or whatever (job-size dependent, of course) have passed, gets back the checksums and compares them. Then it maybe stores the created checksum, or in some way uses the processing the nodes did to make the checksums to build a sort of checksum puzzle that the nodes cannot figure out, because:

1) they don't know the algorithm they need to apply, and they don't know the pieces that will be selected; and because of the many variations, one cannot prepare for this in advance without having the actual data, which works as a focal lens to verify the integrity of the whole.

That way the satellites just need to keep a map of sorts…

There is no real usage of bandwidth,

and all data on the nodes could be verified. If one nests and scales it in the right ways, it is very likely that one would also be able to identify the exact pieces that are broken… but for right now that is outside the scope of what I can do on the back of a napkin of conceptual storjnerding.
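Just to illustrate the shape of that idea (this is not how Storj audits actually work; the nonce, the choice of an HMAC, and the function names are all assumptions made for the sketch): the satellite could send a random challenge, the node would have to fold every byte of the selected pieces into a keyed hash, and only a tiny digest travels back.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// challengeResponse sketches the node-side computation: without actually
// holding the piece bytes, a node cannot produce the right digest for a
// nonce it has never seen before, yet the answer is only 32 bytes.
// The open question (as noted above) is how the satellite knows which
// digest to expect without holding the data itself.
func challengeResponse(nonce []byte, pieces [][]byte) string {
	mac := hmac.New(sha256.New, nonce)
	for _, piece := range pieces {
		mac.Write(piece)
	}
	return hex.EncodeToString(mac.Sum(nil))
}

func main() {
	pieces := [][]byte{[]byte("piece-1 data"), []byte("piece-2 data")}
	fmt.Println(challengeResponse([]byte("random-nonce-from-satellite"), pieces))
}
```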

Can you explain this further? How can this be more expensive than a customer downloading data from Tardigrade?

Because a customer trying to download a file needs all the data from that file: they download 29 pieces to retrieve basically 100% of that size. But one node only holds one piece; to recreate that single piece, you still need to download 29 pieces to rebuild it.

Think of it as parity. If you have a RAID5 array with 5 disks and you read a 4 MB file, you just read 4 MB. But if you want to rebuild the parity information for that file, that parity information is only 1 MB, yet you still need to read all 4 MB to create it. (I'm guessing this is one of the few places where this analogy is helpful; let me know if it's not, and I can try to do better.)


OK, but why do we need to recreate the piece and not just download it?
This is my understanding:
We need 29 pieces to recreate a file.
So we are storing 29 pieces on nodes (x times, for redundancy).
They are 29 different pieces, but each individual piece is always the same.
So if one node reports that it is missing piece 3 of a particular file, it cannot download this particular piece from another node? Is that what you are saying?

This is the part where you're wrong. Pieces aren't duplicated anywhere; every piece is unique. If your node had piece 3, it was the only node that had piece 3. To recreate it, you need at least 29 pieces again.

Edit: This is also why repair doesn't take place with every piece lost, but when the number of pieces falls below a threshold of 35. At that point it can download 29 pieces and recreate all 55 missing pieces at once, overcoming this inefficiency problem. The difference is that this single-node recovery by definition is only aimed at repairing the single piece on that single node, which is really inefficient.
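A quick comparison of the two cases, using only the numbers quoted above (29 pieces needed for a rebuild, 55 pieces recreated in one threshold repair); these figures come from this thread and are not meant as authoritative Reed-Solomon parameters:

```go
package main

import "fmt"

func main() {
	const piecesNeeded = 29.0 // pieces downloaded for any rebuild

	// Case A: one node wants its single lost piece back.
	fmt.Printf("single-piece restore: %.2f downloads per recovered piece\n",
		piecesNeeded/1)

	// Case B: repair triggers at the threshold and recreates the whole
	// batch of missing pieces (55 in the example above) in one pass.
	const recreated = 55.0
	fmt.Printf("threshold repair:     %.2f downloads per recovered piece\n",
		piecesNeeded/recreated)
}
```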
