Storagenode Recovery Mode

it's not like we want some unfair advantage, we are saying stuff can go bad… and throwing away 99% good data can't be fun for either party… nobody wins from that aside from "new" "secretly unreliable" storagenodes coming into the network.

This is starting to look like a plan and could be submitted as an idea.
The second problem: I, as a satellite operator, must meet the SLA for my customers. Your node has proven unreliable. How do I mitigate the risk that your node will fail again and I will have to recover files once more?
Why not choose a node that has not failed during the same period of time?

I, as a SNO, have also proven unreliable. Can you trust my new node? Maybe it will accumulate 10TB of data and lose 100GB again?

Maybe, but it would lose only 5% of the data, and maybe it could fail again in the future. But not now, and not with a large amount of data.

i'm fine with there being punishment for losing data… it will take a lot of work to get up to 99% as well…
so not losing what is gained might be very favorable…

So it seems you will start with vetting again :slight_smile:
And how is that different from the current situation?

  1. You lose money (as the held amount in the current version, or as paying in advance for the possibility of getting your data back).
  2. You will start again with zero reputation and go through vetting to prove that your node can work (as with a new node in the current version).

I will not have to wait a year to get 10TB of data.

Valid point :slight_smile:

i refuse to believe throwing away 99% good data can help your data reliability for customers, isn't the data supposed to be repairable… repair seems like the perfect thing to do for a 1% loss…

one day SNOs might be attacked by a virus targeted at shutting down enough nodes to damage the network… nothing in this world is permanent, no storage solution is perfect…

but to make a storage system function at its best possible efficiency one needs to utilize as much data as possible to predict and avoid loss of data… in whatever form it takes…

there is also a big difference between losing 1% of data and having random corruption across all the data and injecting that into the network…
dropping a big block can happen… even for professionals

The problem is: if your node lost some amount of data, there is no practical way to prove that you have not altered the remaining part. Checking every piece is too expensive, even more expensive than just recovering the data at some point in the future.
However, I believe that 1% would not disqualify your node. Every piece that fails an audit is marked as lost, and when the number of healthy pieces falls below the threshold, the repair job is triggered.
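
Roughly, the decision would look like this. A minimal sketch in Go with made-up numbers (`totalPieces`, `repairThreshold` and the function are my own illustration, not the satellite's actual code or values):

```go
package main

import "fmt"

const (
	totalPieces     = 80 // erasure-coded pieces per segment (illustrative value)
	repairThreshold = 52 // repair once healthy pieces drop to this level (illustrative value)
)

// needsRepair mimics the threshold check: pieces lost to failed audits
// reduce the healthy count until repair is triggered.
func needsRepair(healthyPieces int) bool {
	return healthyPieces <= repairThreshold
}

func main() {
	lost := 3 // pieces marked as lost by failed audits
	fmt.Println(needsRepair(totalPieces - lost)) // false: 77 healthy is still above 52
}
```

So a small loss leaves the data comfortably above the repair threshold; only sustained losses would trigger the repair job.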

OK on randomly lost files, but what about a backup? If I restore a day-old backup, all the old data should still be there, and you have no more or less proof of that than you have now.

How can the satellite be sure of that? Only by the node's word?
It can't check all the data in a reasonable amount of time.

How does it know now?

By audits over time.
But you are suggesting trusting the node immediately after it has lost data.

it's too expensive to test over the internet, yes… but then you are saying you don't trust the SNO, which would only be a problem if they got new IDs whenever they made a new node… :smiley:

ofc a SNO needs to be reliable and trustworthy, without that the whole concept falls apart…
but audits check data… and will catch errors, ofc they will only check what is used… the rest one checks locally with checksums and such…

like today i injected 4GB of zeros into two drives on my system… ended up having to do a reinstall and a scrub to restore my storagenode data, but errors will happen… errors happen all the time…
but because of a proper setup it could restore them with ease, so my data is back to 100%
and not a byte lost… but i could lose data… punishing top-tier nodes on the same level as a random newly joined node doesn't make sense…
and data will change… data changes on regular hdds over extended periods, they cannot be trusted to store data… if you cannot allow corruption in the data, you cannot allow SNOs to store on single drives.

With a new ID they will not have 10TB of data of unknown quality.
The risk is much, much lower, by almost 10^13 times :slight_smile:

@Pentium100
I think your idea could be submitted to the ideas category.

You could be allowed to recover missing files, but you should pay in advance an amount sufficient to recover all the data. The held amount could reduce the amount to pay.

There is still one problem: how to determine which data is lost. Auditing all the data takes a long time, so neither the satellite nor the node knows which data is lost right now and should be recovered using the held amount.

Hm, this becomes a separate idea: take a fee for each failed audit :thinking:
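
As a toy illustration with made-up numbers (neither the fee nor the held amount here reflects real Storj pricing or policy):

```go
package main

import "fmt"

func main() {
	heldAmount := 50.0        // USD withheld from the node (hypothetical)
	feePerFailedAudit := 0.25 // USD charged per lost piece (hypothetical)
	failedAudits := 40        // audits failed so far

	// Each failed audit eats into the held amount instead of
	// disqualifying the node outright.
	remaining := heldAmount - feePerFailedAudit*float64(failedAudits)
	fmt.Printf("held amount left after fees: $%.2f\n", remaining) // $40.00
}
```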

checksums man checksums

pretty sure my zfs doesn't throw big chunks of data out because they are bad… it tries to correct them…
by having tiered checksums of everything it can detect where errors are coming from… i understand the storj system must have some sort of live data block living among the SNOs, but such a block would be bandwidth-limited… it can only move around with a certain speed between the SNOs and thus can be killed by unreliable storagenodes… which is why i suspect it has to work like that…

but that doesn't mean that a perfectly good storagenode cannot be useful just because you don't believe it's possible to verify the data… i dunno that it is, but i do know that there are very few things in this world that are truly impossible, if any at all…

verifying node data would just require it to be some sort of checksum game… the satellite knows which pieces you have and asks you to apply some sort of math on them to verify that they are still good… because it's only an equation being sent, and the data is processed and computed locally with a result sent back, it doesn't require vast bandwidth…
and to lower computational needs for the satellites they would calculate the stuff based on the larger pieces and use them randomly or something…
it's a bit of a puzzle… but i'm pretty sure it could be done…
storj V4 heheh
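
Something like this rough sketch: the satellite sends a fresh random nonce, the node hashes the requested piece together with that nonce, and only the small digest travels back over the wire. Names and flow are my own illustration, not the actual Storj audit protocol:

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

// nodeRespond simulates the node side: hash nonce || piece locally,
// so only 32 bytes ever cross the network.
func nodeRespond(nonce, piece []byte) [32]byte {
	msg := append(append([]byte{}, nonce...), piece...)
	return sha256.Sum256(msg)
}

func main() {
	piece := []byte("piece data held by the storagenode")

	// Satellite side: a fresh nonce per challenge, so the node cannot
	// replay a precomputed answer after losing the data.
	nonce := make([]byte, 16)
	if _, err := rand.Read(nonce); err != nil {
		panic(err)
	}

	// Anyone who also holds the piece can compute the expected answer
	// and compare it with the node's response.
	expected := sha256.Sum256(append(append([]byte{}, nonce...), piece...))
	fmt.Println("piece verified:", nodeRespond(nonce, piece) == expected)
}
```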

It’s uses a crypto alternative of the checksums already by the way. This is still audit process. Which is a long and expensive operation.
You can’t check all pieces for the reasonable amount of time. It will take hours (we have an internet channel there). Exclusively for the one node.
I didn’t say that it is impossible. It’s expensive. In some cases it’s the same :slight_smile:

And you described almost exactly how the audit process works.
However, that doesn't speed up the process; it's still long and expensive.

well you scale it into hypersize checksum blocks… i just scanned and corrected my 10TB storage pool
sure it costs locally, yes, checking each block individually, but saying… hey node, you've got these 100k blocks named such-and-such… give me a checksum based on this random algorithm or whatever it's called.
then the online exchange is minimal, even if the satellite would need to have some sort of reference map of it… but it could essentially use the other nodes to build that… so it would compare checksums from both to evaluate whether the hyper block / piece is verified good.

Unfortunately, pieces are not tied to each other on a single node. That's the purpose of the IP filter, by the way: your node must not hold more than one piece of the same file. The satellite only has addresses (pointers) to where the pieces are located, but has no idea about the information inside the pieces. So it can ask for the hash of an exact piece, which should be on an exact storagenode, but not of a whole block.

I believe it could group them by node in the database and calculate a hash of hashes (?), but that would be additional computational work. And all that for only one node? Out of thousands? And what if there are hundreds of such nodes?
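
A sketch of what that "hash of hashes" could look like, purely as an illustration (the satellite's real schema and audit code are nothing this simple):

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// hashOfHashes combines per-piece hashes into one digest per node:
// sort for a deterministic order, then hash the concatenation.
func hashOfHashes(pieceHashes [][32]byte) [32]byte {
	sort.Slice(pieceHashes, func(i, j int) bool {
		return string(pieceHashes[i][:]) < string(pieceHashes[j][:])
	})
	h := sha256.New()
	for _, ph := range pieceHashes {
		h.Write(ph[:])
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

func main() {
	// Per-piece hashes the satellite would have to collect for one node.
	hashes := [][32]byte{
		sha256.Sum256([]byte("piece-1")),
		sha256.Sum256([]byte("piece-2")),
		sha256.Sum256([]byte("piece-3")),
	}
	fmt.Printf("combined digest for the node: %x\n", hashOfHashes(hashes))
}
```

The catch is exactly the one above: gathering and combining every piece hash for each of thousands of nodes is extra work on the satellite side.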