Storage node to Satellite : I've lost this block, sorry

Yep, that matches my expectation. The lower the alpha value, the higher the impact of an audit failure. Your alpha value is around 10; the highest possible value would be 20.
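For illustration, here is a minimal sketch of that alpha/beta update, assuming a forgetting factor of 0.95 and an audit weight of 1 (those values reproduce the maximum alpha of 20 mentioned above; the real satellite parameters may differ):

```go
package main

import "fmt"

const (
	lambda = 0.95 // forgetting factor: how quickly old audit results fade (assumed)
	weight = 1.0  // weight of a single audit (assumed)
)

// update applies one audit result to the running alpha/beta pair.
func update(alpha, beta float64, success bool) (float64, float64) {
	if success {
		return lambda*alpha + weight, lambda * beta
	}
	return lambda * alpha, lambda*beta + weight
}

func main() {
	alpha, beta := 1.0, 0.0 // assumed starting values for a new node

	// A long run of successful audits pushes alpha towards weight/(1-lambda) = 20.
	for i := 0; i < 200; i++ {
		alpha, beta = update(alpha, beta, true)
	}
	fmt.Printf("after 200 successes: alpha=%.2f score=%.3f\n", alpha, alpha/(alpha+beta))

	// A single failure then knocks alpha and the score down; the lower alpha
	// already is, the bigger the relative hit of the next failure.
	alpha, beta = update(alpha, beta, false)
	fmt.Printf("after one failure:   alpha=%.2f score=%.3f\n", alpha, alpha/(alpha+beta))
}
```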

Hello there! :slight_smile:

I'd like to give my opinion after a few months now, following my node's incident that lost ~2000 files (see Storage node to Satellite : I've lost this block, sorry - #70 by Pac).

Right now, this node is still online, is vetted on all sats except one (just a matter of days now), and works well.

However, I'm still closely monitoring its audit scores, and even though they are now alright most of the time, the node keeps failing an audit now and then, sometimes 2 or 3 very close to each other, and it feels like it will never get back to a fully healthy status; honestly, I dislike that.
Audit history as of today on 2 sats (reminder: up-to-date stats on Audits N3 (scores) - ThingSpeak IoT) is as follows:
[image: audit score history charts on two satellites]

Even though it feels very unlikely that this could cause a disqualification in the end, and even though statistically speaking it's probably not a problem for the network, I'm still having a hard time understanding why it is "okay" that my node lost thousands of files and can keep being part of the network without repair.

Also, in my particular case, I know what happened, I know I lost some files, and I know I will never recover them; they're lost for good.
But because it's just a small percentage of the whole node, it will probably not get disqualified for that.
I should be happy about it, but on the other hand it bugs me that satellites keep auditing these files, causing scores to drop regularly and yellow warnings to show up on my dashboard every other day.

But the real problem for me is that it's now difficult to know whether these audit failures are due to my past incident, or whether my new setup is starting to fail again. And that's quite annoying :confused:
Moreover, now that I see yellow all the time on my dashboard, I'm getting used to it, and it does not stress me out like it should, because now I just think "yeah fine, I know, this node's broken anyway" and I'm not reacting correctly to warnings anymore.

That is why (and I know some people have already suggested this in the past) I must say I would appreciate a feature where I can tell satellites that I'm willing to "repair" my node, and pay for it. I would even be ready to pay for this at the customer's rate: if a node is severely damaged, it gets disqualified; but if a node survives some file loss, it's necessarily because only a very small percentage or number of its files is damaged, which means that repairing the node shouldn't cost much compared to what the node earns.

In my particular case, I lost roughly 2000 files. Each file is around 2 MB I believe, which means I lost ~4 GB of data. Repairing this amount of data at the customer's egress (download) price would be 0.004 TB × $45 I believe, which is 0.18 USD.
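Here is that back-of-the-envelope calculation spelled out, using my own assumed numbers (~2000 lost pieces of roughly 2 MB each, and a $45/TB customer egress rate):

```go
package main

import "fmt"

func main() {
	const (
		lostPieces  = 2000.0 // pieces lost in the incident (rough estimate)
		pieceSizeMB = 2.0    // assumed average piece size
		egressPerTB = 45.0   // assumed customer egress rate in USD/TB
	)

	lostTB := lostPieces * pieceSizeMB / 1e6 // MB -> TB (decimal units)
	fmt.Printf("lost data: %.3f TB, repair at customer rate: $%.2f\n",
		lostTB, lostTB*egressPerTB) // prints 0.004 TB and $0.18
}
```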

Considering my node makes at the very least $0.5/month (because it's fairly new), it means that by the middle of the month I would have earned enough on this node to trigger some kind of repair operation, and if I could do that I'd be happy to, so my node gets back to a steady green status!

Such a feature would have several advantages in my opinion:

  • Small incidents like mine could be repaired voluntarily by SNOs themselves, so they would no longer see repeated warnings on their dashboards; this would bring their node back to a healthy status and give SNOs some peace of mind after an incident.
  • The network would be even more robust, because there would be fewer nodes with dangling lost files here and there.
  • Repairs would cost StorjLabs a bit less, as SNOs could pay for these repairs themselves (I'm assuming I'm not the only one who would be willing to pay for small repairs like this one).

Because right now, my only options are:

  • Leaving my node as it is, praying that the odds don't make it fail audits on too many missing files in a row (very, very unlikely), with yellow warnings regularly popping up on my dashboard, and hoping that after 6 months, 1 year, 2 years (who knows?) my missing files will slowly get repaired elsewhere so they stop being audited…
  • Killing my node to start a fresh new healthy one: that would cost StorjLabs a lot of repairs, and it would take me a couple of months to get back to my current node's vetting state and reputation.

Any chance such a feature could be at least considered?


I agree, but there is no way of knowing exactly which files you are missing except for a full scan of your node from the satellite, which would be very time-consuming.
From what I understood from this thread, that seems to be the biggest drawback.


My understanding is that it would be incredibly resource-consuming to have an exhaustive list of damaged files, that is, files that are there but corrupted: that would indeed require the satellite to audit all the files on a node to find out which ones are invalid.

But, with regards to missing files, I believe the Node knows which ones it should have, as it uses this mechanism to know which ones should be deleted (garbage collection). Right?

Based on that, it probably also knows which files are missing from the disk: I don't see why the node could not ask for these specific files to be repaired.


With regards to corrupted files, the node should get a failed audit and its score should take a hit when this happens, but the satellite could record the fact that this node doesn't have this fragment anymore so it does not get audited in the future…

Unfortunately not. If it could have this knowledge, it could be abused. Garbage collection is initiated by the satellite: the satellite sends a bloom filter, which lets the node work out which pieces should no longer be there.

When the number of missing pieces reaches the threshold, the repair job is triggered: the lost pieces are recovered, the pointer to the failed node is removed from the database, and the satellite will not audit this piece on that node anymore, because it's now on another node.
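Roughly speaking, the trigger looks like this; the numbers below are purely illustrative, not the actual production thresholds:

```go
package main

import "fmt"

// segment is a simplified view of what the satellite tracks per segment.
type segment struct {
	healthyPieces   int // pieces still held by reliable nodes
	repairThreshold int // below this, the repair job kicks in
	minimumRequired int // pieces needed to reconstruct the segment
}

func needsRepair(s segment) bool {
	return s.healthyPieces < s.repairThreshold
}

func main() {
	// One lost piece on one node rarely matters: the segment is still far
	// above the repair threshold, so the satellite simply waits.
	s := segment{healthyPieces: 78, repairThreshold: 52, minimumRequired: 29}
	fmt.Println("repair now?", needsRepair(s)) // false

	// Only when enough other nodes also lose their pieces does the repair
	// run, rebuild the segment from minimumRequired pieces, and drop the
	// failed nodes from the segment's pointer.
	s.healthyPieces = 51
	fmt.Println("repair now?", needsRepair(s)) // true
}
```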

So this could not be done in reverse? To know which files should be there?

Yes indeed, but this could take years :slight_smile:


Alright, what I'm asking for might not be sensible, and some of it might even be impossible…
But I thought it was important to explain why there's a bit of frustration with this node, which lost some files and is now "broken forever" ^^'
Surely I'm not the only one in this case (there's at least @andrew2.hart too :wink:).

Just in case some workarounds could be found: if satellites at least flagged files as no longer held by this node whenever an audit fails, it would get back to a healthy state much quicker, I think.
Just my 2 cents :innocent:

Too expensive. Every file on your node would have to be audited, which would take a lot of time and money: every single piece would have to be downloaded from your node at the same rate as a repair, whereas the repair job will download only the required minimum number of pieces when the time comes, not all of them as a full audit would.
So it's much cheaper for the network to just fix it sometime in the future.


So basically it seems there's really nothing that could be improved to "fix these kinds of broken nodes" :confused:

Thanks anyway for your responses @Alexey :+1:


While I feel for your situation, I guess I'm a bit more of a hard-liner on data loss. I don't think your node should be allowed to survive if it has lost that many files. Audit scores recover too quickly for my taste, allowing nodes to end up in this endless loop of scores dropping and recovering. If it took a month or even a week to recover from a failed audit, your node would be out of the running, but it would still allow for some small file corruption or a slow/no response.

Bloom filters are tricky. They don't represent a list of all pieces that should be there, but rather a pattern that all pieces that should be there would match. It's a means of compressing a long list of IDs into a very small package. The trade-off is that it will match everything that should be there, but also some things that shouldn't be there. About 10% of garbage will match the bloom filter and won't get cleaned up on the first run. It will be cleaned up in future runs though.

This basically means it will also match roughly 10% of all possible piece IDs. So clearly it can't be used to determine which pieces you are missing.
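To make that concrete, here is a toy bloom filter with made-up piece IDs, sized for roughly a 10% false-positive rate (this is just a sketch, not the real satellite implementation):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// bloom is a deliberately tiny bloom filter: m bits, k salted FNV hashes.
type bloom struct {
	bits []bool
	k    int
}

func newBloom(m, k int) *bloom {
	return &bloom{bits: make([]bool, m), k: k}
}

func (b *bloom) indexes(id string) []int {
	idx := make([]int, b.k)
	for i := 0; i < b.k; i++ {
		h := fnv.New64a()
		fmt.Fprintf(h, "%d:%s", i, id) // salt each hash with its index
		idx[i] = int(h.Sum64() % uint64(len(b.bits)))
	}
	return idx
}

func (b *bloom) add(id string) {
	for _, i := range b.indexes(id) {
		b.bits[i] = true
	}
}

func (b *bloom) has(id string) bool {
	for _, i := range b.indexes(id) {
		if !b.bits[i] {
			return false
		}
	}
	return true
}

func main() {
	held := make([]string, 10000)    // pieces the node is supposed to hold
	garbage := make([]string, 10000) // pieces that should no longer be there
	for i := range held {
		held[i] = fmt.Sprintf("held-%d-%d", i, rand.Int())
		garbage[i] = fmt.Sprintf("garbage-%d-%d", i, rand.Int())
	}

	// Sized so that roughly 10% of IDs that were never added still match.
	f := newBloom(48000, 3)
	for _, id := range held {
		f.add(id)
	}

	// Every held piece matches, so the filter can never tell the node which
	// held pieces it has lost; it only lets it delete (most of) the garbage.
	falsePositives := 0
	for _, id := range garbage {
		if f.has(id) {
			falsePositives++
		}
	}
	fmt.Printf("garbage still matching the filter: %.1f%%\n",
		100*float64(falsePositives)/float64(len(garbage)))
}
```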

Your node could also never tell the satellite to repair data. Even if all other issues could be overcome, at most it could tell the satellite a piece got lost, but the satellite would still only initiate repair when the available number of pieces drops below the repair threshold.

The purpose of audits is to judge whether a node is reliable, not whether specific data is still on a node. Only a tiny fraction of a node's data is ever audited, so taking action on those individual pieces is nonsensical. Just look at the minuscule amount of audit traffic your node sees; it's negligible.


And I agree with you @BrightSilence :slight_smile:

But if, based on data science (and probably a lot of complicated statistical stuff I don't get), my node is apparently still considered "reliable enough" by all satellites, I'm just a bit disappointed that I cannot repair it so it gets back to a clean state. It looks like I'm kind of stuck with a slightly broken node forever.

Apparently! Thanks a lot for the explanations.
Well, maybe there are other ways… sending a list of files with a hash for each one, if the node asks for it? So it knows what's missing or corrupted?

I guess you'll tell me again that it would be a lot of work for the sat' for almost nothing.


This particular node has been configured to hold only 500 GB of data since the "incident", and I guess it'll stay like this. Or I'll just kill it, as it has not really brought in any kind of revenue so far; it's too young.

But it really feels like it would cost the network a lot less to fix the couple of GBs that are missing, instead of killing the whole node and repairing the 250+ GB of data it now holds. I don't know… I can't stop thinking "there must be a better way".


humbug! (with respect)

I have come to see the failed audit as exactly "I have lost that block, sorry." If the satellite asks again, well, guess what? Still lost. It stops asking eventually.

The network is trust-less and doesn't allow for generosity, so… anyone need any free initial eth? ;D
