Storage node to Satellite : I've lost this block, sorry

Yep, that matches my expectation. The lower the alpha value, the higher the impact of an audit failure. Your alpha value is around 10; the highest possible value would be 20.
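For illustration, here is a minimal sketch of that alpha/beta update, assuming a forgetting factor of 0.95 and an audit weight of 1 (those values reproduce the maximum alpha of 20 mentioned above; the real satellite parameters may differ):

```go
package main

import "fmt"

const (
	lambda = 0.95 // forgetting factor: how quickly old audit results fade (assumed)
	weight = 1.0  // weight of a single audit (assumed)
)

// update applies one audit result to the running alpha/beta pair.
func update(alpha, beta float64, success bool) (float64, float64) {
	if success {
		return lambda*alpha + weight, lambda * beta
	}
	return lambda * alpha, lambda*beta + weight
}

func main() {
	alpha, beta := 1.0, 0.0 // assumed starting values for a new node

	// A long run of successful audits pushes alpha towards weight/(1-lambda) = 20.
	for i := 0; i < 200; i++ {
		alpha, beta = update(alpha, beta, true)
	}
	fmt.Printf("after 200 successes: alpha=%.2f score=%.3f\n", alpha, alpha/(alpha+beta))

	// A single failure then knocks alpha and the score down; the lower alpha
	// already is, the bigger the relative hit of the next failure.
	alpha, beta = update(alpha, beta, false)
	fmt.Printf("after one failure:   alpha=%.2f score=%.3f\n", alpha, alpha/(alpha+beta))
}
```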

Hello there! :slight_smile:

I'd like to give my opinion after a few months now, following my node's incident that lost ~2000 files (see Storage node to Satellite : I've lost this block, sorry - #70 by Pac).

Right now, this node is still online, is vetted on all sats except one (just a matter of days now), and works well.

However, I'm still closely monitoring its audit scores, and even though they are now alright most of the time, the node keeps failing an audit now and then, sometimes 2 or 3 very close to each other, and it feels like it will never get back to a fully healthy status; honestly, I dislike that.
Audit history as of today on 2 sats (reminder: up-to-date stats on Audits N3 (scores) - ThingSpeak IoT) is as follows:
[image: audit score history charts on two satellites]

Even though it feels very unlikely that this could cause a disqualification in the end, and even though statistically speaking it's probably not a problem for the network, I'm still having a hard time understanding why it is "okay" that my node lost thousands of files and can keep being part of the network without repair.

Also, in my particular case, I know what happened, I know I lost some files, and I know I will never recover them; they're lost for good.
But because it's just a small percentage of the whole node, it will probably not get disqualified for that.
I should be happy about it, but on the other hand it bugs me that satellites keep auditing these files, causing scores to drop regularly and yellow warnings to show up on my dashboard every other day.

But the real problem for me is that it's now difficult to know whether these audit failures are due to my past incident, or whether my new setup is starting to fail again. And that's quite annoying :confused:
Moreover, now that I see yellow all the time on my dashboard, I'm getting used to it, and it does not stress me out like it should, because now I just think "yeah fine, I know, this node's broken anyway" and I'm not reacting correctly to warnings anymore.

That is why (and I know some people have already suggested this in the past) I must say I would appreciate a feature where I can tell satellites that I'm willing to "repair" my node, and pay for it. I would even be ready to pay for this at the customer's rate: if a node is severely damaged, it gets disqualified; but if a node survives some file loss, it's necessarily because only a very small percentage or number of its files is damaged, which means that repairing the node shouldn't cost much compared to what the node earns.

In my particular case, I lost roughly 2000 files. Each file is around 2 MB I believe, which means I lost ~4 GB of data. Repairing this amount of data at the customer's egress (download) price would be 0.004 TB × $45 I believe, which is 0.18 USD.
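Here is that back-of-the-envelope calculation spelled out, using my own assumed numbers (~2000 lost pieces of roughly 2 MB each, and a $45/TB customer egress rate):

```go
package main

import "fmt"

func main() {
	const (
		lostPieces  = 2000.0 // pieces lost in the incident (rough estimate)
		pieceSizeMB = 2.0    // assumed average piece size
		egressPerTB = 45.0   // assumed customer egress rate in USD/TB
	)

	lostTB := lostPieces * pieceSizeMB / 1e6 // MB -> TB (decimal units)
	fmt.Printf("lost data: %.3f TB, repair at customer rate: $%.2f\n",
		lostTB, lostTB*egressPerTB) // prints 0.004 TB and $0.18
}
```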

Considering my node makes at the very least $0.5/month (because it's fairly new), it means that by the middle of the month I would have earned enough on this node to trigger some kind of repair operation, and if I could do that I'd be happy to, so my node gets back to a steady green status!

Such a feature would have several advantages in my opinion:

  • Small incidents like mine could be repaired voluntarily by SNOs themselves, so they would no longer see repeated warnings on their dashboards; this would bring their node back to a healthy status and give SNOs some peace of mind after an incident.
  • The network would be even more robust, because there would be fewer nodes with dangling lost files here and there.
  • Repairs would cost StorjLabs a bit less, as SNOs could pay for these repairs themselves (I'm assuming I'm not the only one who would be willing to pay for small repairs like this one).

Because right now, my only options are:

  • Leaving my node as it is, praying that the odds don't make it fail audits on too many missing files in a row (very, very unlikely), with yellow warnings regularly popping up on my dashboard, and hoping that after 6 months, 1 year, 2 years (who knows?) my missing files will slowly get repaired elsewhere so they stop being audited…
  • Killing my node to start a fresh new healthy one: that would cost StorjLabs a lot of repairs, and it would take me a couple of months to get back to my current node's vetting state and reputation.

Any chance such a feature could be at least considered?


I agree, but there is no way of knowing exactly which files you are missing except for a full scan of your node from the satellite, which would be very time-consuming.
From what I understood from this thread, that seems to be the biggest drawback.


My understanding is that it would be incredibly resource-consuming to have an exhaustive list of damaged files, that is, files that are there but corrupted: that would indeed require the satellite to audit all the files on a node to find out which ones are invalid.

But, with regards to missing files, I believe the Node knows which ones it should have, as it uses this mechanism to know which ones should be deleted (garbage collection). Right?

Based on that, it probably also knows which files are missing from the disk: I don't see why the node could not ask for these specific files to be repaired.


With regards to corrupted files, the node should get a failed audit and its score should take a hit when this happens, but the satellite could record the fact that this node doesn't have this fragment anymore so it does not get audited in the future…

Unfortunately not. If it could have this knowledge, it could be abused. Garbage collection is initiated by the satellite: the satellite sends a bloom filter, which lets the node work out which pieces should no longer be there.

When the number of missing pieces reaches the threshold, the repair job is triggered: the lost pieces are recovered, the pointer to the failed node is removed from the database, and the satellite will not audit this piece on that node anymore, because it's now on another node.
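Roughly speaking, the trigger looks like this; the numbers below are purely illustrative, not the actual production thresholds:

```go
package main

import "fmt"

// segment is a simplified view of what the satellite tracks per segment.
type segment struct {
	healthyPieces   int // pieces still held by reliable nodes
	repairThreshold int // below this, the repair job kicks in
	minimumRequired int // pieces needed to reconstruct the segment
}

func needsRepair(s segment) bool {
	return s.healthyPieces < s.repairThreshold
}

func main() {
	// One lost piece on one node rarely matters: the segment is still far
	// above the repair threshold, so the satellite simply waits.
	s := segment{healthyPieces: 78, repairThreshold: 52, minimumRequired: 29}
	fmt.Println("repair now?", needsRepair(s)) // false

	// Only when enough other nodes also lose their pieces does the repair
	// run, rebuild the segment from minimumRequired pieces, and drop the
	// failed nodes from the segment's pointer.
	s.healthyPieces = 51
	fmt.Println("repair now?", needsRepair(s)) // true
}
```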

So this could not be done in reverse? To know which files should be there?

Yes indeed, but this could take years :slight_smile:


Alright, what I'm asking for might not be sensible, and some of it might even be impossible…
But I thought it was important to explain why there's a bit of frustration with this node, which lost some files and is now "broken forever" ^^'
Surely I'm not the only one in this case (there's at least @andrew2.hart too :wink:).

Just in case some workarounds could be found: if satellites at least flagged files as no longer held by this node whenever an audit fails, it would get back to a healthy state much quicker, I think.
Just my 2 cents :innocent:

Too expensive. Every file on your node would have to be audited, which would take a lot of time and money: every single piece would have to be downloaded from your node at the same rate as a repair, whereas the repair job will download only the required minimum number of pieces when the time comes, not all of them as a full audit would.
So it's much cheaper for the network to just fix it sometime in the future.


So basically it seems there's really nothing that could be improved to "fix these kinds of broken nodes" :confused:

Thanks anyway for your responses @Alexey :+1:


While I feel for your situation, I guess I'm a bit more of a hard-liner on data loss. I don't think your node should be allowed to survive if it has lost that many files. Audit scores recover too quickly for my taste, allowing nodes to end up in this endless loop of scores dropping and recovering. If it took a month or even a week to recover from a failed audit, your node would be out of the running, but it would still allow for some small file corruption or a slow/no response.

Bloom filters are tricky. They don't represent a list of all pieces that should be there, but rather a pattern that all pieces that should be there would match. It's a means of compressing a long list of IDs into a very small package. The trade-off is that it will match everything that should be there, but also some things that shouldn't be there. About 10% of garbage will match the bloom filter and won't get cleaned up on the first run. It will be cleaned up in future runs though.

This basically means it will also match roughly 10% of all possible piece IDs. So clearly it can't be used to determine which pieces you are missing.
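To make that concrete, here is a toy bloom filter with made-up piece IDs, sized for roughly a 10% false-positive rate (this is just a sketch, not the real satellite implementation):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// bloom is a deliberately tiny bloom filter: m bits, k salted FNV hashes.
type bloom struct {
	bits []bool
	k    int
}

func newBloom(m, k int) *bloom {
	return &bloom{bits: make([]bool, m), k: k}
}

func (b *bloom) indexes(id string) []int {
	idx := make([]int, b.k)
	for i := 0; i < b.k; i++ {
		h := fnv.New64a()
		fmt.Fprintf(h, "%d:%s", i, id) // salt each hash with its index
		idx[i] = int(h.Sum64() % uint64(len(b.bits)))
	}
	return idx
}

func (b *bloom) add(id string) {
	for _, i := range b.indexes(id) {
		b.bits[i] = true
	}
}

func (b *bloom) has(id string) bool {
	for _, i := range b.indexes(id) {
		if !b.bits[i] {
			return false
		}
	}
	return true
}

func main() {
	held := make([]string, 10000)    // pieces the node is supposed to hold
	garbage := make([]string, 10000) // pieces that should no longer be there
	for i := range held {
		held[i] = fmt.Sprintf("held-%d-%d", i, rand.Int())
		garbage[i] = fmt.Sprintf("garbage-%d-%d", i, rand.Int())
	}

	// Sized so that roughly 10% of IDs that were never added still match.
	f := newBloom(48000, 3)
	for _, id := range held {
		f.add(id)
	}

	// Every held piece matches, so the filter can never tell the node which
	// held pieces it has lost; it only lets it delete (most of) the garbage.
	falsePositives := 0
	for _, id := range garbage {
		if f.has(id) {
			falsePositives++
		}
	}
	fmt.Printf("garbage still matching the filter: %.1f%%\n",
		100*float64(falsePositives)/float64(len(garbage)))
}
```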

Your node could also never tell the satellite to repair data. Even if all other issues could be overcome, at most it could tell the satellite a piece got lost, but the satellite would still only initiate repair when the available number of pieces drops below the repair threshold.

The purpose of audits is to judge whether a node is reliable, not whether specific data is still on a node. Only a tiny fraction of a node's data is ever audited, so taking action on those individual pieces is nonsensical. Just look at the minuscule amount of audit traffic your node sees; it's negligible.


And I agree with you @BrightSilence :slight_smile:

But if, based on data science (and probably a lot of complicated statistical stuff I don't get), my node is apparently still considered "reliable enough" by all satellites, I'm just a bit disappointed that I cannot repair it so it gets back to a clean state. It looks like I'm kind of stuck with a slightly broken node forever.

Apparently! Thanks a lot for the explanations.
Well, maybe there are other ways… sending a list of files with a hash for each one, if the node asks for it? So it knows what's missing or corrupted?

I guess you'll tell me again that it would be a lot of work for the sat' for almost nothing.


This particular node has been configured to hold only 500 GB of data since the "incident", and I guess it'll stay like this. Or I'll just kill it, as it has not really brought in any kind of revenue so far; it's too young.

But it really feels like it would cost the network a lot less to fix the couple of GBs that are missing, instead of killing the whole node and repairing the 250+ GB of data it now holds. I don't know… I can't stop thinking "there must be a better way".


humbug! (with respect)

I have come to see the failed audit as exactly "I have lost that block, sorry." If the satellite asks again, well, guess what? Still lost. It stops asking eventually.

The network is trust-less and doesn't allow for generosity, so… anyone need any free initial eth? ;D
