Yep, that matches my expectation. The lower the alpha value, the higher the impact of an audit failure. Your alpha value is around 10; the highest possible value would be 20.
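For context, the audit score comes from a beta-reputation model with a forgetting factor. Here is a minimal sketch, assuming the commonly cited parameters λ = 0.95 and w = 1 (the satellite's actual tuning may differ):

```python
# Sketch of the beta-reputation update behind the audit score.
# LAM (forgetting factor) and W (weight) are assumed values.
LAM, W = 0.95, 1.0

def update(alpha, beta, success):
    # Successful audits feed alpha, failures feed beta; both decay.
    if success:
        return LAM * alpha + W, LAM * beta
    return LAM * alpha, LAM * beta + W

def score(alpha, beta):
    return alpha / (alpha + beta)

# Under perfect audits, alpha converges toward W / (1 - LAM) = 20:
a, b = 1.0, 0.0
for _ in range(1000):
    a, b = update(a, b, True)
print(round(a, 2))  # close to 20.0

# A single failure hurts the score more at alpha = 10 than at 20:
print(round(score(*update(10.0, 0.0, False)), 3))
print(round(score(*update(20.0, 0.0, False)), 3))
```

This is why a low alpha amplifies each failure: the single failure weight w is larger relative to the accumulated success history.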
Hello there!
I'd like to give my opinion after a few months now, following my node's incident that lost ~2000 files (see Storage node to Satellite : I've lost this block, sorry - #70 by Pac).
Right now, this node is still online, is vetted on all sats except one (just a matter of days now), and works well.
However, I'm still closely monitoring its audit scores, and even though they are alright most of the time now, the node keeps getting audit failures now and then, sometimes 2 or 3 very close to each other. It feels like the node will never get back to a fully healthy status, and honestly I dislike that.
Audit history as of today on 2 sats (reminder: up-to-date stats on Audits N3 (scores) - ThingSpeak IoT) is as follows:
Even though it feels very unlikely that this could cause a disqualification in the end, and even though statistically speaking it's probably not a problem for the network, I'm still having a hard time understanding why it is "okay" that my node lost thousands of files and can continue being part of the network without repair.
Also, in my particular case, I know what happened, I know I lost some files, and I know I will never recover them; they're lost for good.
But because it's just a small percentage of the whole node, it will probably not get disqualified for that.
I should be happy about it, but on the other hand it bugs me that satellites keep auditing these files, causing scores to drop regularly and yellow warnings to show up on my dashboard every other day.
But for me, the real problem is that it's now difficult to know whether these audit failures are due to my past incident, or whether my new setup is starting to fail again. And that's quite annoying.
Moreover, now that I see yellow all the time on my dashboard, I'm getting used to it, and it does not stress me out like it should, because now I just think "yeah fine, I know, this node's broken anyway" and I'm not reacting correctly to warnings anymore.
That is why (and I know some people already suggested this in the past) I must say I would appreciate a feature where I can tell satellites that I'm willing to "repair" my node, and pay for it. I would even be ready to pay at the customer's rate: if a node is severely damaged, it gets disqualified. But if a node survives some file loss, it's necessarily because it has a very small percentage or number of damaged files, which means that repairing the node shouldn't cost a lot compared to what the node earns.
In my particular case, I lost roughly 2000 files. Each file is around 2MB I believe, which means I lost ~4GB of data. Repairing this amount of data at the customer's egress (download) price would be 0.004TB * $45/TB I believe, which is 0.18 USD.
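The arithmetic above can be sanity-checked in a couple of lines (the $45/TB figure is the customer egress price assumed in the post):

```python
# Back-of-the-envelope check of the repair-cost estimate.
files = 2000          # lost files
mb_per_file = 2       # ~2MB each (author's estimate)
price_per_tb = 45     # assumed customer egress price, $/TB

lost_tb = files * mb_per_file / 1_000_000   # MB -> TB (decimal units)
cost = round(lost_tb * price_per_tb, 2)
print(lost_tb, cost)  # 0.004 TB -> $0.18
```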
Considering my node makes at the very least $0.5/month (because it's fairly new), by the middle of the month I would have earned enough on this node to be able to trigger some kind of repair operation, and if I could do that I'd be happy to, so my node goes back to a steady green status!
Such a feature would have several advantages in my opinion:
- Small incidents like mine could be repaired voluntarily by SNOs themselves, sparing them repeated warnings on their dashboards, bringing their node back to a healthy status, and giving SNOs some peace of mind after an incident.
- The network would be even more robust, because there would be fewer nodes with dangling lost files here and there.
- Repairs would cost StorjLabs a bit less, as SNOs could pay for these repairs (I'm assuming I'm not the only one who would be willing to pay for small repairs like this one).
Because right now, my only options are:
- Leaving my node as it is, praying for the odds not to audit too many missing files in a row (very, very unlikely), with yellow warnings regularly popping up on my dashboard, and hoping that after 6 months, 1 year, 2 years (who knows?) my missing files will slowly get repaired elsewhere so they stop being audited…
- Killing my node to start a fresh new healthy one: that would cost StorjLabs a lot in repairs, and it would take me a couple of months to get back to my current node's vetting state and reputation.
Any chance such a feature could be at least considered?
I agree, but there is no way of knowing exactly which files you are missing except for a full scan of your node from the satellite, which would be very time consuming.
From what I understood from this thread, that seems to be the biggest drawback.
My understanding is that it would be incredibly resource-consuming to build an exhaustive list of damaged files, that is, files that are there but corrupted: that would indeed require the satellite to audit every file on a node to find out which ones are invalid.
But with regard to missing files, I believe the node knows which ones it should have, as it uses this mechanism to know which ones should be deleted (garbage collection). Right?
Based on that, it probably also knows which files are missing from the disk: I don't see why the node could not ask for these specific files to be repaired.
With regard to corrupted files, the node should get a failed audit and its score should take a hit when this happens, but the satellite could record the fact that this node doesn't have this fragment anymore, so it does not get audited in the future…
Unfortunately not. If the node could have this knowledge, it could be abused. Garbage collection is initiated by the satellite: the satellite sends a bloom filter, which lets the node determine which pieces should not be there.
When the number of missing pieces reaches the threshold, a repair job is triggered; the lost pieces are then recovered, the pointer to the failed node is removed from the database, and the satellite will not audit this piece on that node anymore, because it's now on other nodes.
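The repair decision described here can be sketched roughly like this (the threshold numbers below are illustrative placeholders, not Storj's actual Reed-Solomon settings):

```python
# Hypothetical satellite-side repair check. A single lost piece does
# nothing by itself; repair is only queued once a segment's healthy
# piece count drops to the repair threshold.
REPAIR_THRESHOLD = 52   # illustrative, not the real setting
MINIMUM = 29            # illustrative: pieces needed to reconstruct

def needs_repair(healthy: int) -> bool:
    return healthy <= REPAIR_THRESHOLD

def recoverable(healthy: int) -> bool:
    return healthy >= MINIMUM

print(needs_repair(70))                   # False: enough redundancy left
print(needs_repair(52), recoverable(52))  # True True: queue repair
```

The point is that repair is a property of the whole segment across many nodes, not of any one node's loss, which is why one node cannot trigger it for itself.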
So this could not be done in reverse, to know which files should be there?
Yes indeed, but this could take years.
Alright, what I'm asking for might not be sensible, and some of it might even be impossible…
But I thought it was important to explain why there's a bit of frustration with this node, which lost some files and is now "broken forever" ^^
Surely I'm not the only one in this case (there's at least @andrew2.hart too).
Just in case some workarounds could be found. If satellites at least flagged files as no longer held by this node whenever an audit fails, it would get back to a healthy state much quicker, I think.
Just my 2 cents
Too expensive. Every file on your node would have to be audited, and that would take a lot of time and money: every single piece would have to be downloaded from your node at the same rate as a repair, whereas a repair job downloads only the required minimum number of pieces when the time comes, not all of them as in the case of a full audit.
So it's much cheaper for the network to just fix it sometime in the future.
So basically there's really nothing that could be improved, it seems, to "fix these kinds of broken nodes".
Thanks anyway for your responses @Alexey
While I feel for your situation, I guess I'm a bit more of a hardliner on data loss. I don't think your node should be allowed to survive if it has lost that many files. Audit scores recover too quickly for my taste, allowing nodes to end up in this endless loop of scores dropping and recovering. If it took a month or even a week to recover from a failed audit, your node would be out of the running, but it would still allow for some small file corruption or a slow/no response.
Bloom filters are tricky. They don't represent a list of all pieces that should be there, but rather a pattern that all pieces that should be there would match. It's a means of compressing a long list of IDs into a very small package. The trade-off is that it will match everything that should be there, but also some things that shouldn't be there. About 10% of garbage will match the bloom filter and won't get cleaned up on the first run. It will be cleaned up in future runs though.
This basically means it will match 10% of all possible piece IDs. So clearly it can't be used to determine what pieces you are missing.
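That trade-off is easy to demonstrate with a toy bloom filter (the sizes, hash scheme, and ~10% target below are made up for the demo; Storj's real filters differ):

```python
import hashlib

# Toy bloom filter: M bits, K hash functions. M is chosen so that,
# with 3000 inserted IDs, roughly 10% of unrelated IDs also match.
M, K = 14_400, 3

def positions(item: str):
    # Derive K bit positions from SHA-256 of a salted item ID.
    for i in range(K):
        h = hashlib.sha256(f"{i}:{item}".encode()).digest()
        yield int.from_bytes(h[:8], "big") % M

bits = [False] * M

def add(item):             # satellite side: record pieces that SHOULD exist
    for p in positions(item):
        bits[p] = True

def might_contain(item):   # node side: anything not matching is garbage
    return all(bits[p] for p in positions(item))

pieces = [f"piece-{n}" for n in range(3000)]
for p in pieces:
    add(p)

# Every real piece matches (no false negatives)...
assert all(might_contain(p) for p in pieces)

# ...but a fraction of never-inserted IDs match too (false positives),
# so the filter cannot tell the node which pieces are missing.
others = [f"garbage-{n}" for n in range(3000)]
fp = sum(might_contain(o) for o in others) / len(others)
print(f"false positive rate: {fp:.1%}")
```

The filter is one-sided: it can prove a piece is garbage (no match), but a match proves nothing, which is exactly why it cannot be run "in reverse" to list missing pieces.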
Your node could also never tell the satellite to repair data. Even if all other issues could be overcome, at most it would tell the satellite a piece got lost, but the satellite would still only initiate repair when the available amount of pieces drops below the repair threshold.
The purpose of audits is to judge whether a node is reliable, not whether specific data is still on a node. Only a tiny fraction of a node's data is ever audited, so taking action on those individual pieces is nonsensical. Just look at the minuscule amount of audit traffic your node sees. It's negligible.
And I agree with you @BrightSilence
But if, based on data science (and probably a lot of complicated statistical stuff I don't get), my node is still considered "reliable enough" by all satellites apparently, I'm just a bit disappointed that I cannot repair it so it's back to a clean state. It looks like I'm kinda stuck with a slightly broken node forever.
Apparently! Thanks a lot for the explanations.
Well, maybe there are other ways… sending a list of files with a hash for each one, if the node asks for it? So it knows what's missing or corrupted?
I guess you'll tell me again that it would be a lot of work for the sat for almost nothing.
This particular node has been configured to hold only 500GB of data since the "incident"; I guess it'll stay like this. Or I'll just kill it, as it has not really brought any kind of revenue so far, it's too young.
But it really feels like it would cost the network a lot less to fix a couple of GBs that are missing, instead of killing the whole node and repairing the 250+GB of data it now holds. I don't know… I can't stop thinking "there must be a better way".
humbug! (with respect)
I have come to see the failed audit as exactly "I have lost that block, sorry." If the satellite asks again, well, guess what? Still lost. It stops asking eventually.
The network is trust-less and doesn't allow for generosity, so… anyone need any free initial eth? ;D