Missing 5 .sj1 files out of 3.2 million - end of the node? 'Resilvering' of objects?

I have a 4 TB node. The Seagate drive gave timeouts in the logs, so I promptly moved to another drive.

(… this also makes for a 100% failure rate across all the Seagate desktop drives I’ve used for 24/7 operation so far - 8 in total. Some smaller ones used for day-to-day work are still fine.)

I now have 3+ million files copied off to a new drive, and the node runs again. However, 5 of the .sj1 data blobs couldn’t be copied off. I use btrfs on this node, so I know the other files are good.

Should I toss the node, given the ~99.9998% availability of the remaining objects? Or should I keep it and wait for it to be DQ’ed when audits eventually hit exactly those files? That is unlikely to happen soon, but it might eventually.

The files’ owner will get their files back thanks to the erasure coding, so the node isn’t a huge liability to the network; but it’s not a perfect node anymore.

Should the project have an object-level resilvering function, where a node operator could request - and pay for himself - to have those files repaired back onto his node to bring it back to a 100% state, based on the operator knowing which files to ask the satellite to repair?

Or should nodes have a ‘resilvering’ function to pay for repair back onto their own node, based on the satellite knowing which files should be there?
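Purely as a hypothetical sketch of what I mean (none of these names exist anywhere in the storagenode or satellite code; they’re made up for illustration):

```python
# Hypothetical illustration only - no such API exists in Storj today.
from dataclasses import dataclass, field

@dataclass
class ResilverRequest:
    node_id: str                              # the node asking to be made whole again
    satellite_id: str                         # the satellite holding the segment metadata
    piece_ids: list[str] = field(default_factory=list)  # variant 1: operator lists the pieces he knows are missing
    full_reconcile: bool = False              # variant 2: ask the satellite to diff what *should* be there

def estimated_cost(req: ResilverRequest, price_per_repaired_piece: float) -> float:
    """Cost the operator agrees to pay before the satellite repairs
    just these pieces back onto his node."""
    return len(req.piece_ids) * price_per_repaired_piece
```

The two variants map to the two questions above: either the operator supplies the list of missing pieces, or the satellite derives it from what it expects the node to hold.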

Discuss.

(Also - what would you do in my situation?)

Thanks.

I’d keep the node going. You’d have to be pretty unlucky for audits to check the 5 files that are missing, and even if they do, it’s only 5 files, so you shouldn’t be disqualified.

Keep it going. The worst that can happen is that you WILL get disqualified later, but even if that happens you haven’t lost anything.

I’m not sure whether adding a “resilvering” option would require a lot of development effort, and it would probably be low on the list of priorities, but the concept is interesting indeed.

3 Likes

5 files is nothing. Don’t give it another thought.

5 Likes

Yeah, I lost 5 files due to corruption too. It won’t mean anything. 5 files out of millions will most likely not even get audited any time soon. Don’t worry about it.

2 Likes

This subject has been discussed extensively here: Storage node to Satellite : I've lost this block, sorry

I could get into it, but I’ve already written my opinion on object-level repair there. Your node will be fine. Don’t worry about those 5 files.

5 Likes

Oh, yeah. I’d forgotten about that thread.
Honestly, my memory is terrible these days… :roll_eyes:

For an old node, surviving the loss of over 10% of its files seems like a small issue…
Though with only a few files on a satellite one can quickly get DQ’ed… but beyond that, when the node actually has files, it’s much more resistant to data loss than I would have thought…

The current best estimate of how big a % of files can be lost without DQ is about 20-25%.

So yeah, I lost a few files on a bad migration a while back… it barely even made the audit % move for long. It did initially drop to 87% audit last week, but as of today
it’s at 99.9%… the audit score seems to recover very fast compared to stuff like the online score.
I almost doubt the drop was there in the first place, because it was barely two days or so before it snapped back to near 100%.

But even if I just saw it wrong, it was still down to 93% on another satellite… which also snapped back.

So yeah, long story short: a few files don’t matter unless you are extremely unlucky or on a very new node… and it’s usually a combination of both.

4 Likes

This has bothered me for a while now. The current score system is very flawed. Technically, if you have 10% data loss, you will get disqualified at some point, but it can take millions of audits to finally hit the bad-luck moment for that to happen. How do I know? Well… this has been a bit of a thorn in my side, and the 20-25% number mentioned by @SGC seemed outrageous. However, I suppressed my initial instinct to just ask for the source of that info and figured I’d create a simulator.

This is the result:

This graph shows the behavior of the score for a node with 15% data loss. As you can see, this node got disqualified, but not because it consistently lost reputation. Instead, it basically got disqualified by a statistical outlier. If you run this simulation several times, most of the time the node actually survives the 10K audits in this simulation. I will post more of my findings and a suggestion for improvement in a separate topic/suggestion.
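For anyone who wants to play with the idea, here’s a minimal sketch of that kind of simulation. It’s not my exact simulator: it assumes beta-reputation style scoring (score = alpha / (alpha + beta) with a forgetting factor) and uses placeholder values for lambda, the audit weight and the DQ threshold, so check the satellite configuration for the real numbers:

```python
import random

# Placeholder parameters - not necessarily the satellite's real configuration.
LAMBDA = 0.95         # forgetting factor
WEIGHT = 1.0          # weight of a single audit
DQ_THRESHOLD = 0.6    # disqualify below this score
LOSS_FRACTION = 0.15  # node with 15% of its pieces missing
AUDITS = 10_000

def simulate(loss_fraction: float, audits: int, rng: random.Random) -> tuple[list[float], bool]:
    """Run one node through `audits` audits; return (score history, disqualified?)."""
    alpha = WEIGHT / (1.0 - LAMBDA)  # steady-state alpha of a node with a clean audit history
    beta = 0.0
    history = []
    for _ in range(audits):
        v = 1.0 if rng.random() >= loss_fraction else -1.0  # audit fails with p = loss_fraction
        alpha = LAMBDA * alpha + WEIGHT * (1.0 + v) / 2.0
        beta = LAMBDA * beta + WEIGHT * (1.0 - v) / 2.0
        score = alpha / (alpha + beta)
        history.append(score)
        if score < DQ_THRESHOLD:
            return history, True
    return history, False

if __name__ == "__main__":
    runs = 1_000
    dq = sum(simulate(LOSS_FRACTION, AUDITS, random.Random(i))[1] for i in range(runs))
    print(f"{dq}/{runs} simulated nodes with {LOSS_FRACTION:.0%} loss "
          f"were disqualified within {AUDITS} audits")
```

With these placeholder values a single failed audit at steady state drops the score to roughly 95%, which matches what node operators report, and at 15% loss most runs survive the 10K audits while the occasional outlier run gets disqualified.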

Edit: Suggestion created now. For anyone who’s interested it can be found here: Tuning audit scoring

6 Likes

So really, it wasn’t that far off… :smiley:

And I guess the whole ingress situation might make a huge difference for newer nodes… their ingress keeps accelerating as they get vetted, so the amount of data they store grows significantly each day, and due to the randomness of the current method…

in theory at least, a massive % could then be lost without DQ… if the node has limited data and ingress picks up a lot.

It was a very interesting read.
So my estimate wasn’t really that far off, but with only a few data points that was most likely just luck, lol…
I was very surprised by the high numbers allowed… even 10% lost data is a lot.

Props for wanting to fix the audit system now; it’s been needing an overhaul for a long time.

I guess another thing we can learn from your extensive study is that if a full node gets its data damaged, giving it more space to grow would increase its chances of survival.
I hadn’t thought about that before now.

1 Like

It really wasn’t. That kind of surprised me as it allows for so much loss.

Though based on what I see, a node with 20% loss will get disqualified within at most a few thousand audits. On bigger nodes that can happen in a matter of days - I get 1500+ audits per day.

But really, anecdotal evidence wasn’t enough to find out about long-term survival. I chose 15% because it is right on that edge: we’re talking weeks, maybe months before even larger nodes with that amount of loss get disqualified. At 10%, years of survival are possible. I don’t think that limbo is good for anyone. For Storj it means unreliable nodes stick around with an unknown 10% of data lost. For the node operator it means a node can survive and grow for a long time, only to be disqualified much later. It would be better to make a quicker decision and have the node start over early, while the losses aren’t that high yet.

2 Likes

I believe that the way 10% of files would go missing on a node - HDD errors and the like - typically won’t be that random, and would probably be caught by audits.

But sure, if I (with malicious intent) delete 10% ~randomly, it might be a different case.

Maybe introducing some linearity into the audited blocks (auditing them more sequentially) would solve the problem?

Also, I’m glad I kicked off this topic. :smiley:

(Also, kudos for the work that went into the calculations/simulator; it gives a foundation for a more realistic debate.)

Sequential on the physical disk doesn’t always translate to sequential in the logical representation of the data - usually it doesn’t at all, and on SSDs there is literally no relation. It might help a little in some scenarios, but I’m not sure it’s worth the added complexity in the auditing system. And you would still need the random checks to cover all other scenarios. So it doesn’t solve the problem.

It would be really interesting if one of the Storj data analysts could chime in on this thread (although I’m not expecting them to).

My experience was that 300 lost files brought me very close to DQ on several occasions, but as time passed these occasions became less serious and more spread out.
I concluded that the audit system works as a really slow scrub/repair.

There are no checksums on the blobs either. Madness!

1 Like

This is not the case, though; pieces that fail audits don’t get repaired on your node. Your node likely just got more data, so the missing pieces represent a smaller percentage of the total.

There are, though - it’s called erasure coding. It’s just not stored on your node. :slight_smile:
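To illustrate the k-of-n idea, here’s a toy Reed-Solomon-flavoured sketch over a small prime field. This is not Storj’s implementation or its real parameters, just the principle that any k of the n generated pieces are enough to rebuild the segment:

```python
# Toy k-of-n erasure coding sketch - NOT Storj's code or parameters.
P = 2**31 - 1  # prime modulus for the toy finite field

def _lagrange_at(points: list[tuple[int, int]], x: int) -> int:
    """Evaluate the unique degree < len(points) polynomial through `points` at x (mod P)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total

def encode(data: list[int], n: int) -> list[tuple[int, int]]:
    """Spread k data values into n pieces; any k pieces reconstruct the data."""
    base = list(enumerate(data))                 # data defines the polynomial at x = 0..k-1
    return [(x, _lagrange_at(base, x)) for x in range(n)]

def decode(pieces: list[tuple[int, int]], k: int) -> list[int]:
    """Rebuild the original k values from any k surviving pieces."""
    subset = pieces[:k]
    return [_lagrange_at(subset, x) for x in range(k)]

data = [42, 7, 1999, 123456]          # k = 4 "blocks" of a segment
pieces = encode(data, n=10)           # stored on 10 different nodes
surviving = pieces[5:9]               # 6 pieces lost, any 4 remain
assert decode(surviving, k=4) == data # segment still fully recoverable
```

Which is also why the 5 missing pieces in the OP aren’t a data-loss event for the customer: their segments can be rebuilt from the pieces held by other nodes.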

Well, I know they keep an eye on suggestions, especially those with lots of votes. So the best way to get their attention is to vote on the corresponding suggestion here: Tuning audit scoring

2 Likes

I’m surprised by your results. In the thread you linked previously, I reported having lost 5% of the files on a node, and it hit the dangerous score of 70% at some point.
But because it was very young, it was still receiving data, and little by little the percentage of lost files decreased. I thought that was the reason it did not get disqualified.

I would have thought that 10% of lost files would have been a highway to DQ for sure! :thinking:

Statistically I guess you’re right, although if the node keeps growing, the percentage of lost files keeps decreasing, reducing the odds of getting disqualified (like my node did: today it rarely fails audits, but it still does from time to time, which drops the score to around 94% before it goes back up over the following days).

And… Surely some advanced statistical models determined that the current audit system is robust? Data scientists know better, right? :slight_smile:

1 Like

ap1 dropped again now, so it does seem to repeat, though I wouldn’t exactly call it proof yet… anyway, I thought I would pass on the data.
This was also the satellite that seems most affected; it was the one that was down to 87%.


8.76 TB*h this month for ap1… I’m going to underestimate that as 5 days:
8,760 GB*h / 120 hours = 73, so the node is storing at least 73 GB for the ap1 satellite.

And yes, I am aware of my *cough* slight downtime issue… my server OS decided to crap itself.

What are your success rates like for this node?
Btw, 95% can happen after just a single failed audit, and it usually recovers fast. Since you know the issue that caused the data loss isn’t causing any further loss, you don’t have to worry about getting disqualified; you’ll just see intermittent drops from time to time.

Although, if you don’t see failed audits in the logs, it may be caused by system performance/bottlenecks. That would be worth looking into. But my guess is you just have a log line for a missing file.

Yeah, it’s a missing file…

I just pulled it up because we had talked about whether it was recurring, and well, it does seem to indicate you were right… :smiley:
It will be interesting to see how often it happens…
Of course there’s also not a lot of data on that satellite…

73 GB is only, let’s say, 36,500 x 2 MB files… and it was just that and one other satellite doing most of the ingress when I made my mistake of deleting the new files after finishing the migration.

So let’s say it was 10 files… hell, even 100 is already ~0.3% data loss,
and if 2.1% is lethal long term… that could be close to 10% of what would be required to get DQ’ed long term on that particular satellite…

I know I scaled the number of files lost by 50… but I always said a couple of files… I have no real idea of the exact amount, but something like 66 files lost would put it at ~0.2% lost, which is about 1/10 of long-term DQ…

That’s not a lot of files…
And DQ is at 59%, so it doesn’t seem that unlikely that less could get me to 87%, but honestly I have no real idea; I just know what I saw and how it happened…

And it doesn’t leave much beyond speculation… we can only speculate about how many files it was, because my logs were - and now are even more - messed up thanks to Proxmox straight up dying on me…

So now I’ve got a ton of extra work… yay.

I guess the % of lost data also depends on how it is calculated… whether it’s capacity- or file-based.
I suppose it would be file-based… which would let one lose much more capacity-wise… I dunno… that seems like it might also be a weak point, because the odds of auditing small files would be much higher.

This is really not accurate. It used to be kind of accurate when most data was test data, which used large files for the most part, but most pieces are much smaller. And ap1 is also the satellite that for some reason had tons of 512-byte pieces. Just look at the success rates; that should give you a much better indication of how much you’ve lost. I’m fairly certain you’ll have many more files than you calculated (you can just count the files in the folder for this satellite).
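If you want to do that count quickly, something along these lines works, assuming the usual storage/blobs/&lt;satellite folder&gt;/&lt;prefix&gt;/&lt;piece&gt;.sj1 layout (the storage path here is just a placeholder - point it at your own node’s storage directory):

```python
from pathlib import Path

STORAGE_DIR = Path("/mnt/storj/storage")  # placeholder - your node's storage directory

# Count .sj1 pieces per satellite folder under blobs/.
for satellite_dir in sorted((STORAGE_DIR / "blobs").iterdir()):
    if satellite_dir.is_dir():
        pieces = sum(1 for _ in satellite_dir.rglob("*.sj1"))
        print(f"{satellite_dir.name}: {pieces} pieces")
```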

I think you’re referring to whether they will implement my suggestions. What I can say is that, yeah, long term it will be lethal, but at 2.1% it can still take a very long time. And by that time your node will have gotten more data, and the bad data would drop as a %. Also, if they implement my bonus suggestion as well, having parts of that old data repaired would also help reduce the amount of bad files.

But in general we’re kind of shooting in the dark without more data. Something doesn’t really add up, though. If your node only lost data from a brief period, a satellite with not a lot of data would also not have sent you a lot of data in that period. If you do have 0.3% of files lost, that would be really curious, as it suggests you missed pieces from about 0.3% of the total lifetime of your node. If it really is 6 months old, that would be roughly half a day. I doubt you lost that much, so it’s probably a lot lower.

I’m not sure what that is referring to, but yeah, logs will give you the best estimation of what % might be lost. Try to get logs for a long period of time to get the best estimate.

I can answer that: it’s piece-based. The satellite first picks a random piece to audit and then picks a random stripe within that piece. So the piece size doesn’t matter, just the number of pieces.
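In other words, selection is uniform over pieces, not bytes. Roughly like this (the stripe size is a placeholder; the point is only that a 512-byte piece and a 2 MB piece are equally likely to be picked):

```python
import random

STRIPE_SIZE = 256  # placeholder stripe size in bytes, just for illustration

def pick_audit(pieces: list[tuple[str, int]]) -> tuple[str, int]:
    """pieces = [(piece_id, size_in_bytes), ...]"""
    piece_id, size = random.choice(pieces)                    # 1) uniform over pieces, size plays no role
    stripe = random.randrange(max(1, size // STRIPE_SIZE))    # 2) random stripe within that piece
    return piece_id, stripe

# A tiny 512-byte piece and a 2 MB piece have exactly the same audit probability:
audit_target = pick_audit([("tiny-piece", 512), ("big-piece", 2 * 1024 * 1024)])
```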

2 Likes