Yes, that is correct. One little detail: the repair job will download 29 pieces. If your node is not in that subset, the repair job will not notice the missing piece and will not replace it. So there is no guarantee that the repair job will notice it, but it is likely.
Thanks for these details.
Something feels weird to me then: Initially, if the number of pieces of a segment were to fall below 35, the repair job would repair only 29 pieces, so the total number of available pieces for this segment would be 64.
So… why are we uploading 80 pieces to the network when first uploading a segment, if this number is never used again down the line? Segments will end up having between 35 and 64 pieces over time.
Or maybe you meant the repair job is going to download only 29 pieces because that's enough to reconstruct the whole segment, and then it's going to restore all 80 pieces (or any number it wants)?
Actually I had forgotten this had been discussed a long time ago here:
And it seems it's actually up to each satellite to handle this the way they want, balancing security, costs, … So I guess I kind of answered my own question
The satellite will download 29 pieces. If some of these pieces are corrupted or missing, it will download additional pieces until it has 29 good ones or has contacted all storage nodes holding that segment.
If the satellite was able to reconstruct the segment, it will recreate the missing pieces. For this step it also takes into account all the download errors from the previous step and replaces those pieces as well. It will then upload the new pieces to new storage nodes. We expect a few of these uploads to fail; for that reason we target 80 plus a few extra. At the end of repair the segment should have around 80 pieces, plus or minus a few upload errors.
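If it helps, the repair flow described above can be sketched roughly like this. This is a simplified simulation, not the real satellite code: the 29/80 numbers come from this thread, while everything else (the 5% upload-failure rate, the small extra margin) is an illustrative assumption.

```python
import random

def repair_segment(piece_health, needed=29, target=80, extra=3):
    """Simplified sketch of the repair flow (hypothetical, not the real code).

    piece_health: one bool per stored piece, True if the piece is healthy.
    Returns the approximate piece count after repair, or None on failure.
    """
    # Step 1: download pieces until `needed` good ones are in hand,
    # or every node holding the segment has been contacted.
    good = 0
    for healthy in piece_health:
        if healthy:
            good += 1
        if good >= needed:
            break
    if good < needed:
        return None  # not enough pieces: the segment cannot be reconstructed

    # Step 2: recreate pieces, replacing both the ones that were already
    # missing and the ones that errored during download, aiming for
    # `target` plus a small margin to absorb upload failures.
    healthy_total = sum(piece_health)
    to_upload = (target + extra) - healthy_total

    # Step 3: upload to new nodes; a few uploads are expected to fail.
    uploaded = sum(1 for _ in range(to_upload) if random.random() > 0.05)
    return healthy_total + uploaded
```

With, say, 50 healthy pieces out of 55, this lands around 80 pieces after repair, give or take a few failed uploads.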
I didn't want storagenode9 to get suspended, as this doesn't help with the question of storagenode1. So I've put the data back. Just not the data it will be expecting.
I think it is conclusive…
If the blobs are corrupt, then audits pass at the storagenode end:
2020-08-21T17:50:08.567Z INFO piecestore download started {"Piece ID": "W2D7LKOATVEZPNSI7EDPUJTHXIC4YEES6PUBVRPXT6RYOBUJN35A", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "GET_AUDIT"}
2020-08-21T17:50:08.726Z INFO piecestore downloaded {"Piece ID": "W2D7LKOATVEZPNSI7EDPUJTHXIC4YEES6PUBVRPXT6RYOBUJN35A", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "GET_AUDIT"}
2020-08-21T18:18:09.723Z INFO piecestore download started {"Piece ID": "4HQW3LQHATPJ4CQRMAMDZNVP6ZKEWCVD3ROUXBU7CGJOZZR7XSHQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_AUDIT"}
2020-08-21T18:18:09.901Z INFO piecestore downloaded {"Piece ID": "4HQW3LQHATPJ4CQRMAMDZNVP6ZKEWCVD3ROUXBU7CGJOZZR7XSHQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_AUDIT"}
2020-08-21T21:19:40.446Z INFO piecestore download started {"Piece ID": "FCAC32LW33GHPJZCWOV4QEHC6PD3BSWKQ6E3QFG7PT5AVWKLUV2Q", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2020-08-21T21:19:40.666Z INFO piecestore downloaded {"Piece ID": "FCAC32LW33GHPJZCWOV4QEHC6PD3BSWKQ6E3QFG7PT5AVWKLUV2Q", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2020-08-22T01:49:45.544Z INFO piecestore download started {"Piece ID": "HB7GFAT5N6W4PR6GSZDBSXUK4KBRVSRHNHLUDWF6SYBEKWX44NVQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2020-08-22T01:49:45.763Z INFO piecestore downloaded {"Piece ID": "HB7GFAT5N6W4PR6GSZDBSXUK4KBRVSRHNHLUDWF6SYBEKWX44NVQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2020-08-22T02:22:16.819Z INFO piecestore download started {"Piece ID": "CNBGQGRYWOT7IEOPFW25LQEFQ32TLZXLJVONHCKKMAKS4ANCTIFA", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "GET_AUDIT"}
2020-08-22T02:22:17.076Z INFO piecestore downloaded {"Piece ID": "CNBGQGRYWOT7IEOPFW25LQEFQ32TLZXLJVONHCKKMAKS4ANCTIFA", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "GET_AUDIT"}
2020-08-22T03:48:06.549Z INFO piecestore download started {"Piece ID": "K64UTGIJPPVL3EWMSAF7OGSM7RFKBIRIG3VWUUSFLGSEKNSGWMZA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2020-08-22T03:48:06.771Z INFO piecestore downloaded {"Piece ID": "K64UTGIJPPVL3EWMSAF7OGSM7RFKBIRIG3VWUUSFLGSEKNSGWMZA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
No failures since I copied corrupted data into all the blobs.
There is some pretty good data recovery software out there. Often, even when the OS or rsync isn't able to read a file, it's because of certain latency failsafes that allow for continued system operation on the host bus, even with some devices on the bus having high latency…
Thus a read will be cancelled because of such timeouts and not because the drive is truly unable to read the data; it's just not able to read the data within the allotted time frame…
Data recovery software mitigates this and can at times spend hours trying to read data from a broken drive, and depending on the software and hardware type it has many parameters to help it try to recover the data… so when you have corrupted / lost data, the data recovery software approach can be well worth the time…
I also found that sometimes when a drive sits long enough on a shelf it is often much easier to read the data afterwards… no idea why this is, and it's difficult to really get much detail since I don't have hundreds of drives to test on… and of course it is without a doubt not advantageous in all cases…
And it's not really very useful for storagenode data… but it has helped me recover data in the past…
and for the data recovery software you will most likely need a jolly roger version.
I mean, I didn't have $5 to 10k to shell out on a piece of software that I'd use once or twice in a decade.
No, there's ddrescue; it is good enough.
but you realise i trashed storagenode9 on purpose to replicate the failure to log audit fails on storagenode1, yes?
Should have been a reply to this… forgot to quote it
@andrew2.hart
Looks like some pretty interesting experiments you've got going on there though…
I'll keep lurking for sure.
Oh yeah, sorry. On a phone, I didn't scroll that far back.
Around mid-August, I noticed one of my nodes was starting to have problems (empty dashboard, audit score going down on a couple of satellites).
Did investigate a bit, repaired database files… but the problem kept getting worse. I ended up losing all databases for that node, so as I said in one of my earlier posts:
This transfer (with 2000 corrupted files) took place on the 18th of August.
Since then, I can see satellites are really unsure about the health of this node (and they are damn right!) and audit scores are constantly going up and down, sometimes little by little, sometimes with bigger leaps, and I'm not sure why… it's very inconsistent from one audit to another. I guess the answer is somewhere in the source code and the white paper…
To get a better idea of what's going on, I coded a little script that keeps track of the audit score for this particular node, for each satellite; results have been available online in real time since the 21st of August, here: Audits N3 (scores) - ThingSpeak IoT
Thought it might be of interest to some of you guys, out of curiosity. So I'm sharing it.
For anyone reading this in the future, the link might not be available anymore, so here is a screenshot of the current state of these graphs (26th Aug. EDIT: 19th Sept.):
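For anyone curious, the script doesn't need to be fancy; something along these lines works. This is a sketch: the node-API endpoint and JSON layout are assumptions that vary by storagenode version, and the ThingSpeak key is a placeholder.

```python
import json
import urllib.request

NODE_API = "http://localhost:14002/api/sno/satellites"  # assumed endpoint
THINGSPEAK_KEY = "YOUR_WRITE_API_KEY"                   # placeholder

def audit_scores(payload):
    """Extract per-satellite audit scores from a dashboard-style payload.

    The JSON layout here is an assumption; adapt the keys to whatever
    your node version actually returns.
    """
    return {sat["id"]: sat["audit"]["score"] for sat in payload["audits"]}

def poll_and_push():
    # Fetch the node's dashboard data, then forward one score per field.
    with urllib.request.urlopen(NODE_API) as resp:
        payload = json.load(resp)
    scores = audit_scores(payload)
    # ThingSpeak accepts simple GET updates: field1..field8, one per value.
    fields = "&".join(f"field{i + 1}={v}" for i, v in enumerate(scores.values()))
    urllib.request.urlopen(
        f"https://api.thingspeak.com/update?api_key={THINGSPEAK_KEY}&{fields}"
    )
```

Run it from cron every few minutes and ThingSpeak draws the graphs for you.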
That is just the nature of randomness. Let's say you are failing 2 out of 10 audits. Your audit score will be different depending on the order of these 2 audit failures. The best audit score would come from one failure, then 5 successes, then the other failure. The lowest audit score would come from 8 successes with the last 2 being failures. Every other combination is somewhere in between. The audit system is written in a way that 10-ish audit failures in a row will trigger disqualification, so 2 audit failures in a row already have a bigger impact.
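To make the order-dependence concrete, here is a small simulation of a beta-reputation score of the kind described in the data science papers linked later in this thread. The forgetting factor λ = 0.95 and the initial α = 1, β = 0 are illustrative assumptions, not the satellites' actual settings.

```python
def audit_score(outcomes, lam=0.95, w=1.0, alpha=1.0, beta=0.0):
    """Beta-reputation score after a sequence of audits (True = pass).

    Older audits are discounted by the forgetting factor `lam`, so the
    same number of failures hurts more when they come last.
    """
    for passed in outcomes:
        v = 1.0 if passed else -1.0
        alpha = lam * alpha + w * (1 + v) / 2
        beta = lam * beta + w * (1 - v) / 2
    return alpha / (alpha + beta)

S, F = True, False
spread = audit_score([F] + [S] * 8 + [F])  # failures far apart
streak = audit_score([S] * 8 + [F, F])     # failures back-to-back
# Same pass rate, but back-to-back failures end with a lower score.
```

With these assumed parameters, `streak` comes out noticeably lower than `spread`, which matches the intuition that consecutive failures hit harder.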
Same here. My "fix" of adding some free TB to the node seems to be working and diluting the bad data. It is now usually between 90% and 98%. But at any time it could go bang, bang, bang, bang and fall below 60%.
I also had a hunt for repeated audits, based on another thread somewhere. I found that 1 out of 100 piece audits gets a repeat audit at some point. No triple audits at all.
I recently wondered about matching the audits in the log with the drops in audit score and truncating those files, as this would convert them to "suspended" audits. I got stuck and gave up, but I think it could be done.
@littleskunk I see! Makes sense: it does not surprise me that some kind of clever algorithm takes the most recent audits into account when gauging how reliable the node is.
However, what happened with satellite "Europe North" is weird then: it had a score of 1 and took an incredibly bad hit (1 to 0.875) because of one failed audit on the 22nd of Aug.
I mean… 3 failed audits in a row like this one and the node is gone for good on this satellite!
That said, I do not have the score history from before that, so maybe it had other failed audits; but if that were the case I wouldn't have expected its score to have already come back up to 1, as recovering from failed audits usually seems to take time.
Could you also track the audit alpha and beta values?
Two nodes can both have a score of 1 but with different alpha values. That will change how much impact a failed audit has. It also affects how fast you will recover.
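A quick illustration of that point, using the same style of α/β update as in the reputation papers (the forgetting factor and the two alpha values are assumptions for illustration): two nodes can both sit at a score of 1.0 (β = 0), yet one failed audit hits the low-α node far harder.

```python
LAM, W = 0.95, 1.0  # illustrative forgetting factor and weight

def fail_once(alpha, beta):
    """Apply a single failed audit to an (alpha, beta) pair."""
    return LAM * alpha, LAM * beta + W

# e.g. a fresh node vs. a long-running node (assumed alpha values);
# beta = 0 means both start at a perfect score of 1.0.
for alpha in (1.0, 19.0):
    a, b = fail_once(alpha, 0.0)
    print(f"alpha={alpha}: score after one failed audit = {a / (a + b):.3f}")
```

Under these assumptions the fresh node drops below 0.6 from a single failure, while the high-α node barely moves, which would also explain why recovery speed differs.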
While I feel for SNOs suffering file loss due to unforeseen circumstances, it kind of worries me that a node with as significant a loss as 5% is able to survive. Many SNOs have complained that the scores drop too fast, but I think the bigger issue there was that they dropped for the wrong reasons (a major one of which will be solved in v1.11). Personally, I'm more concerned by the scores going back up fast enough to make this kind of loss survivable. Since the satellite is not aware of which pieces were lost by the node, allowing for a 5% risk factor seems very high. Luckily the repair threshold is currently set pretty high.
But if all nodes had lost 5% of data (which is obviously not going to happen, but just a worst case scenario), with the original RS settings of 29 minimum and 35 repair, there would be about a 0.5% chance of each piece being lost.
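One way to sanity-check a figure like that is a simple binomial model, assuming (simplistically) that each of a segment's pieces is lost independently with probability 5%; the exact number depends on how many pieces each segment actually holds, so treat this as a rough estimate, not the thread's exact figure.

```python
from math import comb

def p_segment_unrecoverable(pieces=35, needed=29, p_lost=0.05):
    """Probability that fewer than `needed` of `pieces` survive, when each
    piece is lost independently with probability `p_lost` (a simplification).
    """
    max_losable = pieces - needed  # how many losses the segment can absorb
    p_ok = sum(
        comb(pieces, k) * p_lost**k * (1 - p_lost)**(pieces - k)
        for k in range(max_losable + 1)
    )
    return 1 - p_ok
```

A tiny per-segment probability, but multiplied across millions of segments it is still real data loss, which is presumably why the repair threshold sits well above the minimum.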
That sounds scary, but the network can obviously deal with a random very small amount of nodes like this. Itâs just more than I would think the network should accept.
I'll be honest, me too. I really thought this node would die, and it would have been fair.
I feel like I would do the network a favor by killing my node myself, but I really wanted to see these figures go up and down, out of curiosity.
I've always felt like the 35 threshold was dangerously low, but who am I to judge; data scientists know better.
That said, 5% of the files were lost initially, but because this node is very young it keeps receiving data, so the lost files are more and more "diluted" in the sea of files it is holding. Still unfair, but today those lost files probably represent only about 2% of all its files.
Well, I guess I could, but that wouldn't be simple to display on the same dashboard with my free account on ThingSpeak. Or maybe on a separate dashboard, but that wouldn't be very practical…
It would be better to rework the whole thing so scores are on the same chart anyway, togglable and all.
Surely some other people have done that already?
What are alpha and beta values by the way? Is this explained somewhere?
This might be way more information than you wanted, but see these papers produced by our data science team: Reputation Scoring Framework and Extending Ratios to Reputation.
I'd say keeping it running and reporting back is even better. The repair threshold is currently set at 52, so the network will be fine. This is the time to expose potential weaknesses.
Maybe when the audit score drops, a node should receive less new data from the satellite and have more audits performed on it? That way it would be harder to hide non-existent pieces in a sea of new pieces. I'm thinking of something similar to the vetting process, where incoming data is capped at 5%.
Otherwise it would be a funny way to avoid being DQed: add much more space and hope to win the race, with more new data flowing in than missing data getting caught by audits.
@kevink I guess that'd be better indeed. I have not taken the time to investigate how these monitoring tools work, so I won't be setting that up anytime soon, I think. Thanks for the suggestion though; I did not know there was a Prometheus connector for Storj
Good spirit
I'm tracking these too now; I reused my script, so it was easy and fast to set up.
That's far from ideal, as all graphs are separate, but the data could eventually be exported and merged into more insightful graphs later…
Here they are (beware: the x-axes do not have the same scale because the score and alpha/beta trackers weren't started at the same time):
- Audit scores: Audits N3 (scores) - ThingSpeak IoT (same as above)
- Audit alphas: Audits N3 (alphas) - ThingSpeak IoT
- Audit betas: Audits N3 (betas) - ThingSpeak IoT