Missing Piece before the wipe? - Critical audit alert

Where did you get this dashboard? I want it too!

Yes, because the “missing” files were not audited yet. You need the complete log to be sure.

Check out this thread:

Negative. If you scroll up and look at what the real problem was originally, this failed-audit issue will continue to occur until the satellite stops asking for piece IDs the SNO never received, or piece IDs that were deleted but the satellite is still requesting for audit.


Same here - got AUDITs of pieces that have been deleted.

I think the fix is still in progress because it is not described in the changelog.

@littleskunk, here is what would make this more robust:
If a piece ID failed an audit, the storage node would be instructed to download a repair of that piece ID from the network.

I thought this was in the white paper, but I wonder when this will be released to SNOs?

That would make it more expensive but not more robust. Who will pay for the repair traffic?


In this case the node that lost the data could pay a heavy price.
But whether that makes the network more robust is doubtful. If you lost one piece, you could have lost multiple other pieces too, but those might not get audited within the next weeks.

We haven’t lost a single file in a few months now. Our current durability is 100%. Why should we spend extra money? More robust than 100%? That doesn’t work!

Egress traffic isn’t paid to SNOs.
Ingress from another node that has passed an audit would be considered repair egress on the SNO that failed the audit.

Who pays for repair traffic now?
Is a repair mechanism in place at present?

This was only brought up due to the delete-then-audit-fail bug that was discovered. Once this is resolved, things will be more robust and SNOs wouldn’t have to deal with failed audits for piece IDs that they no longer store or never received. A secondary observation that was discussed relates to how a satellite uploads to an SNO and how the confirmation of that upload is handled. If the handshake of the upload isn’t handled appropriately, we are still experiencing symptoms where the SNO never received a piece ID that the satellite later requests an audit for.

2 issues observed so far:

  1. Piece ID deleted, then the audit for that piece ID failed afterward, affecting the score.
  2. Audit failed for a piece ID the SNO never received.

For item 2, this may or may not be related to how the storage node is shut down or restarted while an upload is in progress or completing and gets interrupted, where the satellite assumes the SNO has the piece ID but the SNO never confirmed it, so the piece never physically existed.

On a separate note:
If an SNO failed an audit for a piece ID, does the satellite continue to ask that SNO for the same piece ID, or is this now a one-off? I.e., does it avoid spamming the SNO with requests that fail audits for the same piece ID, which would almost immediately push the SNO into the bad score of 0.59 that I experienced early on?
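
For anyone curious how a run of failed audits translates into a score like 0.59, here is a rough Go sketch of an alpha/beta reputation update of the kind the satellite uses. The parameter values (lambda = 0.95, weight = 1, initial alpha = 1, disqualification threshold 0.6) are assumptions for illustration only and may not match the deployed config.

package main

import "fmt"

// Rough sketch of an alpha/beta audit-reputation update.
// The constants below are assumptions for illustration, not the deployed values.
const (
	lambda      = 0.95 // assumed memory/forgetting factor
	weight      = 1.0  // assumed weight of a single audit
	dqThreshold = 0.6  // assumed disqualification threshold
)

// update applies one audit result: v = +1 for a passed audit, -1 for a failed one.
func update(alpha, beta, v float64) (float64, float64) {
	alpha = lambda*alpha + weight*(1+v)/2
	beta = lambda*beta + weight*(1-v)/2
	return alpha, beta
}

func score(alpha, beta float64) float64 { return alpha / (alpha + beta) }

func main() {
	// Start from a node whose reputation has saturated after many passed audits.
	alpha, beta := 1.0, 0.0
	for i := 0; i < 200; i++ {
		alpha, beta = update(alpha, beta, +1)
	}
	fmt.Printf("saturated score: %.3f\n", score(alpha, beta))

	// Feed it consecutive failed audits and watch the score fall.
	for i := 1; i <= 12; i++ {
		alpha, beta = update(alpha, beta, -1)
		fmt.Printf("after %2d consecutive failures: %.3f (below threshold: %v)\n",
			i, score(alpha, beta), score(alpha, beta) < dqThreshold)
	}
}

With these assumed numbers, roughly ten consecutive failures are enough to cross 0.6, which is in the same ballpark as the 0.59 mentioned above.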

To repair a single missing piece, someone has to download 29 pieces and reconstruct the file. That is a huge amount.
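
To put a rough number on that, here is a back-of-the-envelope sketch in Go. The 29 comes from the post above; the piece size is just an assumed figure for illustration.

package main

import "fmt"

func main() {
	const (
		requiredPieces = 29              // pieces needed to reconstruct the segment (from the post above)
		pieceSizeBytes = 2 * 1024 * 1024 // assumed piece size of ~2 MiB, for illustration only
	)

	// Even to regenerate a single lost piece, a repair worker first has to
	// download enough pieces to rebuild the whole segment and then re-encode it.
	downloadBytes := requiredPieces * pieceSizeBytes
	fmt.Printf("repairing 1 lost piece of %d bytes costs roughly %d bytes of download (%dx overhead)\n",
		pieceSizeBytes, downloadBytes, requiredPieces)
}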

Yes, of course. Again, we haven’t lost a single file in a few months now. So why should we change it if it doesn’t increase the durability?

Wait, what? The audit failed because the file was deleted from the storage node! Why should we make it even worse by putting it back on the storage node? That makes no sense.

I agree with that, except for the “more robust” part. Once we have resolved the bugs around audits there is no need to change anything, because nobody should have to deal with failed audits. Your proposal doesn’t change that fact.

That observation is correct; it has meanwhile been fixed, and again no change to the audit system is needed.

You got close, but this assumption is not correct. As soon as the storage node receives an upload stream it will write a log message. There is nothing that could interrupt the storage node and remove a log message. → The upload itself never started.

Yes. Right here: storj/satellite/audit/verifier.go at c1fbfea7fac2fdb883438e67473abd4d88d4800e · storj/storj · GitHub

We call that containment mode. You will receive the same audit request 3 times in a row, and only if the last one fails will the score be updated. This is needed to make sure nodes can go offline during an audit request and will not get a penalty for that. As soon as they come back online they will get a second and a third chance to avoid any penalty.
There are a few error messages that will count as a failure on the first run. If the storage node responds with “I don’t have that piece”, there is no reason to ask the same question 3 times. It will not change the outcome.
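
For readers who prefer code to prose, here is a minimal Go sketch of that decision flow. It is not the actual verifier.go logic (see the link above), just an illustration of the behaviour described: a “file does not exist” response fails immediately, while other errors put the node into containment for up to 3 attempts.

package main

import (
	"errors"
	"fmt"
)

// ErrFileNotFound stands in for the "file does not exist" response seen in the logs.
var ErrFileNotFound = errors.New("file does not exist")

const maxContainmentAttempts = 3 // "3 times in a row", as described above

type outcome string

const (
	outcomePass    outcome = "audit passed"
	outcomeContain outcome = "node contained, audit will be retried"
	outcomeFail    outcome = "audit failed, score updated"
)

// evaluate decides what happens after a single audit attempt.
func evaluate(err error, attempt int) outcome {
	switch {
	case err == nil:
		return outcomePass
	case errors.Is(err, ErrFileNotFound):
		// "I don't have that piece" is final: retrying cannot change the answer.
		return outcomeFail
	case attempt < maxContainmentAttempts:
		// Transient problem (e.g. the node went offline): give it another chance.
		return outcomeContain
	default:
		return outcomeFail
	}
}

func main() {
	fmt.Println(evaluate(nil, 1))                              // audit passed
	fmt.Println(evaluate(ErrFileNotFound, 1))                  // immediate failure
	fmt.Println(evaluate(errors.New("connection refused"), 1)) // contained and retried
	fmt.Println(evaluate(errors.New("connection refused"), 3)) // failure on the last attempt
}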


Awesome, thanks for clarifying.


I also got the “piece ID deleted, then audit for that piece ID failed afterward” issue affecting the score. I noticed this when my score dropped. The piece was deleted 16 days before the audit attempt:

# docker logs  storagenode 2>&1 | grep -i audit | grep -i fail
2019-10-07T05:08:34.491Z        INFO    piecestore      download failed {"Piece ID": "GBJQEXIVBWX5VT46IBLN2NY4JQH7DIX67YXKF4HY6LIQW5FWFJFA", "SatelliteID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT", "error": "rpc error: code = NotFound desc = file does not exist"}
# grep -i GBJQEXIVBWX5VT46IBLN2NY4JQH7DIX67YXKF4HY6LIQW5FWFJFA logs/*
logs/25-09-19.log:2019-09-21T16:39:27.420Z      INFO    piecestore      deleted {"Piece ID": "GBJQEXIVBWX5VT46IBLN2NY4JQH7DIX67YXKF4HY6LIQW5FWFJFA"}
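
If anyone wants to check their own node for this pattern, here is a small Go sketch that cross-references the log for piece IDs that were deleted and later show up as failed GET_AUDIT downloads. The line patterns are taken from the excerpts in this thread, so adjust them if your log format differs; run it e.g. as docker logs storagenode 2>&1 | go run check_deleted_audits.go (the file name is just an example).

package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strings"
)

func main() {
	// Matches the Piece ID field used in the storagenode log lines quoted above.
	pieceID := regexp.MustCompile(`"Piece ID": "([A-Z0-9]+)"`)
	deleted := map[string]bool{}

	scanner := bufio.NewScanner(os.Stdin)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024)
	for scanner.Scan() {
		line := scanner.Text()
		m := pieceID.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		id := m[1]
		switch {
		case strings.Contains(line, "deleted"):
			// Remember every piece the node has deleted.
			deleted[id] = true
		case strings.Contains(line, "GET_AUDIT") && strings.Contains(line, "download failed"):
			// Flag audits that failed for a piece we previously deleted.
			if deleted[id] {
				fmt.Println("audit failed for previously deleted piece:", id)
			}
		}
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "error reading log:", err)
	}
}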

Does this fall under the known bugs discussed in this thread? I thought this was fixed already.


During the last delete script there was a bug that was auditing deleted pieces. I am not sure if a fix has already been implemented.

I think the fix has already been released.

This seems like a problem. An audit of those deleted pieces can happen any day. I think everyone should recheck their logs just to make sure they don’t have any more such pieces.

I have seen some GET_REPAIR requests in my logs in the past 48 hours.

I also see that the satellite is asking for pieces that do not exist (or were deleted); see details below:

root@docker01:/home/odmin/storj_success_rate# docker logs storagenode 2>&1 | grep GET_AUDIT | grep 'download failed' | awk '{print $8 $10}' | sort | uniq -c | sort -g -r
      1 "X7PV6JCU4FIFAN3B6PSGEXWDORBXKHZEATYSMBRHA2FKDGKEJ7NA","118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW",
      1 "U73KXML2OQ7SAXFKABISHFVXN7YCQNUH7JXD2W7XISEZR3DAA5BQ","118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW",
      1 "NYLOU6YJEKBOR2K3B2JI7K2KH5I7EVVXHVGE7GLMSXPLFYR5KLDQ","118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW",
root@docker01:/home/odmin/storj_success_rate#
root@docker01:/home/odmin/storj_success_rate#
root@docker01:/home/odmin/storj_success_rate# docker logs storagenode 2>&1 | grep "X7PV6JCU4FIFAN3B6PSGEXWDORBXKHZEATYSMBRHA2FKDGKEJ7NA"
2019-10-04T02:49:25.931Z        INFO    piecestore      download started        {"Piece ID": "X7PV6JCU4FIFAN3B6PSGEXWDORBXKHZEATYSMBRHA2FKDGKEJ7NA", "SatelliteID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT"}
2019-10-04T02:49:25.931Z        INFO    piecestore      download failed {"Piece ID": "X7PV6JCU4FIFAN3B6PSGEXWDORBXKHZEATYSMBRHA2FKDGKEJ7NA", "SatelliteID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT", "error": "rpc error: code = NotFound desc = file does not exist"}

I thought I read something about it being released, but I don’t know where and when. Maybe my head made it up. If not, I guess the devs are working on it.

Maybe @Alexey knows about this