Hi, thanks for the response.
I’m currently running a workflow to take a deep look at the pieces involved in the DQ… it’s going to take a long time, as there are lots of systems to work through and many rows.
Looking at early results, though, there is a trend on AUDIT which I wasn’t expecting.
As an example, one of the pieces involved in the 10% DQ I traced is shown below.
Aug 8th - GET_AUDIT - Piece ID AAA, node responded with “ERROR - file not found” - Time Delta hh:mm:30.421
Aug 8th - GET_AUDIT - Piece ID AAA, node responded with “ERROR - file not found” - Time Delta hh:mm:30.421
This was a duplicate request being processed; the call trace is below.
file does not exist
storj.io/common/rpc/rpcstatus.Wrap:73
storj.io/storj/storagenode/piecestore.(*Endpoint).Download:546
storj.io/common/pb.DRPCPiecestoreDescription.Method.func2:228
storj.io/drpc/drpcmux.(*Mux).HandleRPC:33
storj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58
storj.io/drpc/drpcserver.(*Server).handleRPC:122
Then:
Aug 16th - GET_AUDIT - Piece ID AAA, node responded with “ERROR - file not found” - Time Delta hh:mm:05.341
Aug 16th - GET_AUDIT - Piece ID AAA, node responded with “ERROR - file not found” - Time Delta hh:mm:05.341
Again, a duplicate request was processed for the same PIECE_ID (this really confuses me; my understanding was that a failed audit on a piece marks it off the node, and could trigger a repair).
Then:
Aug 24th - GET_REPAIR - Piece ID AAA, node responded with “ERROR - file not found” - Time Delta hh:mm:18.040
No duplicate on this one; again, it’s the same Piece ID that has already failed 2 audits, and this repair is part of the “Cluster of Non-Random pieces”.
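For reference, flagging these duplicates doesn’t need anything clever; it’s essentially a group-by on piece ID, action and timestamp, something like the sketch below (the CSV input and the piece_id / action / timestamp column names are placeholders for whatever your own export looks like):

```python
# Minimal sketch of the duplicate check: flag any (piece_id, action, timestamp)
# combination seen more than once in the export. Column names are placeholders
# for whatever the real extract uses.
import csv
from collections import Counter

def find_duplicates(path):
    counts = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            key = (row["piece_id"], row["action"], row["timestamp"])
            counts[key] += 1
    return {key: n for key, n in counts.items() if n > 1}

if __name__ == "__main__":
    for (piece_id, action, ts), n in sorted(find_duplicates("audit_export.csv").items()):
        print(f"{piece_id} {action} {ts} seen {n} times")
```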
So:

- Why are we sending duplicate GET_AUDITs in some cases? I’m really surprised the delta is exactly the same. Again, I haven’t got all the details, but the firewall segment offset stamped on the data is the same, which either means only one request was received and the node managed to duplicate it in the StorageNode code, or two parallel GET_AUDIT requests were sent over clearnet at EXACTLY the same time (hard, but not impossible, and often a side effect of leaking clusters, bad network teaming, or even network card drivers).
- Why, after a clearly failed audit where the file is not available, do we retry again days later? It’s clearly going to fail again. My understanding was the file would be marked as lost on that node and no further requests would be made, but that doesn’t seem to be the behaviour seen.
- The starting point of all this was the GET_REPAIR: it was part of the 10% node burst of pieces and shares the same Piece ID as the previously failed audits… why was the piece still being tagged to the node, with 2 failed audits days before…
…I’m still checking node and disk IO - it would be interesting if the duplicates are linked to busy disks causing timeouts.
#edit: nope, the latency from the start of the download request to the node replying with a failed download is < 200ms on average
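(For anyone wanting to reproduce that latency check, one way is to pair each “download started” line with the matching “download failed” line for the same Piece ID and Action in the node log, roughly as below. This assumes the default text log format with a trailing JSON payload; the message text and field names may differ between node versions.)

```python
# Rough sketch: pair "download started" with the next "download failed" for the
# same Piece ID + Action in the node log and report the gap. Assumes the default
# text log with a trailing JSON payload; adjust the regex if your format differs.
import json
import re
from datetime import datetime

LINE = re.compile(r"^(\S+)\s+\S+\s+piecestore\s+(download started|download failed)\s+(\{.*\})")

def failed_download_latencies(path):
    started = {}   # (piece_id, action) -> start time
    latencies = []
    with open(path) as f:
        for line in f:
            m = LINE.match(line)
            if not m:
                continue
            ts = datetime.fromisoformat(m.group(1).replace("Z", "+00:00"))
            fields = json.loads(m.group(3))
            key = (fields.get("Piece ID"), fields.get("Action"))
            if m.group(2) == "download started":
                started[key] = ts
            elif key in started:
                latencies.append((ts - started.pop(key)).total_seconds())
    return latencies

if __name__ == "__main__":
    lat = failed_download_latencies("storagenode.log")
    if lat:
        print(f"{len(lat)} failed downloads, avg {sum(lat) / len(lat) * 1000:.1f} ms")
```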
#edit: also thinking, could this be a database issue with transaction locking and rollback when the duplicate requests happen? The synchronised update to the satellite would then result in no update to the database, or a rollback because the locking/sync issue flags the transaction as failed and the default is to roll back, and therefore no update to the piece. That would be bad, as failed pieces would not be recorded as lost. (Again, I haven’t looked at the schema, so this might not even be possible given the constraints.)
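To illustrate the worry in that last edit, here is a toy example using SQLite (nothing to do with the real satellite schema, just the general lost-update pattern): two writers race to mark the same piece as failed, the loser hits the lock and rolls back, and if nothing retries, the piece never gets recorded as lost.

```python
# Toy illustration of the lost-update worry: two connections race to mark the
# same piece as failed; the second hits the lock and rolls back, and if nobody
# retries the piece is never recorded as lost. SQLite stands in for whatever
# the satellite actually uses; the real schema and locking will differ.
import sqlite3

db = "pieces_demo.sqlite"
setup = sqlite3.connect(db)
setup.execute("CREATE TABLE IF NOT EXISTS piece_status (piece_id TEXT PRIMARY KEY, status TEXT)")
setup.execute("INSERT OR REPLACE INTO piece_status VALUES ('AAA', 'healthy')")
setup.commit()
setup.close()

writer_a = sqlite3.connect(db, timeout=0)
writer_b = sqlite3.connect(db, timeout=0)

# Writer A opens a write transaction and holds it (standing in for the first
# of the duplicate requests still being processed).
writer_a.execute("BEGIN IMMEDIATE")
writer_a.execute("UPDATE piece_status SET status = 'audit_failed' WHERE piece_id = 'AAA'")

# Writer B (the duplicate) tries to write the same row, gets "database is
# locked", and rolls back. If the caller treats that as "already handled" and
# never retries, this update is simply lost.
try:
    writer_b.execute("BEGIN IMMEDIATE")
    writer_b.execute("UPDATE piece_status SET status = 'audit_failed' WHERE piece_id = 'AAA'")
    writer_b.commit()
except sqlite3.OperationalError as e:
    print("writer B:", e)          # database is locked
    writer_b.rollback()

# Now suppose writer A is *also* rolled back (e.g. the sync issue flags the
# transaction as failed): the piece stays 'healthy' even though the audit failed.
writer_a.rollback()

final = sqlite3.connect(db).execute("SELECT status FROM piece_status WHERE piece_id = 'AAA'").fetchone()
print("final status:", final[0])   # still 'healthy'
```

Obviously the real database will behave differently; the point is only that “duplicate request + rollback on conflict + no retry” is a combination that can silently drop the status change.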
CP