The audit score went down again

TWH · August 31, 2024, 4:15pm

I noticed the node was still getting these errors and that the audit score went down again. After checking the logs, it looks like the audit downloads are failing. The node's internet connection has been having issues lately, so the failed audits I am seeing could be unrelated to the migrate issue. Is anyone else having audit failures?

docker logs storagenode 2>&1 | grep GET_AUDIT | grep fail
2024-08-23T01:41:31Z    ERROR   piecestore      download failed {"Process": "storagenode", "Piece ID": "MGFDXEYZKSNRHPQYYZYDKFYYUUBCXID6CEOKWXV4RSE5G2JMDYEA", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET_AUDIT", "Offset": 652288, "Size": 256, "Remote Address": "34.124.214.89:43212", "error": "manager closed: read tcp 172.17.0.3:28967->34.124.214.89:43212: read: connection timed out", "errorVerbose": "manager closed: read tcp 172.17.0.3:28967->34.124.214.89:43212: read: connection timed out\n\tstorj.io/drpc/drpcmanager.(*Manager).manageReader:235"}
2024-08-23T01:43:19Z    ERROR   piecestore      download failed {"Process": "storagenode", "Piece ID": "DMSHEF5LQ4UPIJY23KW3L6P3SKIEYKECWYEP6RY6XZSPFTXFMMTQ", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET_AUDIT", "Offset": 1640448, "Size": 256, "Remote Address": "35.197.134.129:37419", "error": "manager closed: read tcp 172.17.0.3:28967->35.197.134.129:37419: read: connection timed out", "errorVerbose": "manager closed: read tcp 172.17.0.3:28967->35.197.134.129:37419: read: connection timed out\n\tstorj.io/drpc/drpcmanager.(*Manager).manageReader:235"}
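
To see whether the failures cluster on one satellite or one error type, the grep output can be summarized with a short script. This is only a rough sketch: it assumes each log line ends with a JSON object as in the lines above, the sample lines are shortened copies of those two entries, and the function names `parse_failed_audits` and `summarize` are made up for this example.

```python
import json
from collections import Counter

# Shortened copies of the two log lines above (some fields dropped for brevity).
SAMPLE = [
    '2024-08-23T01:41:31Z\tERROR\tpiecestore\tdownload failed\t'
    '{"Piece ID": "MGFDXEYZKSNRHPQYYZYDKFYYUUBCXID6CEOKWXV4RSE5G2JMDYEA", '
    '"Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", '
    '"Action": "GET_AUDIT", '
    '"error": "manager closed: read tcp 172.17.0.3:28967->34.124.214.89:43212: '
    'read: connection timed out"}',
    '2024-08-23T01:43:19Z\tERROR\tpiecestore\tdownload failed\t'
    '{"Piece ID": "DMSHEF5LQ4UPIJY23KW3L6P3SKIEYKECWYEP6RY6XZSPFTXFMMTQ", '
    '"Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", '
    '"Action": "GET_AUDIT", '
    '"error": "manager closed: read tcp 172.17.0.3:28967->35.197.134.129:37419: '
    'read: connection timed out"}',
]


def parse_failed_audits(lines):
    """Yield (timestamp, fields) for each failed GET_AUDIT download line."""
    for line in lines:
        if "download failed" not in line:
            continue
        brace = line.find("{")
        if brace == -1:
            continue
        try:
            fields = json.loads(line[brace:])
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if fields.get("Action") != "GET_AUDIT":
            continue
        yield line.split()[0], fields


def summarize(lines):
    """Count failed audits per (satellite ID, error reason)."""
    counts = Counter()
    for _, fields in parse_failed_audits(lines):
        # Keep only the trailing reason, e.g. "connection timed out".
        reason = fields.get("error", "").rsplit(": ", 1)[-1]
        counts[(fields.get("Satellite ID"), reason)] += 1
    return counts


for (satellite, reason), n in summarize(SAMPLE).items():
    print(n, reason, satellite)
```

To run it against the live node, `SAMPLE` could be replaced with `sys.stdin` and the script fed from `docker logs storagenode 2>&1`.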

Alexey · September 1, 2024, 5:54am

The audit score may fall if the node times out on GET_AUDIT requests. For that to happen, the node must be online and respond to the audit request, but be unable to provide a stripe from the piece within the 15-second timeout. Such a piece will be requested three more times later by GET_REPAIR. If all attempts fail, the audit is considered failed.
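
The rule described above can be written down as a toy model. This is a simplified sketch, not the satellite's actual code: the 15-second timeout and the three extra attempts come from the description above, while the function name `audit_outcome` and the "pending" state are assumptions made for illustration.

```python
AUDIT_TIMEOUT_SECONDS = 15  # timeout for delivering the stripe
EXTRA_ATTEMPTS = 3          # the piece is requested three more times


def audit_outcome(attempt_seconds):
    """Classify an audit from the response time of each attempt so far.

    attempt_seconds lists how long the node took on the first request
    and on each later re-request, in order.
    """
    max_attempts = 1 + EXTRA_ATTEMPTS
    for seconds in attempt_seconds[:max_attempts]:
        if seconds <= AUDIT_TIMEOUT_SECONDS:
            return "passed"
    if len(attempt_seconds) >= max_attempts:
        return "failed"   # every attempt timed out
    return "pending"      # timed out so far, re-requests still to come


print(audit_outcome([20]))              # one timeout, not failed yet
print(audit_outcome([20, 3]))           # a later attempt succeeded
print(audit_outcome([20, 20, 20, 20]))  # all four attempts timed out
```

In this model, a single timed-out GET_AUDIT like the ones in the log is not yet a failed audit; it only becomes one if the later attempts also time out.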

Usually it’s the result of a hardware issue, where the node is able to respond but unable to provide the requested piece. The readable check should crash the node in that case, unless you changed the readable check timeout from the default 1m0s to some very high value, so that the node can no longer detect a hardware issue.
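
To check whether the timeout was raised, the configured value can be compared with the default. The sketch below only parses Go-style duration strings such as 1m0s. The helper names are made up for this example, and where the value lives in your setup is an assumption: I believe the option is `storage2.monitor.verify-dir-readable-timeout` in config.yaml, but please verify that against your own config.

```python
import re

# Seconds per unit, for Go-style duration strings such as "1m0s" or "1h30m".
_UNITS = {"ns": 1e-9, "us": 1e-6, "µs": 1e-6, "ms": 1e-3,
          "s": 1.0, "m": 60.0, "h": 3600.0}
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|s|m|h)")

DEFAULT_READABLE_TIMEOUT = "1m0s"  # default mentioned above


def parse_duration(text):
    """Convert a Go-style duration string to seconds."""
    total, pos = 0.0, 0
    for match in _TOKEN.finditer(text):
        if match.start() != pos:
            raise ValueError(f"invalid duration: {text!r}")
        total += float(match.group(1)) * _UNITS[match.group(2)]
        pos = match.end()
    if pos == 0 or pos != len(text):
        raise ValueError(f"invalid duration: {text!r}")
    return total


def is_raised(configured, default=DEFAULT_READABLE_TIMEOUT):
    """True if the configured timeout is longer than the default."""
    return parse_duration(configured) > parse_duration(default)


print(is_raised("10m0s"))
print(is_raised("1m0s"))
```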

So the offline issue is unrelated: an offline node cannot respond to the audit request at all, so it cannot fail it.