Audit weirdness (us2 satellite). (Team aware. No additional information is needed)

I am now seriously wondering what happens if nodes are disqualified because of this error?

How does Storj intend to compensate node operators? We SNOs are obviously not responsible for this…

My audit score went up again overnight, but is now falling further.
This happens on several nodes, so I can rule out problems on the node side.

This is a Storj network software issue, most likely related to the latest update…
so if anyone is on the old version, let's hear if you are seeing this too…

also tried an apt upgrade in my container and a cold reboot of the container… but no joy.

@LinuxNet i will most likely shut down my nodes if it comes to that, the 80% one is a prime candidate for it…
but i'd rather avoid that… it should keep the node safe though, since we have 12 days of allowed downtime before suspension… not great for the network…

but i can only assume that DQ means DQ, since that's how it has worked in the past… sure, this is a wider issue, but people have lost nodes through no fault of their own… so i can't really assume a DQ will just get fixed.

Let's also try to keep misleading info to a minimum. I believe you said elsewhere that the audit score drops on other satellites had to do with a migration you did. As of now I don't have any reason to think the delete issue is more widespread than ap1. Please look up logs for other satellites if you think they're having the same issue. Also, this topic was originally about a different issue on us2. For the issues with ap1 deletes, it's best to stick to this topic to keep everything in one place.

1 Like

yeah that's right, i haven't gotten around to correcting it… i remembered after i had written it here.
deleted it for simplicity's sake.

1 Like

Started approx 7/22 15:40

A post was merged into an existing topic: Audit falling on eu1

Have this issue on some of my nodes. AP1 and US2 are affected. Saving log files on all nodes now.
In some cases I wasn’t able to see any deletes in logs, but I refuse to believe it’s a coincidence.

1 Like

A post was merged into an existing topic: Audit falling on eu1

I might have a related problem on us2 also.

hehe i think we can call this confirmed lol pretty spot on for this node…

one piece seems to be this one

2021-07-24T09:02:48.922Z INFO piecestore download started {"Piece ID": "TMH2O4Q2YG7CQFWFTWDY7C5SMZCVUCKEICW53XI47XXIO2RKH5LA", "Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "Action": "GET_AUDIT"}

can't find anything else in my logs about this piece, but they only go back to the 16th, and my logging export got overloaded when we got the massive delete failures over the last little while…
so they are also full of holes

2021-07-24T09:02:48.937Z ERROR piecestore download failed {"Piece ID": "TMH2O4Q2YG7CQFWFTWDY7C5SMZCVUCKEICW53XI47XXIO2RKH5LA", "Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "Action": "GET_AUDIT", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:73\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:534\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:217\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:102\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:60\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:95\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}

this is weird tho… i cannot seem to find the log entry about the recent audit failure on ap1…

still seeing unexpected audit failures today on us2, but still quite rare…
just wanted to mention it, as i'm not sure whether this issue was assumed to be fixed…

2021-07-26T11:17:54.655Z ERROR piecestore download failed {"Piece ID": "YHPWJD5XNXXMYLCSIKSE557UOZT5PNMVMRMVKZYFNWGGQQJG7MXA", "Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "Action": "GET_AUDIT", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:73\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:534\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:217\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:102\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:60\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:95\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}

and

2021-07-26T11:17:54.818Z ERROR piecestore download failed {"Piece ID": "CY5Z3ID3VLJ4U6ANAWO336ZHXJAPFKIE4H2JCDUXGR6B3SSG3OZQ", "Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "Action": "GET_AUDIT", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:73\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:534\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:217\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:102\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:60\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:95\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}

in case the timestamps would help…
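
for anyone who wants to dig through their own logs for these, here is a rough python sketch that should pull out the GET_AUDIT "file does not exist" failures with their timestamps… it assumes a plain-text export of log lines like the ones above, and the file path is just an example, so adjust it to your setup…

# rough sketch: list GET_AUDIT "file does not exist" failures from a
# plain-text storagenode log export (the path below is just an example)
import json

LOG_FILE = "storagenode.log"  # adjust to wherever your log export lives

with open(LOG_FILE, "r", errors="replace") as f:
    for line in f:
        if "download failed" not in line or "GET_AUDIT" not in line:
            continue
        start = line.find("{")  # the structured fields are JSON at the end of the line
        if start == -1:
            continue
        try:
            fields = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # skip lines with mangled JSON (e.g. smart quotes from copy-paste)
        if "file does not exist" in fields.get("error", ""):
            timestamp = line.split()[0]
            print(timestamp, fields.get("Satellite ID"), fields.get("Piece ID"))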

Failed an audit on piece 6ZXYFYNYUJ… phew, I thought I was missing out!

3 Likes

I'll update the thread when the information becomes available.
No action is needed.

2 Likes

The issue should be resolved now.

10 Likes

This is our hypothesis about what happened.

US2 has used the objects and segments tables (the metabase) from the beginning. During one of the deployments we had an issue where a migration couldn't be executed because the number of entries was too big. Because it was US2, we decided to apply some hot fixes directly to the database and deleted some entries. Most probably during this deletion we created some segments without a corresponding object. We discovered that while working on a different issue, but there was no good moment to start working on removing such orphaned segments.

The problem became visible when we moved auditing to the segment loop after the last update. Earlier, orphaned segments were skipped by the loop because the metaloop processed each object first and then its segments. Now every segment is processed, whether or not it has a corresponding object.

Audits were failing because repair worked the same way, which means the orphaned segments had not been repaired for a long time, most probably a few months.
We fixed this issue too.
On the production satellites this situation was unlikely to happen, though: the orphaned segments were removed a long time ago, and new ones normally cannot be produced with the current code. Or at least they will be processed as normal segments during audit and repair.
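
To illustrate the loop change, here is a toy sketch in Python (not the actual satellite code, and the names are made up): the old metaloop visited each object and then its segments, so a segment without an object was never reached, while the new segment loop visits every segment directly.

# toy illustration of the loop change described above (not the real satellite code)
objects = {"obj-a": ["seg-1", "seg-2"]}    # object -> its segments
segments = ["seg-1", "seg-2", "seg-3"]     # seg-3 has no object (orphaned)

def old_metaloop():
    # walked each object first, then its segments, so an orphaned segment was never visited
    for obj, segs in objects.items():
        yield from segs

def new_segment_loop():
    # walks every segment directly, whether or not an object still points to it
    yield from segments

print(list(old_metaloop()))      # ['seg-1', 'seg-2']
print(list(new_segment_loop()))  # ['seg-1', 'seg-2', 'seg-3'] - the orphan now gets audited (and repaired)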

12 Likes

Thanks for the extensive updates on both audit failure issues discussed here. I appreciate the in-depth feedback!

5 Likes