This is a Storj network software issue, most likely related to the latest update…
so if anyone is on the old version, let's hear if you are seeing this…
I also tried an apt upgrade of my container and a cold reboot of the container… but no joy.
@LinuxNet I will most likely shut down my nodes if it comes to that, the 80% one is a prime candidate for it…
but I'd rather avoid that… it should keep the nodes safe, though, since we have 12 days of allowed downtime before suspension… not great for the network…
but I can only assume that DQ means DQ, since that's how it has gone in the past… sure, this is a wider issue, but people have lost nodes through no fault of their own… so I can't really assume a DQ will just get reversed.
Let’s also try to keep misleading info to a minimum. I believe you said elsewhere that the audit score drops on other satellites had to do with a migration you did. As of now I don’t have any reason to think the delete issue is more widespread than ap1. Please look up the logs for other satellites if you think they’re having the same issue. Also, this topic was originally about a different issue on us2. For the issues with ap1 deletes, it’s best to stick to this topic to keep everything in one place.
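For anyone who wants to check whether other satellites show the same pattern, here's a small helper for a saved node log (e.g. from `docker logs storagenode 2>&1 > node.log`). The function name and log-layout assumptions are mine, not official tooling:

```shell
# audit_fails_by_sat: count GET_AUDIT "download failed" lines per satellite
# in a saved node log file. Assumes the standard JSON-ish log format shown
# in this thread; the helper itself is just a sketch, not official tooling.
audit_fails_by_sat() {
  grep 'GET_AUDIT' "$1" | grep 'download failed' \
    | grep -o '"Satellite ID": "[^"]*"' \
    | sort | uniq -c | sort -rn
}
```

Running e.g. `audit_fails_by_sat node.log` prints a failure count per satellite ID, which makes it easy to see whether only ap1 is failing audits or others are too.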
Have this issue on some of my nodes. AP1 and US2 are affected. Saving log files on all nodes now.
In some cases I wasn’t able to see any deletes in logs, but I refuse to believe it’s a coincidence.
2021-07-24T09:02:48.922Z INFO piecestore download started {"Piece ID": "TMH2O4Q2YG7CQFWFTWDY7C5SMZCVUCKEICW53XI47XXIO2RKH5LA", "Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "Action": "GET_AUDIT"}
I can’t find anything else in my logs about this piece, but they only go back to the 16th, and my logging export got overloaded during the massive delete failures over the last little while…
so the logs are also full of holes
2021-07-24T09:02:48.937Z ERROR piecestore download failed {"Piece ID": "TMH2O4Q2YG7CQFWFTWDY7C5SMZCVUCKEICW53XI47XXIO2RKH5LA", "Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "Action": "GET_AUDIT", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:73\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:534\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:217\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:102\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:60\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:95\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}
this is weird though… I can’t seem to find the log entry for the recent audit failure on ap1…
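When logs have been rotated, a plain grep of the current file misses older entries. A small sketch (the function name is mine; adjust the log paths to your own setup) that searches plain and gzipped rotated logs for one piece ID and prints matches in time order:

```shell
# trace_piece: search current and rotated (gzipped) logs for one piece ID
# and print the matches in time order. zgrep reads both plain and .gz
# files; sorting works because the ISO timestamps sort lexically.
trace_piece() {
  local piece="$1"; shift
  zgrep -h -e "$piece" "$@" 2>/dev/null | sort
}
```

For example: `trace_piece TMH2O4Q2YG7CQFWFTWDY7C5SMZCVUCKEICW53XI47XXIO2RKH5LA node.log node.log.*.gz` shows whether the piece was ever uploaded, deleted, or audited before the failure.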
US2 used the objects and segments tables (the metabase) from the beginning. During one of the deployments, we had an issue where a migration couldn’t be executed because the number of entries was too big. Because it was US2, we decided to apply some hot fixes directly to the database and deleted some entries. Most probably, during this deletion we created some segments without a corresponding object. We discovered that while working on a different issue, but there was no good moment to start removing such orphaned segments.
The problem became visible when we moved the audit to the segment loop after the last update. Earlier, orphaned segments were ignored during the loop because the metaloop processed each object first and then its segments. Now each segment is processed, no matter whether it has an object or not.
Audits were failing because repair worked the same way, which means orphaned segments had not been repaired for a long time, most probably a few months.
We fixed this issue too.
On the production satellites this situation is unlikely to happen, though: the orphaned segments were removed a long time ago, and new ones normally cannot be produced with the current code. Or at least they will be processed as normal segments during audit and repair.
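To make "orphaned segment" concrete: it's a row in the segments table whose stream_id has no matching row in objects. A toy sketch using sqlite3, with a deliberately simplified, hypothetical schema (not the real metabase schema):

```shell
# Toy illustration of an orphaned segment: a segments row whose stream_id
# has no matching objects row. Table and column names here are hypothetical
# stand-ins, not the actual metabase schema.
sqlite3 :memory: <<'SQL'
CREATE TABLE objects  (stream_id TEXT PRIMARY KEY);
CREATE TABLE segments (stream_id TEXT, position INTEGER);
INSERT INTO objects  VALUES ('stream-a');
INSERT INTO segments VALUES ('stream-a', 0);   -- normal: object exists
INSERT INTO segments VALUES ('stream-b', 0);   -- orphan: object was deleted
SELECT s.stream_id
FROM segments s LEFT JOIN objects o ON o.stream_id = s.stream_id
WHERE o.stream_id IS NULL;
SQL
```

The old metaloop never visited 'stream-b' because it only walked segments reachable from an object; the new segment loop visits it directly, which is why the orphans suddenly started being audited.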