Node suspended due to 'errors too often during audits' but shows no audit errors

As the title says, my node was suddenly suspended on us1.storj.io:7777. The reason given was ‘it produced errors too often during audits’.

The node dashboard shows all audits at 100% and the lowest uptime is 99.79%. Nothing has changed on the hardware or software of this device in over a year, aside from routine security updates. How would I go about fixing this when so little information about these errors is given?

I have examined the logs and they look normal.

This node is running in a Docker container on a NAS device, FWIW.

drops in suspension score usually happen due to failed audits. The reason this doesn't show up in the audit score is that failed audits are repeated later, so failing once won't lower the audit score if the retry succeeds.

this can happen due to disk latency or similar issues which make the node unresponsive or too slow to respond in time.

docker logs storagenode 2>&1 | grep GET_AUDIT | grep failed | grep -v open

run that and post your results

[~] # docker ps
CONTAINER ID   IMAGE                          COMMAND         CREATED        STATUS       PORTS                                                      NAMES
6ebccda8985d   storjlabs/storagenode:latest   "/entrypoint"   4 months ago   Up 10 days   192.168.86.49:14002->14002/tcp, 0.0.0.0:29003->28967/tcp   storagenode-temporary-1626646271
[~] # docker logs storagenode-temporary-1626646271 2>&1 | grep GET_AUDIT | grep failed | grep -v open
2021-11-07T10:27:14.625Z        ERROR   piecestore      download failed {"Piece ID": "PBXPA3BGXFUO2JORZAELZDW5HNUNC7T7IQMEAF2PVF54JY3HZ5CA", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "GET_AUDIT", "error": "order created too long ago: OrderCreation 2021-11-07 09:27:14.489451462 +0000 UTC < SystemClock 2021-11-07 10:27:14.625195959 +0000 UTC m=+849979.274835386", "errorVerbose": "order created too long ago: OrderCreation 2021-11-07 09:27:14.489451462 +0000 UTC < SystemClock 2021-11-07 10:27:14.625195959 +0000 UTC m=+849979.274835386\n\tstorj.io/common/rpc/rpcstatus.Errorf:87\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).verifyOrderLimit:45\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:464\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:217\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:111\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:62\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:99\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}
2021-11-07T14:25:19.686Z        ERROR   piecestore      download failed {"Piece ID": "ZZQSV4YHCIM7I7EI5TU2PHUP7TQUGVSZFD4OY4WSRL53PPFSEMWA", "Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "Action": "GET_AUDIT", "error": "order created too long ago: OrderCreation 2021-11-07 13:25:19.597407516 +0000 UTC < SystemClock 2021-11-07 14:25:19.686623519 +0000 UTC m=+864264.336262906", "errorVerbose": "order created too long ago: OrderCreation 2021-11-07 13:25:19.597407516 +0000 UTC < SystemClock 2021-11-07 14:25:19.686623519 +0000 UTC m=+864264.336262906\n\tstorj.io/common/rpc/rpcstatus.Errorf:87\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).verifyOrderLimit:45\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:464\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:217\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:111\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:62\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:99\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}
[~] #

yeah, it looks like your node might be having trouble keeping up. Are you seeing high iowait / disk backlog?
And what disk model / storage setup are you using? SMR HDDs can often have trouble keeping up with the rather extensive write IOPS of a storagenode, which can lead to stuff like this.
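(a quick way to check, assuming sysstat's iostat is available on the NAS; firmware builds vary, so the exact flags that work may differ:)

iostat -x 5 3
# watch %iowait in the avg-cpu line, and the await / queue-size / %util columns for the array's devices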

Forgive me for saying so, but I have trouble believing it's a data throughput issue. The unit is a QNAP TS-1635AX, and the data is being written to a RAID array of six 8TB Seagate IronWolf drives. I can pull the specific model number if needed, but as far as I know there are no IronWolf SMR models. Disk usage is very low; this node has hardly any data yet, about 200GB.

I should note that I have another node which has been operating continuously for more than a year. It uses an ancient WD USB external enclosure as its storage and shows zero audit errors. Same connection, same network, older hardware and lower throughput.

throughput and IO aren't exactly the same thing. Disks in a RAID work as one HDD, so your array's IOPS will be about the same as a single disk's, at least for the tons of small IO a storagenode creates.

but if your disk backlog and iowait look fine, then I guess that's not it… usually when downloads fail it's latency related… though an hour seems rather extreme.

this basically says it waited for an hour before timing out…
and because you didn't fail this same audit the second time it was attempted, it must have been a success, or else your audit score would have dropped…

it's not like your system clock is wrong or something weird like that… I'm not sure the node would even work with a wrong clock…
but it isn't too long since we switched to winter time… so maybe… I think there were a couple of people having issues with the switch to winter time… I don't recall the details though…

so maybe check that the system clock on the NAS is correct
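(one quick way to check for drift, assuming ntpdate is available on the NAS; the -q flag only queries an NTP server and won't change the clock:)

date -u
ntpdate -q pool.ntp.org
# the reported offset should be close to zero; an offset near 3600 seconds would match the one-hour gap in those audit errors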

I suppose it could be related to the daylight saving time change. Would that be enough to drop a node into suspension on its own? That seems harsh.

System clock is correct. It resynchronizes daily anyway so it would never be incorrect for more than 24 hours.

when the suspension score drops it's almost always latency related.
Your main problem currently is that once you go back above 60% suspension score for that satellite, you enter a 30-day trial period, and if your node drops below 60% again, that satellite will DQ it.

created 4 months ago… what version are you running?
you should be on v1.41.2,
and if you aren't, you should update the node immediately and also install watchtower to take care of automatic updates (example command below)

not sure if an old version could make your suspension score drop, but it's possible that's why this is happening
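(the watchtower setup from the Storj documentation looks roughly like this; double-check the current docs for the exact flags, and list your own container name after the watchtower image instead of the default "storagenode":)

docker run -d --restart=always --name watchtower -v /var/run/docker.sock:/var/run/docker.sock containrrr/watchtower storagenode watchtower --stop-timeout 300s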

I'm not terribly worried about the node getting disqualified; it isn't utilized much. My objective is to understand what happened. I won't miss the penny per month I earn from it. (Not kidding.)

I believe it is up to date but I will check it and get watchtower running.

well, the docker container is recreated on updates… so it shouldn't be possible to be on a new version and still be 4 months old.

you can see the node version in the web dashboard.

being more than 2 versions behind will also give the node zero ingress
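(if the web dashboard isn't handy, the running version is also printed in the container's startup logs, so a rough check like this should surface it; the exact log wording may differ between versions:)

docker logs storagenode-temporary-1626646271 2>&1 | grep -i version | tail -n 5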

It is up to date, apparently, but I'm setting up watchtower now.

I'll mark your conjecture about the time change being the cause as the answer for now; I'm not sure what else would have caused it, honestly. I certainly don't have an hour-plus of latency.

Thank you for your help!

Update: All I did was restart the container (same version and image, same data), and now the node is unsuspended and showing lots of ingress. Maybe a problem with the container itself? Who knows.

sometimes stuff is just weird. Keep an eye on the suspension score and make sure it keeps going up; I forget how long that can take, but it's days, if not closer to 30 days total, to reach 100%.
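(if you want to watch the score without opening the dashboard, you can poll the node's local dashboard API, which is what the web dashboard itself uses; a sketch below, using the IP and port from your docker ps output, though the exact endpoint and JSON field names may vary between node versions:)

curl -s http://192.168.86.49:14002/api/sno/satellites
# pipe through jq or python -m json.tool, if installed, to make the JSON readable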

Will keep watching it. It's already back up to 66.43%.


Great idea, I'll do that right away. I did do a manual resynchronization before I restarted the container, so that makes sense.

Edit: It looks like the most frequent sync interval I can set via the device's UI is 1 hour. I'll see if I can set up a cron job or something to sync more often.
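(for reference, a cron entry for this might look roughly like the line below, assuming ntpdate is present and the crontab is editable on the QNAP; the binary's path and how cron entries persist vary by firmware:)

*/15 * * * * /usr/sbin/ntpdate -s pool.ntp.org
# every 15 minutes, sync the clock against the NTP pool and send the result to syslog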

As an update, the suspension score after 1 day is now back up to 93.82%, so recovery was very, very fast.


After several more hours of operation, it's at 97.28% and out of danger. Just including this for anyone interested; apparently nodes recover extremely quickly from time drift.



A day later we're at 99.91%. Remarkably fast recovery. I'll stop updating at this point; it looks like this is solved. I was just reporting the numbers in case someone was curious about the recovery time. If anything further develops I'll revisit this. Thanks all for the assistance!