Node suspended from satellite.stefan-benten.de:7777

I see my node has been suspended from satellite.stefan-benten.de:7777 as of Sat, 09 May 2020 18:35:56 GMT.

I'm still connected to the other satellites. If I check the one I've been suspended on, my audit check is only 58.8% while uptime is 99.9%. The other satellites are all at 100% or close to it.

Can’t find anything helpful in the logs, or I’m not looking properly. Any ideas?

You can check your audit score using this post.

Secondly, check your log for "download failed" entries with the GET_AUDIT action.
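For example, here is a minimal sketch of how you could tally why audits are failing; the log path is an assumption, so point it at wherever your node writes its log (e.g. the file you redirect the docker logs into):

# Minimal sketch: scan a storagenode log for failed audit downloads and
# count the error reasons (e.g. "file does not exist").
from collections import Counter

LOG_PATH = "storagenode.log"  # hypothetical path; adjust to your setup

reasons = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        if "GET_AUDIT" in line and "download failed" in line:
            # Pull out the short error message from the JSON-ish log line
            marker = '"error": "'
            start = line.find(marker)
            if start != -1:
                start += len(marker)
                end = line.find('"', start)
                reasons[line[start:end]] += 1

for reason, count in reasons.most_common():
    print(f"{count:6d}  {reason}")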

The 1.3.3 version of the storagenode shows the audit score instead of audit checks, but still as a percentage.
If your audit check on the dashboard is below 60, your node is not only suspended, it's disqualified on that satellite.
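Going by the numbers the script prints, the score appears to be alpha / (alpha + beta); here is a quick sanity check against the stefan-benten entry in the output posted further down (an assumption based on that output, not the satellite's actual code):

# Assuming score = alpha / (alpha + beta), which matches the script output
# posted later in this thread for the stefan-benten satellite.
alpha = 11.750656187641185
beta = 8.249343812358802

score = alpha / (alpha + beta)
print(f"{score:.4f}")  # 0.5875 -> shown as 58.8% audit on the dashboard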

I would suggest you check the reason for the failing audits.


Thanks for the quick responses. Not really sure why this is happening since nothing has changed on my end.

Output of the script below.

 "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW"

{
“totalCount”: 7859,
“successCount”: 6921,
“alpha”: 11.750656187641185,
“beta”: 8.249343812358802,
“score”: 0.5875328093820597
}
“1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE”
{
“totalCount”: 3060,
“successCount”: 3060,
“alpha”: 19.99999999999995,
“beta”: 0,
“score”: 1
}
“121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6”
{
“totalCount”: 8811,
“successCount”: 8811,
“alpha”: 19.99999999999995,
“beta”: 0,
“score”: 1
}
“12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”
{
“totalCount”: 14625,
“successCount”: 14564,
“alpha”: 19.999999802846165,
“beta”: 1.97153818037965e-07,
“score”: 0.9999999901423091
}
“12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs”
{
“totalCount”: 12497,
“successCount”: 11325,
“alpha”: 16.951479805557543,
“beta”: 3.0485201944424403,
“score”: 0.8475739902778779
}
“12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB”
{
“totalCount”: 42,
“successCount”: 42,
“alpha”: 17.796337795299877,
“beta”: 0,
“score”: 1
}

And I'm seeing lots of these errors:
2020-05-06T00:00:30.905Z ERROR piecestore download failed {"Piece ID": "2N7RGLDBYSS7JBBZPTD7F26R3PCEYYQA6VSWPZ7E7VOMMEW77R7A", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/pb/pbgrpc.init.0.func3:70\n\tstorj.io/common/rpc/rpcstatus.Wrap:77\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).doDownload:566\n\tstorj.io/storj/storagenode/piecestore.(*drpcEndpoint).Download:471\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:995\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:107\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:66\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:111\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:62\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:99\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}

I don't seem to be getting these errors for the other satellites. The integrity check comes back OK for pieceinfo.db.
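(Roughly like this, assuming the database sits in the node's storage directory; run it with the node stopped:)

# SQLite's built-in integrity check against pieceinfo.db.
# DB_PATH is an assumption -- adjust it to your storage location.
import sqlite3

DB_PATH = "/mnt/storagenode/storage/pieceinfo.db"  # hypothetical path

con = sqlite3.connect(f"file:{DB_PATH}?mode=ro", uri=True)
result = con.execute("PRAGMA integrity_check;").fetchall()
con.close()

print(result)  # [('ok',)] means the database file itself is intact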

That means your node is losing data. The audit score on other satellites is dropping too.
I would suggest stopping the storagenode and checking your disk for errors; perhaps it's dying.

Thanks, any specific test you suggest I run?

Most HDD manufacturers have their own software that gives insight into the health of their drives.
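If you'd rather not install vendor tools, a crude read-only surface scan (similar in spirit to a read-only badblocks pass) can be sketched like this; the device path is an assumption, run it with the node stopped and with enough privileges to read the raw device (Linux only):

# Rough read-only scan of a block device, reporting unreadable chunks.
import os

DEVICE = "/dev/sdb"          # hypothetical device; point it at the node's disk
CHUNK = 4 * 1024 * 1024      # read 4 MiB at a time

bad = 0
fd = os.open(DEVICE, os.O_RDONLY)
try:
    offset = 0
    while True:
        try:
            data = os.pread(fd, CHUNK, offset)
        except OSError:
            bad += 1
            print(f"read error around byte offset {offset}")
            data = b"\0"     # treat as read and move past the bad region
        if not data:
            break            # reached the end of the device
        offset += CHUNK
finally:
    os.close(fd)

print(f"scan finished, {bad} unreadable chunk(s)")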

I can't find anything wrong with my disk, so what are the next steps?
I use this disk for other tasks too and don't have any issues there.

Checking blocks 0 to 1953514583
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found. (0/0/0 errors)

    Overall Status: GOOD
ID# Name                        Value Worst Thres Pretty      Raw            Type    Updates Good Good/Past
  1 raw-read-error-rate         200   200    51   0           0x000000000000 prefail online  yes  yes 
  3 spin-up-time                176   172    21   4.2 s       0x3e1000000000 prefail online  yes  yes 
  4 start-stop-count            100   100     0   19          0x130000000000 old-age online  n/a  n/a 
  5 reallocated-sector-count    200   200   140   0 sectors   0x000000000000 prefail online  yes  yes 
  7 seek-error-rate             100   253     0   0           0x000000000000 old-age online  n/a  n/a 
  9 power-on-hours               86    86     0   1.2 years   0xfc2700000000 old-age online  n/a  n/a 
 10 spin-retry-count            100   253     0   0           0x000000000000 old-age online  n/a  n/a 
 11 calibration-retry-count     100   253     0   0           0x000000000000 old-age online  n/a  n/a 
 12 power-cycle-count           100   100     0   19          0x130000000000 old-age online  n/a  n/a 
192 power-off-retract-count     200   200     0   2           0x020000000000 old-age online  n/a  n/a 
193 load-cycle-count            198   198     0   7420        0xfc1c00000000 old-age online  n/a  n/a 
194 temperature-celsius-2       118   108     0   29.0 C      0x1d0000000000 old-age online  n/a  n/a 
196 reallocated-event-count     200   200     0   0           0x000000000000 old-age online  n/a  n/a 
197 current-pending-sector      200   200     0   0 sectors   0x000000000000 old-age online  n/a  n/a 
198 offline-uncorrectable       100   253     0   0 sectors   0x000000000000 old-age offline n/a  n/a 
199 udma-crc-error-count        200   200     0   0           0x000000000000 old-age online  n/a  n/a 
200 multi-zone-error-rate       100   253     0   0           0x000000000000 old-age offline n/a  n/a

It's pretty hard to say what exactly went wrong. You can check whether the missing files exist on your node; the errors you posted suggest they don't. It could also be that your node temporarily didn't have access to the storage location. It's not always indicative of a permanent problem, but you definitely had a problem at the time of those errors.
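If you want to look for a specific piece from the log, a rough search like this might help; the storage path is an assumption, and the idea that on-disk filenames contain the lowercased piece ID (minus the two-character prefix used for the directory) is my understanding of the blobs layout, so verify it against your own directory:

# Walk the blobs directory and look for a filename matching a piece ID
# taken from a "file does not exist" audit error.
import os

STORAGE_DIR = "/mnt/storagenode/storage/blobs"  # hypothetical path
PIECE_ID = "2N7RGLDBYSS7JBBZPTD7F26R3PCEYYQA6VSWPZ7E7VOMMEW77R7A"

needle = PIECE_ID[2:].lower()  # assumed: first two characters become a subdirectory
hits = []
for root, _dirs, files in os.walk(STORAGE_DIR):
    for name in files:
        if needle in name.lower():
            hits.append(os.path.join(root, name))

print("\n".join(hits) if hits else "piece not found under " + STORAGE_DIR)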

All the files seem to be there; I'm not really sure what the process is now.
Also, my node now seems to be offline.

2020-05-13T12:42:19.753Z	ERROR	nodestats:cache	Get held amount query failed	{"error": "heldamount service error: protocol error: unknown rpc: \"/heldamount.HeldAmount/GetPayment\"; heldamount service error: protocol error: unknown rpc: \"/heldamount.HeldAmount/GetPayment\"; heldamount service error: protocol error: unknown rpc: \"/heldamount.HeldAmount/GetPayment\"; heldamount service error: protocol error: unknown rpc: \"/heldamount.HeldAmount/GetPayment\"; heldamount service error: protocol error: unknown rpc: \"/heldamount.HeldAmount/GetPayment\"; heldamount service error: protocol error: unknown rpc: \"/heldamount.HeldAmount/GetPayment\"", "errorVerbose": "group:\n--- heldamount service error: protocol error: unknown rpc: \"/heldamount.HeldAmount/GetPayment\"\n\tstorj.io/drpc/drpcwire.UnmarshalError:26\n\tstorj.io/drpc/drpcstream.(*Stream).HandlePacket:156\n\tstorj.io/drpc/drpcmanager.(*Manager).manageStreamPackets:313\n--- heldamount service error: protocol error: unknown rpc: \"/heldamount.HeldAmount/GetPayment\"\n\tstorj.io/drpc/drpcwire.UnmarshalError:26\n\tstorj.io/drpc/drpcstream.(*Stream).HandlePacket:156\n\tstorj.io/drpc/drpcmanager.(*Manager).manageStreamPackets:313\n--- heldamount service error: protocol error: unknown rpc: \"/heldamount.HeldAmount/GetPayment\"\n\tstorj.io/drpc/drpcwire.UnmarshalError:26\n\tstorj.io/drpc/drpcstream.(*Stream).HandlePacket:156\n\tstorj.io/drpc/drpcmanager.(*Manager).manageStreamPackets:313\n--- heldamount service error: protocol error: unknown rpc: \"/heldamount.HeldAmount/GetPayment\"\n\tstorj.io/drpc/drpcwire.UnmarshalError:26\n\tstorj.io/drpc/drpcstream.(*Stream).HandlePacket:156\n\tstorj.io/drpc/drpcmanager.(*Manager).manageStreamPackets:313\n--- heldamount service error: protocol error: unknown rpc: \"/heldamount.HeldAmount/GetPayment\"\n\tstorj.io/drpc/drpcwire.UnmarshalError:26\n\tstorj.io/drpc/drpcstream.(*Stream).HandlePacket:156\n\tstorj.io/drpc/drpcmanager.(*Manager).manageStreamPackets:313\n--- heldamount service error: protocol error: unknown rpc: \"/heldamount.HeldAmount/GetPayment\"\n\tstorj.io/drpc/drpcwire.UnmarshalError:26\n\tstorj.io/drpc/drpcstream.(*Stream).HandlePacket:156\n\tstorj.io/drpc/drpcmanager.(*Manager).manageStreamPackets:313"}

If I go through my dashboard I can’t find an issue with other audit checks. This is quite frustrating since I’ve been running my node for almost a year now. Is a disqualification permanent?

I also did not receive an email about being suspended.

satellite.stefan-benten.de:7777
99.9% uptime 58.8% audit

saltlake.tardigrade.io:7777
99.1% uptime 100% audit

asia-east-1.tardigrade.io:7777
99.7% uptime 100% audit

us-central-1.tardigrade.io:7777
99.7% uptime 100% audit

europe-west-1.tardigrade.io:7777
99.6% uptime 93.7% audit

europe-north-1.tardigrade.io:7777
92.6% uptime 100% audit

That heldamount error is not an issue on your end. You can ignore it.

In that case it must have been a temporary issue that caused your node to be unable to read files.

Unfortunately, yes. But the stefan-benten satellite seems to be on the way out. Testing is moving to saltlake anyway, so you may not be missing out on much, and the other satellites still seem to be fine.

Hi, I have a problem with this too.

It looks like this, and the node is offline.

That satellite is no longer online. You don’t have a problem if everything else looks fine. You can ignore the scores for this satellite.


10 posts were split to a new topic: My suspension score for two satellites is 99.98 and 99.99% for us-central-1 and asia-east respectively