My node is getting a bad suspension score, and I don't know why

Recently, my node's suspension score has dropped for saltlake.tardigrade.io, and I don't know why. I'm trying to figure out what the suspension score even means, but I cannot find any documentation. A few days ago, the suspension score was 95%. A day later, it was below 80%. Now it is around the 60% suspension breakpoint.

My stats for: saltlake.tardigrade.io:7777

Suspension: 63.1 % Audit: 100 % Online: 100 %

I’ve been using v1.39.5 for the past week and I had no problems before then.

Since yesterday, the suspension score has been declining for the other satellites as well. This is the output of earnings.py:

                        NODE AGE         HELD AMOUNT            REPUTATION                 PAYOUT THIS MONTH
SATELLITE          Joined     Month    Perc     Total       Disq   Susp   Down        Earned        Held      Payout
us1.storj.io	|  2021-06-07     5  |  50%  $   0.93  |   0.00%  5.05%  0.00%  |  $  0.2303   $  0.1151   $  0.1151
    Status: WARNING: Audits failing
us2.storj.io	|  2021-06-07     5  |  50%  $   0.01  |   0.00%  5.79%  2.26%  |  $  0.0035   $  0.0018   $  0.0018
    Status: WARNING: Audits failing
eu1.storj.io	|  2021-06-07     5  |  50%  $   0.21  |   0.00% 11.28%  0.37%  |  $  0.0643   $  0.0322   $  0.0322
    Status: WARNING: Audits failing
ap1.storj.io	|  2021-06-07     5  |  50%  $   0.02  |   0.00%  0.00%  0.00%  |  $  0.0027   $  0.0013   $  0.0013
    Status: 99% Vetting progress (99 / 100 Audits)
europe-north-1	|  2021-06-07     5  |  50%  $   0.01  |   0.00%  0.00%  0.00%  |  $  0.0028   $  0.0014   $  0.0014
    Status: 81% Vetting progress (43 / 100 Audits)
saltlake		|  2021-06-07     5  |  50%  $   1.20  |   0.00% 92.24%  0.00%  |  $  0.1693   $  0.0847   $  0.0847
    Status: WARNING: Audits failing

Looking at the logs, I see no indication of problems, and pinging my node from storjnet.info/ping_my_node works fine. How can I track down what is wrong with my node? It has been online for 5 months without issues before this.

I checked the logs with docker logs -f storj | grep "ERROR" and found a bunch of "download failed" entries. Could these be causing the issue? This is an example of the output:

2021-10-01T16:57:33.416Z	ERROR	piecestore	download failed	
{
    "Piece ID": "<censored>", 
    "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", 
    "Action": "GET", 
    "error": "write tcp 172.19.0.2:28967->45.15.192.34:58268: use of closed network connection", 
    "errorVerbose": "write tcp 172.19.0.2:28967->45.15.192.34:58268: use of closed network connection\n\tstorj.io/drpc/drpcstream.(*Stream).rawWriteLocked:317\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:392\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:317\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func5.1:608\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"
}
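
For context, here is a quick way to see which errors dominate (a rough sketch, assuming the container is named storj as above; in the real logs each entry is a single line):

docker logs storj 2>&1 | grep "download failed" | grep -oE '"error": "[^"]*"' | sort | uniq -c | sort -rn | head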

What I'm worried about is that I'm getting a lot of delete actions recently (3-5 per second with only 370 GB of total stored data), which seems odd to me. Have hackers found a way to request unauthorized delete actions?

INFO	piecedeleter	delete piece sent to trash ...
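
A rough way to double-check the delete rate (a sketch, again assuming the container name storj) is to group the trash-delete lines by their one-second timestamp prefix, since the first 19 characters of every line are the timestamp down to the second:

docker logs storj 2>&1 | grep "delete piece sent to trash" | cut -c1-19 | uniq -c | tail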

Yes, this is what is causing it.

If I were to hazard a guess, your disk cannot keep up with the workload, or you have some sort of network issue. Does it also happen to uploads?

No hackers; people delete stuff, that's just random usage.

Ok, I guess it could be either one of them. I also have a few failing uploads that look like this:

2021-10-03T13:43:07.148Z	ERROR	piecestore	upload failed	
{
    "Piece ID": "QYQI7U4HWCYDAHMQRVZQ3D3UMYSTRIBTS6DE3F3MJ7W2JBJXS7HQ", 
    "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", 
    "Action": "PUT", 
    "error": "unexpected EOF", 
    "errorVerbose": "unexpected EOF\n\tstorj.io/common/rpc/rpcstatus.Error:82\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload:342\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:220\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:104\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:60\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:97\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52", "Size": 163840
}

I think the IO might be too busy. My zfs volume is sitting at ~80-95% IO from storj. Is there something I can do to improve this? Maybe put certain storj directories on an SSD? I would be sad if zfs with HDDs were too slow for storj.

This is not the cause. Please look specifically for failed lines with GET_AUDIT or GET_REPAIR in them.

All log items with GET_AUDIT and GET_REPAIR look like this:

2021-10-08T18:23:58.405Z	INFO	piecestore	downloaded	{"Piece ID": "BERVTW5E7JXOJMYQJOAS7WZVW2LUS7PPFE4FWWVSG7FW7EZVRVVA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_AUDIT"}
2021-10-08T16:15:22.200Z	INFO	piecestore	downloaded	{"Piece ID": "KMCEJOFG55UD652TGATY36ZPMLRLMJQ6Y4UVNLMEAGFUZLDKAH5A", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_REPAIR"}

No ERROR log lines. Just INFO log lines.

Please use the info here to find audit failures. You're not going to find them by searching manually.

I don’t search the logs manually. I used docker logs storj | grep "GET_REPAIR".
If I do docker logs storj 2>&1 | grep -E "GET_AUDIT|GET_REPAIR" | grep failed as mentioned in the docs you sent, I get no hits.

If I count the audits as mentioned in the docs like this:

> docker logs storj 2>&1 | grep -E "GET_AUDIT|GET_REPAIR" | grep started -c
524
> docker logs storj 2>&1 | grep -E "GET_AUDIT|GET_REPAIR" | grep downloaded -c
524

So all audits are downloaded successfully.

Perhaps you did not configure storing the logs to a file, and after the container is deleted the old logs are gone altogether.

The suspension score drops when your node answers audit requests but provides an error instead of the piece.

I would like to ask you to search for "blobs" in your logs:

docker logs storagenode 2>&1 | grep blobs | tail

Perhaps you did not configure storing the logs to a file, and after the container is deleted the old logs are gone altogether.

Yes, I did not set up log storage; I'll do that now. All logs older than a week (7 days) are probably lost, since that is the age of the container.
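
For reference, this is roughly what the change looks like (a minimal sketch, assuming a Docker node with the storage directory mounted at /app/config inside the container; the exact file path is your choice). In config.yaml, change the log output from the default stderr to a file on a mounted volume:

log.output: "/app/config/node.log"

The container has to be restarted for the new log destination to take effect.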

docker logs storj 2>&1 | grep blobs | tail

This command returns nothing.

I would like to ask you to stop the container, unmount the disk, and check it with sudo fsck -f /dev/sdX, where /dev/sdX is your disk with the data.
If the command returns an error or reports that the filesystem was modified, run it again until all errors are fixed.
After that, mount the disk back and start the container.
If the data is OK, the suspension score should return to 100% on your dashboard after a while.

I'm running it on a zfs filesystem, so I'm not sure fsck is the right tool for checking a zfs filesystem for errors. zpool status reports no errors, but I started a scrub just to be sure.
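
A minimal sketch of that workflow, assuming the pool is named tank:

sudo zpool scrub tank      # start a full scrub of the pool
zpool status -v tank       # check scrub progress and any read/write/checksum errors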

I see. There is no fsck for zfs because, as its developers put it,

our filesystem is always correct.

The suspension score does not agree, though.
Could you please show the result of the command

free

This is the output of free

              total        used        free      shared  buff/cache   available
Mem:        1970924      956976      650832        4616      363116      931044
Swap:        985460       78868      906592

The swap is not drive-based. It is zram, which compresses memory pages and keeps them in RAM (in case you think swapping might be slowing my machine down).

Interesting setup.
I would not use zfs if I had only 2 GB of RAM.
This is not optimal, to put it mildly. Zram does not make it better.
How many disks do you have? And how large are they?

df -H

FYI zfs - FreeNAS, Do I need 1GB per TB of usable storage, or 1GB of memory per TB of physical disc? - Server Fault.

I've heard people say that zfs requires a lot of RAM, but I've been using it like this for over 2 years with no problems (no data loss and no latency issues). However, my usage before storj might have been less IO-dependent. The setup is 4 disks in raidz1 (one parity disk). Each disk is 2.7 TB.

Filesystem      Size  Used Avail Use% Mounted on
tank            7.7T  6.0T  1.8T  77% /tank

I'm starting to think that the problem might be disk IO (too many small IO operations) or network-related (blocked by my ISP/router firewall). iotop reports 80-95% IO usage for storj. Maybe my setup is just a tiny bit too slow for storj.
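
One way to watch this over time is zfs's own per-vdev statistics (a sketch, assuming the pool is named tank), refreshed every 5 seconds:

zpool iostat -v tank 5

If the operation counts are high while the bandwidth stays low, the load is dominated by small random IO rather than throughput.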

Exactly. Each piece is around 2 MB. This is not optimal for the default zfs record size.
@SGC can explain better
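
For illustration, the relevant knob looks roughly like this (the ZFS default recordsize is 128K, changing it only affects newly written blocks, and whether a larger value actually helps a storagenode workload is debated):

zfs get recordsize tank          # check the current value
sudo zfs set recordsize=1M tank  # hypothetical tuning for the dataset holding the node data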

For a 7 TB disk you need at least 7 GB of RAM. This is not a requirement, this is common practice.
Perhaps your IOPS dropped to almost zero at some point and your node returned some unknown error on an audit request, and did that several times in a row. As a result, your suspension score dropped too fast.
The next stage is the road to disqualification, because suspension will not protect you from DQ forever.
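
A quick way to see how much cache ZFS actually gets to work with (a sketch; the arcstats file is standard on ZFS on Linux):

awk '$1 == "size" || $1 == "c_max" {printf "%s %.0f MiB\n", $1, $3/1048576}' /proc/spl/kstat/zfs/arcstats

With ~2 GB of total RAM the ARC is necessarily tiny, so most metadata lookups end up hitting the disks directly.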

Even if this is the case, why can't I see an error in the logs? One of two things must be happening:

  1. ZFS returns incorrect data.
  2. IO takes too long and the audit times out.

In either case, this should be visible in the logs somehow. Or am I mistaken?

Because the re-creation of the container reset the visible state for a while. I believe that if you monitor your logs more closely until the next release, you will see these errors, especially if you set up storing the logs to a file.

This is relatively easy to check; try this script: Got Disqualified from saltlake - #97 by Alexey
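
(For readers without access to that thread, the idea can be approximated like this - a rough sketch assuming GNU awk and a container named storj, not the actual script from the link - by pairing "download started" and "downloaded" GET_AUDIT lines on the same Piece ID and printing the elapsed time:)

docker logs storj 2>&1 | grep GET_AUDIT | gawk '
{
  # extract the piece ID from the JSON payload
  match($0, /"Piece ID": "[^"]+"/)
  id = substr($0, RSTART + 13, RLENGTH - 14)
  # convert the leading ISO timestamp to epoch seconds plus milliseconds
  # (the local-time offset of mktime cancels out in the difference)
  ts = $1; gsub(/[-:TZ]/, " ", ts)
  t = mktime(substr(ts, 1, 19)) + ("0." substr($1, 21, 3))
  if ($0 ~ /download started/) { start[id] = t }
  else if ($0 ~ /downloaded/ && id in start) { printf "%s %.3f s\n", id, t - start[id] }
}'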

Hi, thanks for helping out.
I cannot post more today because there is a limit on how many posts I can make during my first day on the storj forum.

I did run your script, and the mean time for GET_AUDIT actions seems to be around 200-400 milliseconds for the past 7 days (the age of the container). I'll continue to monitor the logs and post any updates in the days to come.
