Recently, my node's suspension score has dropped for saltlake.tardigrade.io, and I don't know why. I'm trying to figure out what the suspension score even means, but I cannot find any documentation. A few days ago the suspension score was 95%. A day later it was below 80%. Now it is around the 60% suspension threshold.
Looking at the logs, I see no indication of problems. Pinging my node from storjnet.info/ping_my_node works without problems. How can I find out what is wrong with my node? It had been online for 5 months without issues before this.
I did, and found a bunch of "download failed" errors. Could this be causing the issue? This is an example of the output of docker logs -f storj | grep "ERROR":
2021-10-01T16:57:33.416Z ERROR piecestore download failed
{
  "Piece ID": "<censored>",
  "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE",
  "Action": "GET",
  "error": "write tcp 172.19.0.2:28967->45.15.192.34:58268: use of closed network connection",
  "errorVerbose": "write tcp 172.19.0.2:28967->45.15.192.34:58268: use of closed network connection\n\tstorj.io/drpc/drpcstream.(*Stream).rawWriteLocked:317\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:392\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:317\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func5.1:608\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"
}
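To get a sense of how frequent these errors are, a simple count works (assuming the container is named storj):

docker logs storj 2>&1 | grep -c "download failed"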
What I'm worried about is that I'm getting a lot of delete actions recently (3-5 per second, with only 370GB of total stored data), which seems odd to me. Have hackers found a way to issue unauthorized delete requests?
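One way to check the rate (illustrative; the exact wording of the delete log messages varies between node versions):

docker logs --since 60s storj 2>&1 | grep -ci "delete"
# divide the count by 60 to get deletes per second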
I think it might be that my IO is too busy. My ZFS volume is sitting at ~80-95% IO from storj. Is there something I can do to improve this? Maybe put certain storj directories on an SSD? I would be sad if ZFS with HDDs is too slow for storj.
I didn't search the logs manually; I used docker logs storj | grep "GET_REPAIR".
If I do docker logs storj 2>&1 | grep -E "GET_AUDIT|GET_REPAIR" | grep failed as mentioned in the docs you sent, I get no hits.
If I count the audits as mentioned in the docs like this:
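(roughly along these lines; the exact commands from the docs may differ)

docker logs storj 2>&1 | grep GET_AUDIT | grep -c downloaded          # successful audits
docker logs storj 2>&1 | grep GET_AUDIT | grep -c "download failed"   # failed audits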
I would like to ask you to stop the container, unmount the disk, and check it with sudo fsck -f /dev/sdX, where /dev/sdX is your disk with data.
If the command returns an error or ***** FILE SYSTEM WAS MODIFIED *****, run it again until no more errors are reported.
After that, mount the disk back and start the container.
If the data is OK, the suspension score should return to 100% on your dashboard after a while.
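In other words, something like this (a sketch; the device, mount point, and container name are assumptions):

docker stop -t 300 storj             # allow the node to shut down gracefully
sudo umount /mnt/storagenode         # assumed mount point of the data disk
sudo fsck -f /dev/sdX                # repeat until no errors are reported
sudo mount /dev/sdX /mnt/storagenode
docker start storj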
I'm running it on a ZFS filesystem, so I'm not sure fsck is the right tool for checking a ZFS filesystem for errors. zpool status reports no errors, but I started a scrub just to be sure.
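For reference, the ZFS equivalents (my pool is named tank, as in the df output below):

zpool status -v tank    # reports read/write/checksum errors per device
sudo zpool scrub tank   # full data integrity check, runs in the background
zpool status tank       # run again later to see scrub progress and results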
              total        used        free      shared  buff/cache   available
Mem:        1970924      956976      650832        4616      363116      931044
Swap:        985460       78868      906592
The swap is not drive-based. It is "zram", which compresses memory pages and keeps them in RAM (in case you think swapping might be slowing my machine down).
Interesting setup.
I would not use ZFS if I had only 2GB of RAM.
This is not optimal, to put it mildly. Zram does not make it better.
How many disks do you have? And how large are they?
I've heard people say that ZFS requires a lot of RAM, but I've been using it like this for over 2 years with no problems (no data loss and no latency issues). However, my usage before storj might have been less IO-dependent. The setup is 4 disks in a raidz1 configuration (one parity disk). Each disk is 2.7TB.
Filesystem      Size  Used Avail Use% Mounted on
tank            7.7T  6.0T  1.8T  77% /tank
I'm starting to think that the problem might be disk IO (too many small IO operations) or network-related (blocked by my ISP or router firewall). iotop reports 80-95% IO usage for storj. Maybe my setup is just a tiny bit too slow for storj.
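To watch the per-disk load, something like this works (sketch; the trailing 5 is the sampling interval in seconds):

zpool iostat -v tank 5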
Exactly. Each piece is around 2MB, which is not optimal for the default ZFS recordsize. @SGC can explain better.
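If you want to experiment, the recordsize can be checked and raised per dataset (illustrative; tank/storj is an assumed dataset name, and a new recordsize only applies to newly written files):

zfs get recordsize tank                 # ZFS default is 128K
sudo zfs set recordsize=1M tank/storj   # closer to the ~2MB piece size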
For 7TB of disk you need at least 7GB of RAM. This is not a hard requirement, but it is common practice.
Perhaps your IOPS dropped to almost zero at some point and your node returned an unknown error on an audit request, and did this several times in a row. As a result, your suspension score dropped very fast.
The next stage is disqualification, because suspension will not protect you from DQ forever.
Re-creating the container reset the state for a while. I believe if you monitor your logs more closely until the next release, you will see these errors, especially if you set up log storage to a file.
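To store the logs in a file, something like this (a sketch; it assumes the usual docker setup where the host storage directory is mounted at /app/config):

# add or uncomment this line in config.yaml on the host:
#   log.output: "/app/config/node.log"
docker restart storj   # the log then appears as node.log in the storage directory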
Hi, thanks for helping out.
I cannot post more today because there is a limit on how many posts I can make during my first day on the storj forum.
I did run your script, and the mean time for GET_AUDIT actions seems to be around 200-400 milliseconds over the past 7 days (the age of the container). I'll continue to monitor the logs and post any updates in the days to come.
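For anyone finding this later, a rough way to approximate such a timing measurement from the logs (not the exact script from this thread; it assumes GNU awk and that "download started"/"downloaded" pairs for GET_AUDIT share a Piece ID):

docker logs storj 2>&1 | grep GET_AUDIT | gawk '
function ts(s,    d) {                 # ISO timestamp -> epoch seconds (fractional)
    gsub(/[-:TZ]/, " ", s)             # "2021-10-01T16:57:33.416Z" -> "2021 10 01 16 57 33.416"
    split(s, d, " ")
    return mktime(d[1] " " d[2] " " d[3] " " d[4] " " d[5] " " int(d[6])) + (d[6] - int(d[6]))
}
{
    if (match($0, /"Piece ID": "[^"]*"/))
        pid = substr($0, RSTART + 13, RLENGTH - 14)   # extract the piece ID
    t = ts($1)                                        # timestamp of this line
}
/download started/ { start[pid] = t }
/downloaded/       { if (pid in start) { sum += t - start[pid]; n++; delete start[pid] } }
END { if (n) printf "successful audits: %d  mean: %.0f ms\n", n, (sum / n) * 1000 }'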