Node Suspension

This is what my disk utilization looks like:

Doesn’t look too bad overall. But the 3 drives from my RAID array seem to be quite heavily utilized…

Upon further enquiry, the times of the suspension seem to correlate with periods of high read activity on the volume:

Your Storage Node on the us-central-1 Satellite was suspended. You were suspended on 2020-04-23 at 01:02 UTC.

Your Storage Node on the asia-east-1 Satellite was suspended. You were suspended on 2020-04-23 at 06:06 UTC.

Note that I am located in Germany, so local time here is UTC+2.

1 Like

SHR and SMR are completely different things. I use an SHR array as well. It’s just a bunch of mdadm RAID5 or RAID6 arrays tied together using LVM. So performance is similar to those solutions.

4 Likes

Node: 1mB1fgDmophpSY2BEutEPVR9S4pZ6LC7P4Q1gGpEWbpE3uGhmg

Yesterday I received 2 emails saying my node was suspended on 2 satellites. I checked my node and did not find any issue: 100% audit rate. I restarted the whole thing and updated the OS in one shot, then hours later got 2 more emails.

So far it is suspended on:

europe-west-1
us-central-1
saltlake
asia-east-1

Any clue, please? I saw it could be a locked database issue; how can I check for that in the logs?
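
(A minimal sketch of one way to check, for a Docker-based node; the container name "storagenode" is an assumption, adjust it to your setup:)

```bash
# Search the node's log for lock errors and show the most recent hits.
docker logs storagenode 2>&1 | grep "database is locked" | tail -n 20
```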

1 Like

But the WD Red Pro is not SMR.

1 Like

OK, found many occurrences of this :)
2020-04-23T10:23:40.726878176Z 2020-04-23T10:23:40.726Z ERROR piecestore failed to add bandwidth usage {"error": "bandwidthdb error: database is locked", "errorVerbose": "bandwidthdb error: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(*bandwidthDB).Add:59\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).saveOrder:721\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).doUpload.func5:346\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).doUpload:362\n\tstorj.io/storj/storagenode/piecestore.(*drpcEndpoint).Upload:215\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:987\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:107\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:105\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:56\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:93\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}

My 4TB HDD is connected over USB 2.0.
Is there a way I can move the databases to my system SSD drive? (Ubuntu)
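
One approach that gets suggested for this (a sketch only, not official guidance; all paths and the storage2.database-dir option are assumptions you should verify against your node version's config.yaml): stop the node, copy the SQLite *.db files to the SSD, and point the node at the new directory.

```bash
# Sketch: move the storage node databases to an SSD path (all paths are placeholders).
docker stop -t 300 storagenode

# Copy the SQLite databases out of the storage directory onto the SSD.
sudo mkdir -p /mnt/ssd/storagenode-db
sudo rsync -a /mnt/hdd/storagenode/storage/*.db /mnt/ssd/storagenode-db/

# For a Docker node the SSD path must also be mounted into the container, e.g. add
#   --mount type=bind,source=/mnt/ssd/storagenode-db,destination=/app/dbs
# to your usual "docker run" command and point the node at it with the (assumed)
#   --storage2.database-dir=/app/dbs
# option, then recreate the container with those additions.
```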

Just noticed in my emails from overnight that Asia also triggered a suspension email. Timing is the same as others here.
No errors during that time in the logs. Audits are 100.0%. No indication of an issue on the dashboard.

I do see ERRORs for "database is locked" sprinkled around, but the times don't line up at all. Considering they just implemented these suspensions yesterday, and given all of these seemingly false positives reported here, I think Storj has some work to do on the triggers.
From a feedback perspective, the email itself should include a support link that tells SNOs how to identify the symptoms and the common resolutions for different scenarios. Also, a notification when a node comes out of suspension is needed.

Node ID: 12UzxJS8iFiYuL7AqpSE7tmjkp8B4aCNopjNZ5yrQP3TBymgCzc
Your Storage Node on the asia-east-1 Satellite was suspended. You were suspended on 2020-04-23 at 06:06 UTC.

1 Like

Looks like I have one too. 1UGvhMLgbvhDgauKayZDogrvect5Ekv3C4zxXMEUdipSvFJtHy, eu-west-1
162 “database is locked” errors, 130 of which are “failed to add order”. 0 failed audits in the node logs; the audit score in the SNO dashboard is, however, under 100% for all satellites.
Seagate SMR drive - ST8000AS0002.

No other suspensions so far, even though I have more nodes on the same drives; one of the nodes is now under extremely heavy I/O load.
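
For reference, counts like the ones above can be produced with something along these lines (a sketch; the log path is a placeholder, and for a Docker node you would pipe `docker logs <container> 2>&1` instead of reading a file):

```bash
# Count total lock errors, then the "failed to add order" subset.
grep -c "database is locked" /var/log/storagenode.log
grep "database is locked" /var/log/storagenode.log | grep -c "failed to add order"
```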

Hey everyone, we are sorry for all of the confusion with these suspension mode emails. Your nodes are not going to be DQ’ed because of this. We prematurely set up automatic emails for suspension mode; these emails have now been paused, so you should no longer be receiving them, and we will enable them again once we are able to fix the known issues.

The issue is that in the storage node logs we are missing some error entries that tell you whether you failed an audit or not. Suspension mode is currently triggering too rapidly, so nodes are going in and out of suspension mode within minutes. Don’t worry, your node will NOT be DQ’ed because of this. We also found an issue with the storage node database locking up, which we are investigating.

Thank you for all of your feedback on this topic. This feedback has been tremendously helpful for us in tracking down and resolving the last issues with this feature!

23 Likes

Is it possible to have those kinds of logs in a separate file? Having a single log with all the traffic (valid or not) and the system errors is not really usable if you are aiming for long uptimes on a big node.

Currently I have to stop and restart the service every 2 weeks because the log grows bigger than 2 GB, which I’m not comfortable with (and it’s a waste of space).

Having a storagenode_traffic.log and a storagenode_system.log would be much more logical and easier to investigate.
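
Until something like that exists, a rough workaround is to split an existing log after the fact by log level (a sketch; the level strings are assumptions about the log format, adjust them to your own log):

```bash
# Rough after-the-fact split of one log file into traffic vs. system entries by log level.
grep  "INFO"         storagenode.log > storagenode_traffic.log
grep -E "WARN|ERROR" storagenode.log > storagenode_system.log
```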

3 Likes

I also received an email regarding node suspension and grepped the log for related errors; none were found.

I also took a look at the dashboard API and it says audit totalCount: 178, audit successCount: 178, score is 1.
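
(If anyone wants to check the same numbers on their own node, this is roughly how they can be pulled with curl and jq; the default dashboard port 14002 and the /api/sno/satellite/<id> path are assumptions that may differ between node versions:)

```bash
# Pull the audit counters for one satellite from the node's local dashboard API.
SAT_ID="<satellite-id>"   # placeholder: the ID of the satellite you want to inspect
curl -s "http://localhost:14002/api/sno/satellite/${SAT_ID}" | jq '.audit'
```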

As you already mentioned, this could be related to some error not getting logged, but I’m not sure it wouldn’t still get written to the database used for audit errors, where such errors should show up. Or are they not shown/visible in the success count / total count or score variables?

In case you want to go through the logs, the node ID is 1yq49oGxkNRLHUimexB7euadsqH1mhc6CkXJjZnYXe3p7ebUex and the suspended satellite is saltlake.

1 Like

LogRotate is very Linux-centred and I’m on Windows. Having a log file per day instead of a single huge file would also solve the problem, because only the current log file would be locked.

The current implementation is not optimal for Windows users.

1 Like

What does the locking of the log file have to do with nodes being suspended?

Thank you @brandon for the update.
I missed this in the blueprint phase, but I think notifications to the SNO are a big area for improvement, and this will only increase their importance.

I didn’t see my thoughts on notifications on the ideas portal, so I submitted one there. It would be great for the SNO to be able to configure notifications (and which notifications) to the service they prefer. https://ideas.storj.io/ideas/V3-I-191

As long as “storagenode.exe” is running, the “storagenode.log” file is locked and you can’t delete it (it says it is open by another process). You have to stop the service to delete or rename the file yourself.

If at the end of the day “storagenode.exe” stopped writing to “storagenode.log”, renamed it to “storagenode-20200432.log” and started writing a new “storagenode.log”, it would release the previous day’s log and you could delete it without stopping the service.

Actually there is another solution: when you open a file through the Windows API you choose whether access is exclusive or shared. If it’s exclusive, only your process can touch it; that’s safer, but for logs it’s not always the best option. If you open it with sharing allowed, other processes can write to it and/or delete it…

1 Like

I would still keep logs if you can afford some space on your HDD. They are critical for finding bugs and save a lot of troubleshooting time. I agree it’s annoying to stop and rename the log file, but it’s worth the trouble. You can transfer that file elsewhere, but keeping it has more pros than cons.

I’m trying to figure out why one of my nodes didn’t get an email, though I thought for sure it would…

Maybe that node doesn’t have issues. Suspension isn’t contagious.

1 Like