Node suspended randomly

No idea what caused this, but I got an email earlier saying my node was suspended. Logged into the server running the node and checked the dashboard to see this. I noticed there was an application update, so I updated, checked that everything was running, and it was. I went to work expecting to come home and see it better, but all my suspension scores are worse now.

First I confirmed the identity to make sure nothing happened to those files; the two checks came back with 2 and 3 respectively, so those are correct and the identity is not the problem. Then into the UniFi controller to check that nothing changed with the port forwarding rules. Nope, those are still correct: port 28967 is being forwarded to the correct IP address of my server running Storj. Next I checked my external IP and made sure the port is open there. Yes it is. I even checked with Cloudflare that the DNS is working properly. From my phone, off wifi through Verizon, I can ping the server with zero issues.

I have no idea what is going on. But I see several other operators are having issues as well, so maybe something is wrong on Storj's side? I skimmed through the active logs and saw a few "satellite cannot ping server" errors, but I have no issues with connections on my end.
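For reference, the external port check above can be repeated from any shell outside the LAN; this is just a sketch, and the hostname is a placeholder for your node's public address:

```shell
# Probe the forwarded storagenode port from outside the LAN.
# node.example.com is a placeholder; replace with your DDNS/public hostname.
nc -z -v -w 5 node.example.com 28967

# Exit code 0 means the TCP port accepted a connection, non-zero means it didn't.
echo "exit code: $?"
```

Note this only tests TCP; the node also uses UDP on the same port for QUIC, which nc -z does not exercise.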

Same here… I saw some error messages related to a lock in the database on my node.

2022-02-04T08:01:18.360Z	ERROR	bandwidth	Could not rollup bandwidth usage	{"error": "bandwidthdb: database is locked", "errorVerbose": "bandwidthdb: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(*bandwidthDB).Rollup:301\n\tstorj.io/storj/storagenode/bandwidth.(*Service).Rollup:53\n\tstorj.io/common/sync2.(*Cycle).Run:152\n\tstorj.io/storj/storagenode/bandwidth.(*Service).Run:45\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:40\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}


I wonder if Storj is the reason. Something is not right: all of a sudden, nothing in the setup changed, yet errors started flying. It prompted me to update the Docker container, and it is still having issues.

Same here. Node restarted automatically.

Just saw a few of those database locked errors in my logs from a few minutes ago. But my suspension scores on 3 of the 5 satellites have gone up slightly.

Something like that?

Error: readdirent config/storage/blobs/6r2fgwqz3manwt4aogq343bfkh2n5vvg4ohqqgggrrunaaaaaaaa/4z: not a directory; readdirent config/storage/blobs/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/2u: not a directory
2022-02-04T10:10:51.038Z        ERROR   db      Unable to read the disk, please verify the disk is not corrupt

No, mine looks like this:

2022-02-04T13:13:23.976Z ERROR piecestore failed to add bandwidth usage {"error": "bandwidthdb: database is locked", "errorVerbose": "bandwidthdb: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(*bandwidthDB).Add:60\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).beginSaveOrder.func1:722\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func6:664\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:685\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:228\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:104\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:60\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:97\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52"}


I think we saw this when the disk was really busy. Is this a SMR disk? Or is it in heavy use by other applications?


My server is running a monthly parity sync, and the Storj data is on an old drive. And yes, the drives holding the Storj data are the few SMR drives I have. I am just waiting until I have a chance to swap them out for CMR drives like every other drive I have.

Could you have a look at your logs for any lines with ERROR and GET_AUDIT or GET_REPAIR in it?

Also are you using a network protocol to access the data? SMB, NFS etc. These aren’t supported and could also lead to lock contention (and unfortunately also corruption) of the databases.

I have been trying to look at the logs for only those lines but am still a noob with the CLI on Linux. I have been pulling docker logs storagenode-v3 to grab them all and then scanning through manually. The actual data is on a different box than the Docker container and is accessed through SMB. But it has been running fine like this for over a month; I am confused why it all of a sudden starts to fail.
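For anyone in the same spot, a couple of grep pipelines that do the filtering asked about above, assuming the container name storagenode-v3 mentioned in this thread:

```shell
# Dump the logs once so each search doesn't re-pull them from Docker.
docker logs storagenode-v3 > node.log 2>&1

# Only the errors that affect audit/suspension scores:
grep ERROR node.log | grep -E 'GET_AUDIT|GET_REPAIR'

# How many "database is locked" errors are in the log:
grep -c 'database is locked' node.log
```

grep -E lets one pattern match either action; drop the file and pipe docker logs directly if you prefer a one-liner.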

I see bandwidth.db locked if my storage cannot keep up.

And I've had issues with random node reboots because I changed my Docker storage driver to the recommended one (overlay2, or whatever it's named) rather than vfs, which it defaults to in my case.

I do get a Docker error about some location that cannot be an upperdir when my Proxmox containers start the storagenode in Docker.

But I'm 2 years in and still haven't seen any harmful effects from using vfs as the Docker storage driver, and I haven't been able to fix the error… and when I tried, all hell broke loose.

Not saying that this is the problem, but those are similar problems I've had.
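For anyone wanting to check this on their own box: the active storage driver is easy to query, and it is configured in the daemon config. The daemon.json snippet in the comments is a sketch; be aware that switching drivers makes existing images and containers invisible until they are recreated under the new driver.

```shell
# Show which storage driver the Docker daemon is currently using
# (prints e.g. "overlay2" or "vfs").
docker info --format '{{.Driver}}'

# To change it, set it in /etc/docker/daemon.json and restart the daemon:
#   { "storage-driver": "overlay2" }
```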


You’ll find instructions here:

Unfortunately network protocols are not supported. They could work, but it is not guaranteed, and as @BrightSilence said - lead to database locks and sometimes even corruption.

The dangerous thing about SMB is that it seems to work… but AFAIK it doesn't.

Eh, it can be made to work if the operator has skills. But it’s somewhat difficult, so it makes sense it’s not an officially supported configuration.

So you have it working… AFAIK there is an incompatibility issue between SMB and database calls in general, and it's a well-known issue, which is why it's basically never recommended anywhere that you run database calls over SMB.

Had. Was stable, but slow. Besides, it was a temporary setup.

Half of the job is to use the hard option at mount. The other half is disabling any kind of caches, both server- and client-side. Makes things very slow, but that's the price to pay to keep the databases consistent.
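For illustration, roughly what that looks like for a CIFS/SMB mount. The share, mount point, and credentials file below are placeholders; hard makes I/O retry instead of erroring out, while cache=none and actimeo=0 disable the client-side page and attribute caches:

```shell
# Hypothetical mount: //nas/storj and /mnt/storj are placeholder paths.
# hard       - block and retry on server stalls rather than fail the I/O
# cache=none - no client-side page caching of file data
# actimeo=0  - no attribute caching (always revalidate with the server)
sudo mount -t cifs //nas/storj /mnt/storj \
    -o hard,cache=none,actimeo=0,credentials=/etc/storj-smb.cred
```

Server-side caching (e.g. oplocks/leases on the share) has to be turned off in the server's own config; the mount options only cover the client half.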

It’s easier now, as you can keep blobs on a different file system than databases. This was not the case at the time.
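The split is done with the storage2.database-dir option; a sketch of a docker run that bind-mounts the databases from a local disk while the blobs stay on the slower data path (all host paths here are placeholders, and other required flags like identity and address are omitted for brevity):

```shell
# Blobs stay on the (slow/remote) data path, databases go to a local disk.
# /mnt/storj and /var/lib/storj-dbs are placeholder host paths.
docker run -d --name storagenode-v3 \
    --mount type=bind,source=/mnt/storj,destination=/app/config \
    --mount type=bind,source=/var/lib/storj-dbs,destination=/app/dbs \
    storjlabs/storagenode:latest \
    --storage2.database-dir=/app/dbs
```

On first use the node creates fresh database files in the new directory, so the existing .db files have to be moved there while the node is stopped.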

Are you sure that would make it entirely reliable? Since SQLite doesn’t support it at all as far as I’m aware.

Heh, please define «entirely reliable».

I’m running a piece of ten year old hardware to host Storj nodes, an HP Microserver Gen7. It’s way out of support already. Its disk controller officially never supported drives bigger than 2TB, whereas I’ve got a couple of 6TB drives inside.

Is it reliable? Well, likely way better than all the overheating Raspberry Pis with fragile power supplies and shared I/O busses with low throughput.

I had that SMB setup for about 3 months and during that time the only problem I found was that it was slow. SQLite doesn’t support it? Well, that only means I didn’t have some other person’s stamp of approval for doing my things.