Docker container stops every day and nodes go offline

Hi,

For weeks now, my docker container storagenode has been stopping once a day and then restarting…
I receive this message by email
“Docker container storagenode stopped unexpectedly. Please select storagenode on the Container page, click the Details button, and go to the Log tab for details.”

I checked the log and I’m no expert, but I do see a lot of errors like this one.

2023-04-11T23:56:27.614Z ERROR piecestore upload failed {“Process”: “storagenode”, “Piece ID”: “WTKYLCT3JO53JLTI6DLPIMQJ2UP2SNVNVSGKC6RL546AI3YZNPUA”, “Satellite ID”: “12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs”, “Action”: “PUT”, “error”: “context canceled”, “errorVerbose”: “context canceled\n\tstorj.io/common/rpc/rpcstatus.Wrap:75\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload.func5:498\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload:529\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:243\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:124\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:114\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35”, “Size”: 399104, “Remote Address”: “172.17.0.1:37728”}

Also, sometimes my nodes go offline and I have to restart my container to bring them back online.

Any ideas?

Thanks

  • OS used?
  • type of HDD used? (SMR or CMR)
  • how does CPU and RAM usage usually look like?
  • your docker command anonymized?


Hi,

OS used?

Synology NAS - DS920+

type of HDD used? (SMR or CMR)

3 HDDs, all CMR

how does CPU and RAM usage usually look like?

I have 20 GB of RAM and usage should be fine. I’m running 20 docker containers, but nothing crazy.

your docker command anonymized?

```
docker run -d --restart unless-stopped --stop-timeout 300 \
    -p 28967:28967/tcp \
    -p 28967:28967/udp \
    -p 14002:14002 \
    -e WALLET="" \
    -e EMAIL="@gmail.com" \
    -e ADDRESS="********.synology.me:28967" \
    -e STORAGE="4TB" \
    --mount type=bind,source="/volume1/docker/storj/identity",destination=/app/identity \
    --mount type=bind,source="/volume1/docker/storj/data",destination=/app/config \
    --name storagenode storjlabs/storagenode:latest
```

Thanks,
pasc

This is not the error that causes the node to crash; only FATAL ones do.
Please search for a FATAL error (not upload/download errors, they will not lead to a crash).


It’s hard to find the error in the log, as there are thousands of lines. And even with the --since or tail options it does not reach the record I need.

Also, why do I get so many "ERROR piecestore upload failed" messages? Is something wrong?

Thanks,
Pascal

Likely because your node was too slow compared to other nodes.
However, you cannot be close to all customers in the world, so some races are impossible to win.
On the other hand, it could be your network or disk too, depending on the remaining part of the error.
From the initial post it is a long-tail cancellation, i.e. your node lost the race.

You can search for the FATAL ones, e.g.

docker logs storagenode 2>&1 | grep FATAL | tail


Didn’t work…

Did you redirect logs to a file?
If so, you need to use a different command (update the path to match yours):

cat /mnt/storj/storagenode/storagenode.log | grep FATAL | tail

If not, please try searching for "stopped":

docker logs storagenode 2>&1 | grep stopped | tail

Did you redirect logs to a file?

I just copy-pasted your commands :laughing:, nothing more.

But you need to update the parameters according to your setup.
Well, try a different one when the node stops again:

sudo docker logs --tail 20 storagenode

and copy the result here between two new lines with three backticks, like this:

```
logs lines here
```

I’m not an expert with Docker and commands, still learning :grinning:. But I really appreciate your efforts to help, Alexey.

So yes, that one works, but in my case it doesn’t help much. Let’s say the fatal error happens at 3 am and I check the logs at 8 am: that’s already 5 hours, and 5 hours is about 10 or 20 thousand lines in the logs.
That’s why I’m trying to find a command that isolates a time window between one hour and another.
Isn’t there anything like this that can do it?
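For reference, the standard Docker CLI can filter logs by time with the --since and --until flags; a quick sketch, with illustrative timestamps:

```
# show only log lines from a specific window (timestamps are illustrative, ISO/RFC3339 format)
docker logs --since 2023-04-16T03:00:00 --until 2023-04-16T04:00:00 storagenode 2>&1 | grep -i error
```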

Thank you,
Pascal

Edit your config.yaml, change "log.level" to "error", save it and restart your node; this will make the storagenode log only errors.
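A minimal sketch of that change in config.yaml (only the relevant key shown):

```
# config.yaml
log.level: error   # log only errors (the default level is info)
```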

If you are using Linux, then when the error occurs you can use "tail -n 100 <your_log_file>" to view the last 100 lines of your log file.

If you are using Docker, you can change log.output in config.yaml to stdout, or comment the line out so it falls back to the default setting, then restart the container. When the error happens you can use the command "docker logs --tail 100 <your_docker_name_or_id>", which also loads the last 100 lines of your container logs.
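A sketch of that log.output change (the commented file path is illustrative):

```
# config.yaml
# log.output: /app/config/node.log   # a redirected-to-file setting would look like this
log.output: stdout                   # send logs to stdout so "docker logs" can show them
```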

But I think you can check the Fatal Error topic; you may be getting the same error as many other people.

The docker version of storagenode uses log.output=stderr by default, so if it’s not changed, then docker logs storagenode will show logs out of the box.
The writability and readability timeouts from the mentioned thread seem to affect mostly slow Windows nodes with RAID, low RAM, or USB- or network-connected drives.

@nyancodex you are correct, the docker node will be restarted after a crash.
Let’s try to filter errors in a different way:

docker logs storagenode 2>&1 | grep -i error | grep -vE "GET|PUT" | tail

Yes, this one seems to work!! But I tried a few things before and restarted the container, so now I’ve lost the previous log and have to wait for the next fatal error.

2023-04-16T03:08:34.449Z        INFO    Invalid configuration file key  {"Process": "storagenode-updater", "Key": "error"}
2023-04-16T03:08:34.589Z        INFO    Invalid configuration file key  {"Process": "storagenode", "Key": "error"}
2023-04-16T03:18:17.532Z        ERROR   piecestore:cache        error getting current used space:       {"Process": "storagenode", "error": "context canceled; context canceled; context canceled; context canceled; context canceled; context canceled", "errorVerbose": "group:\n--- context canceled\n--- context canceled\n--- context canceled\n--- context canceled\n--- context canceled\n--- context canceled"}
2023-04-16T03:18:17.547Z        ERROR   pieces:trash    emptying trash failed   {"Process": "storagenode", "error": "pieces error: filestore error: context canceled", "errorVerbose": "pieces error: filestore error: context canceled\n\tstorj.io/storj/storage/filestore.(*blobStore).EmptyTrash:156\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).EmptyTrash:316\n\tstorj.io/storj/storagenode/pieces.(*Store).EmptyTrash:400\n\tstorj.io/storj/storagenode/pieces.(*TrashChore).Run.func1.1:83\n\tstorj.io/common/sync2.(*Workplace).Start.func1:89"}

Are those errors normal?

Thanks,
Pascal

You should look for errors with "FATAL".

No, they are not normal: your node cannot get information about free space from your disk. It is either corrupted or too slow to respond.

However, since there are no errors about an unexpected stop of the runner, they are likely not fatal.
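If you want to rule out a disk that is too slow to respond, a rough sanity check over SSH on the NAS could look like this (standard shell tools assumed; the paths are illustrative):

```
# confirm the volume is mounted and has free space
df -h /volume1

# rough write test against the storj data directory (path is illustrative)
time dd if=/dev/zero of=/volume1/docker/storj/data/ddtest bs=1M count=100 conv=fsync
rm /volume1/docker/storj/data/ddtest
```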

You have a Synology NAS. It’s a piece of cake to check the logs. Access DSM in your browser, go to Docker > Containers > select the storagenode container you want to check > Details > Logs. Then you have a search field at the top right. Enter any word you want to search for and the log lines will be filtered by that word.

Yes, but usually it doesn’t show me enough to get my log. I finally found it.

2023-04-16T10:33:30.273118575Z	stdout	2023-04-16T10:33:30.272Z ERROR piecestore download failed {"Process": "storagenode", "Piece ID": "5JSW3E7IPRDFYHTHZ74DLLRJX3NX4VN6VTTHZSDJ3AFKNJM4CTKQ", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET", "Remote Address": "172.17.0.1:47304", "error": "untrusted: unable to get signee: trust: rpc: tcp connector failed: rpc: dial tcp: lookup ap1.storj.io: operation was canceled", "errorVerbose": "untrusted: unable to get signee: trust: rpc: tcp connector failed: rpc: dial tcp: lookup ap1.storj.io: operation was canceled\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).VerifyOrderLimitSignature:140\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).verifyOrderLimit:62\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:599\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:251\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:124\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:114\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35"}
2023-04-16T10:33:08.728936422Z	stdout	2023-04-16T10:33:03.499Z INFO Version is up to date {"Process": "storagenode-updater", "Service": "storagenode-updater"}
2023-04-16T10:33:08.728894626Z	stdout	2023-04-16T10:33:03.499Z INFO Current binary version {"Process": "storagenode-updater", "Service": "storagenode-updater", "Version": "v1.76.2"}
2023-04-16T10:33:08.728854932Z	stdout	2023-04-16T10:33:03.490Z INFO Version is up to date {"Process": "storagenode-updater", "Service": "storagenode"}
2023-04-16T10:33:08.728806217Z	stdout	2023-04-16T10:33:03.490Z INFO Current binary version {"Process": "storagenode-updater", "Service": "storagenode", "Version": "v1.76.2"}
2023-04-16T10:33:08.728755391Z	stdout	2023-04-16 10:33:01,846 INFO success: storagenode-updater entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-04-16T10:33:08.728671823Z	stdout	2023-04-16 10:33:01,846 INFO success: storagenode entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-04-16T10:33:01.846197334Z	stdout	2023-04-16 10:33:01,845 INFO success: processes-exit-eventlistener entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-04-16T10:33:00.845814356Z	stdout	2023-04-16T10:33:00.845Z INFO Downloading versions. {"Process": "storagenode-updater", "Server Address": "https://version.storj.io"}
2023-04-16T10:33:00.845205578Z	stdout	2023-04-16T10:33:00.844Z INFO Running on version {"Process": "storagenode-updater", "Service": "storagenode-updater", "Version": "v1.76.2"}
2023-04-16T10:33:00.825020974Z	stdout	2023-04-16T10:33:00.824Z INFO Anonymized tracing enabled {"Process": "storagenode-updater"}
2023-04-16T10:33:00.824979435Z	stdout	2023-04-16T10:33:00.824Z INFO Invalid configuration file value for key {"Process": "storagenode-updater", "Key": "log.stack"}
2023-04-16T10:33:00.824941450Z	stdout	2023-04-16T10:33:00.823Z INFO Invalid configuration file value for key {"Process": "storagenode-updater", "Key": "log.output"}
2023-04-16T10:33:00.824887527Z	stdout	2023-04-16T10:33:00.823Z INFO Invalid configuration file value for key {"Process": "storagenode-updater", "Key": "log.development"}
2023-04-16T10:33:00.824849413Z	stdout	2023-04-16T10:33:00.823Z INFO Invalid configuration file key {"Process": "storagenode-updater", "Key": "contact.external-address"}
2023-04-16T10:33:00.824811694Z	stdout	2023-04-16T10:33:00.823Z INFO Invalid configuration file key {"Process": "storagenode-updater", "Key": "healthcheck.details"}
2023-04-16T10:33:00.824766809Z	stdout	2023-04-16T10:33:00.823Z INFO Invalid configuration file key {"Process": "storagenode-updater", "Key": "server.address"}
2023-04-16T10:33:00.824728955Z	stdout	2023-04-16T10:33:00.823Z INFO Invalid configuration file key {"Process": "storagenode-updater", "Key": "storage.allocated-bandwidth"}
2023-04-16T10:33:00.824690266Z	stdout	2023-04-16T10:33:00.823Z INFO Invalid configuration file key {"Process": "storagenode-updater", "Key": "console.address"}
2023-04-16T10:33:00.824650093Z	stdout	2023-04-16T10:33:00.823Z INFO Invalid configuration file key {"Process": "storagenode-updater", "Key": "operator.wallet-features"}
2023-04-16T10:33:00.824611539Z	stdout	2023-04-16T10:33:00.823Z INFO Invalid configuration file key {"Process": "storagenode-updater", "Key": "operator.wallet"}
2023-04-16T10:33:00.824570343Z	stdout	2023-04-16T10:33:00.823Z INFO Invalid configuration file key {"Process": "storagenode-updater", "Key": "operator.email"}
2023-04-16T10:33:00.824528181Z	stdout	2023-04-16T10:33:00.823Z INFO Invalid configuration file key {"Process": "storagenode-updater", "Key": "healthcheck.enabled"}
2023-04-16T10:33:00.824475331Z	stdout	2023-04-16T10:33:00.823Z INFO Invalid configuration file key {"Process": "storagenode-updater", "Key": "storage.allocated-disk-space"}
2023-04-16T10:33:00.824424111Z	stdout	2023-04-16T10:33:00.823Z INFO Invalid configuration file key {"Process": "storagenode-updater", "Key": "server.private-address"}
2023-04-16T10:33:00.824263681Z	stdout	2023-04-16T10:33:00.823Z INFO Configuration loaded {"Process": "storagenode-updater", "Location": "/app/config/config.yaml"}
2023-04-16T10:33:00.812256698Z	stdout	2023-04-16 10:33:00,807 INFO spawned: 'storagenode-updater' with pid 14
2023-04-16T10:33:00.812220125Z	stdout	2023-04-16 10:33:00,805 INFO spawned: 'storagenode' with pid 13
2023-04-16T10:33:00.812178457Z	stdout	2023-04-16 10:33:00,803 INFO spawned: 'processes-exit-eventlistener' with pid 12
2023-04-16T10:33:00.812102643Z	stdout	2023-04-16 10:32:59,802 INFO supervisord started with pid 1
2023-04-16T10:32:57.973689213Z	stdout	2023-04-16 10:32:57,973 INFO RPC interface 'supervisor' initialized
2023-04-16T10:32:56.685465916Z	stdout	2023-04-16 10:32:56,685 INFO Set uid to user 0 succeeded
2023-04-16T10:30:39.006107934Z	stdout	2023-04-16 10:30:38,928 INFO stopped: processes-exit-eventlistener (terminated by SIGTERM)
2023-04-16T10:30:39.006069609Z	stdout	2023-04-16 10:30:38,927 INFO stopped: storagenode (terminated by SIGKILL)
2023-04-16T10:30:39.006018810Z	stdout	2023-04-16 10:30:38,574 WARN killing 'storagenode' (5895) with SIGKILL
2023-04-16T10:30:39.005890267Z	stdout	2023-04-16 10:30:37,573 INFO waiting for storagenode, processes-exit-eventlistener to die
2023-04-16T10:30:34.570765566Z	stdout	2023-04-16 10:30:34,570 INFO waiting for storagenode, processes-exit-eventlistener to die
2023-04-16T10:30:32.540785010Z	stdout	2023-04-16 10:30:31,566 INFO waiting for storagenode, processes-exit-eventlistener to die
2023-04-16T10:30:32.540749885Z	stdout	2023-04-16 10:30:28,563 INFO stopped: storagenode-updater (exit status 0)
2023-04-16T10:30:32.540696374Z	stdout	2023-04-16T10:30:28.561Z INFO Got a signal from the OS: "terminated" {"Process": "storagenode-updater"}
2023-04-16T10:30:32.540654082Z	stdout	2023-04-16 10:30:28,561 INFO waiting for storagenode, processes-exit-eventlistener, storagenode-updater to die
2023-04-16T10:30:32.540616139Z	stdout	2023-04-16 10:30:28,556 WARN received SIGQUIT indicating exit request
2023-04-16T10:30:32.540575030Z	stdout	2023-04-16 10:30:28,295 INFO spawned: 'storagenode' with pid 5895
2023-04-16T10:30:32.540490885Z	stdout	2023-04-16 10:30:27,235 INFO exited: storagenode (exit status 1; not expected)
2023-04-16T10:30:25.738449022Z	stdout	Error: piecestore monitor: timed out after 1m0s while verifying writability of storage directory 

Thanks,
Pascal

This line gets you straight to this thread: Fatal Error on my Node

tl;dr try this
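If it is the same writability-timeout issue discussed in that thread, the usual suggestion there is to raise the piecestore monitor timeouts in config.yaml and restart the node, roughly like this (the values are illustrative; the keys are the ones commonly referenced on the forum):

```
# config.yaml -- give a slow disk more time before the monitor raises a fatal timeout
storage2.monitor.verify-dir-writable-timeout: 1m30s
storage2.monitor.verify-dir-readable-timeout: 2m0s
```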
