Odroid HC2 seems to keep crashing randomly

gingerbread233 · February 28, 2023, 12:10am

Hello,

I’ve been running a Storj node on an Odroid HC2 for almost a year, with no problems until this month. The system seems to crash randomly, which is giving me a bad online score. I downloaded the log file via SSH (/var/log/syslog), but I can’t figure out why it’s happening. Are there any other logs that could be helpful? Could somebody take a look at the log file? I’d appreciate your help.

Knowledge · February 28, 2023, 1:00am

Well, when you say crashing, do you mean it reboots itself, or freezes, or what happens?

Very often the cause is that the databases have become corrupt or your drive has errors. The info below will help you check your logs. See if you can find the time when it crashes and whether an error is reported.

gingerbread233 · February 28, 2023, 6:03am

It looks like my whole system reboots each time. After a reboot, the Docker containers, including the Storj container, start only partially or not at all. OpenMediaVault is reporting partially damaged sectors on the hard drive; could this be causing my system to crash or reboot? It was running pretty well until this month, though. My containers run on the main microSD card, and the Storj data is located on the HDD. I read that only floating sectors are dangerous. I don’t get any DB errors, so I think the DB of the Storj container is fine.

sembeth · February 28, 2023, 8:19am

Check your microSD card. They don’t last that long, especially if Docker is constantly writing log files to them.

MattJE96011 · March 1, 2023, 3:48am

Also, if you’re hitting the drive hard and end up with too much pending I/O and latency, it could cause the system to lock up in some cases.

gingerbread233 · March 1, 2023, 2:02pm

The microSD card doesn’t seem to be the problem: I tried a new one and got the same issues. I think it might be the drive. S.M.A.R.T. is telling me that the drive has damaged blocks. I read that only floating sectors are critical, but maybe the damaged blocks are causing the crashes. Once, when I tried to start the Storj container, Docker told me the mount was not detected. So maybe the drive is beginning to degrade.
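
For reference, the SMART counters usually worth watching for bad blocks are the reallocated, pending, and offline-uncorrectable sector counts. Below is a minimal sketch for pulling them out of the attribute table. It assumes smartmontools is installed, that the drive is /dev/sda (an assumption, check with lsblk), and that “floating sectors” corresponds to the Current_Pending_Sector attribute. smart_sector_counts is a hypothetical helper name, not part of any tool.

```shell
# Hypothetical helper: reads `smartctl -A` output on stdin and prints the
# raw value (10th column) of the three sector-health attributes.
smart_sector_counts() {
  awk '$2 == "Reallocated_Sector_Ct" ||
       $2 == "Current_Pending_Sector" ||
       $2 == "Offline_Uncorrectable" { print $2 "=" $10 }'
}

# Usage (device name /dev/sda is an assumption):
#   sudo smartctl -A /dev/sda | smart_sector_counts
```

Non-zero and growing values, especially for pending sectors, would support the suspicion that the drive is degrading.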

Knowledge · March 1, 2023, 3:12pm

In my experience, when drives get dismounted and report errors, the power supply is often insufficient or failing. So whatever is powering the drive may be at fault.

gingerbread233 · March 3, 2023, 9:02pm

The Storj container is giving me this error. It’s the only one I could find.

2023-03-03T20:45:33.768Z	ERROR	piecedeleter	could not send delete piece to trash	{Process: storagenode, Satellite ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S, Piece ID: EIPQVHHCZ3QCGFCT4HPM2ULN6TLL2ORHQFIKDHQ2HAAYKVFAERNA, error: pieces error: pieceexpirationdb: database is locked, errorVerbose: pieces error: pieceexpirationdb: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(pieceExpirationDB).Trash:112\n\tstorj.io/storj/storagenode/pieces.(Store).Trash:368\n\tstorj.io/storj/storagenode/pieces.(Deleter).deleteOrTrash:185\n\tstorj.io/storj/storagenode/pieces.(Deleter).work:135\n\tstorj.io/storj/storagenode/pieces.(Deleter).Run.func1:72\n\tgolang.org/x/sync/errgroup.(Group).Go.func1:75}
2023-03-03T20:45:36.532Z	ERROR	piecestore	failed to add bandwidth usage	{Process: storagenode, error: bandwidthdb: database is locked, errorVerbose: bandwidthdb: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(bandwidthDB).Add:60\n\tstorj.io/storj/storagenode/piecestore.(Endpoint).beginSaveOrder.func1:829\n\tstorj.io/storj/storagenode/piecestore.(Endpoint).Upload:521\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:235\n\tstorj.io/drpc/drpcmux.(Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(Handler).HandleRPC:61\n\tstorj.io/common/experiment.(Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(Server).handleRPC:122\n\tstorj.io/drpc/drpcserver.(Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(Server).Serve.func2:112\n\tstorj.io/drpc/drpcctx.(Tracker).track:35}
2023-03-03T20:45:36.482Z	ERROR	piecestore	failed to add bandwidth usage	{Process: storagenode, error: bandwidthdb: database is locked, errorVerbose: bandwidthdb: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(bandwidthDB).Add:60\n\tstorj.io/storj/storagenode/piecestore.(Endpoint).beginSaveOrder.func1:829\n\tstorj.io/storj/storagenode/piecestore.(Endpoint).Upload:521\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:235\n\tstorj.io/drpc/drpcmux.(Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(Handler).HandleRPC:61\n\tstorj.io/common/experiment.(Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(Server).handleRPC:122\n\tstorj.io/drpc/drpcserver.(Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(Server).Serve.func2:112\n\tstorj.io/drpc/drpcctx.(Tracker).track:35}
2023-03-03T20:45:36.581Z	ERROR	piecestore	failed to add bandwidth usage	{Process: storagenode, error: bandwidthdb: database is locked, errorVerbose: bandwidthdb: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(bandwidthDB).Add:60\n\tstorj.io/storj/storagenode/piecestore.(Endpoint).beginSaveOrder.func1:829\n\tstorj.io/storj/storagenode/piecestore.(Endpoint).Upload:521\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:235\n\tstorj.io/drpc/drpcmux.(Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(Handler).HandleRPC:61\n\tstorj.io/common/experiment.(Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(Server).handleRPC:122\n\tstorj.io/drpc/drpcserver.(Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(Server).Serve.func2:112\n\tstorj.io/drpc/drpcctx.(Tracker).track:35}
2023-03-03T20:45:36.482Z	ERROR	piecestore	failed to add bandwidth usage	{Process: storagenode, error: bandwidthdb: database is locked, errorVerbose: bandwidthdb: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(bandwidthDB).Add:60\n\tstorj.io/storj/storagenode/piecestore.(Endpoint).beginSaveOrder.func1:829\n\tstorj.io/storj/storagenode/piecestore.(Endpoint).Download.func6:778\n\tstorj.io/storj/storagenode/piecestore.(Endpoint).Download:792\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:243\n\tstorj.io/drpc/drpcmux.(Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(Handler).HandleRPC:61\n\tstorj.io/common/experiment.(Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(Server).handleRPC:122\n\tstorj.io/drpc/drpcserver.(Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(Server).Serve.func2:112\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35}

How can I solve this?
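
To see which databases are affected and how often, the “database is locked” errors can be tallied per database. This is a rough sketch: locked_db_counts is a hypothetical helper name, and the container name storagenode in the usage comment is an assumption. Each log line repeats the message in errorVerbose, so only the first match per line is counted.

```shell
# Hypothetical helper: reads storagenode log lines on stdin and prints
# "<count> <database>" for every database that reported "database is locked".
locked_db_counts() {
  awk 'match($0, /[a-z]+: database is locked/) {
         name = substr($0, RSTART, RLENGTH)
         sub(/: database is locked/, "", name)
         count[name]++
       }
       END { for (n in count) print count[n], n }' | sort -rn
}

# Usage (container name "storagenode" is an assumption):
#   docker logs storagenode 2>&1 | locked_db_counts
```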

Knowledge · March 3, 2023, 9:16pm

Is this an SMR hard drive? Sounds like it’s too slow to keep up with the traffic to me.

As for the database being locked: are you running anything else on this node that might be touching the databases?

gingerbread233 · March 3, 2023, 9:29pm

It’s a CMR drive (WD PURZ 8TB). Only the Storj node data and identity are stored on the drive.

Knowledge · March 3, 2023, 9:44pm

I believe this is likely the file walker process moving pieces to the trash when a customer requests deletion or the pieces expire. @Alexey will have a better idea on that. Is your node still rebooting/locking up? What is the dashboard looking like?

gingerbread233 · March 3, 2023, 9:59pm

Unfortunately, my node is still rebooting or sometimes hanging. It has been going on for a week, and I want to fix it soon so my node won’t get a very bad online score. I don’t have the time to check every hour, day and night, and connect via WireGuard from outside my house to restart the service. Today I had to cut the power and plug it in again because the whole system hung. I don’t know if this is a drive-related issue, but the drive may be failing more often.

Alexey · March 4, 2023, 1:23am

I would recommend stopping and removing the container and checking this disk with a surface scan to mark all bad blocks, then running the storagenode container again.
If it still crashes, you may also search for OOM events in the system journals:

journalctl | grep -i oom
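
A slightly narrower variant of the same search is sketched below. It keeps only the kernel messages the OOM killer typically writes, so unrelated lines that merely mention memory are skipped. oom_lines is a hypothetical helper name, and reading the previous boot with -b -1 assumes the journal is stored persistently.

```shell
# Hypothetical helper: reads kernel log lines on stdin and keeps only the
# ones that look like OOM-killer activity (case-insensitive).
oom_lines() {
  grep -iE 'oom-killer|out of memory: kill|killed process'
}

# Usage, kernel messages of the previous boot (needs a persistent journal):
#   journalctl -k -b -1 | oom_lines
```

If this prints lines naming the storagenode process shortly before a crash, memory exhaustion is a likely cause; no output means no OOM kill was logged for that boot.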

gingerbread233 · March 4, 2023, 2:24pm

What do I have to do after entering the command you mentioned?

My syslog is telling me this:

Feb 28 00:18:22 odroidxu4 monit[7328]: ‘odroidxu4’ mem usage of 90.7% matches resource limit [mem usage > 90.0%]
Feb 28 00:18:22 odroidxu4 monit[7328]: ‘odroidxu4’ loadavg (5min) of 228.7 matches resource limit [loadavg (5min) > 8.0]
Feb 28 00:18:22 odroidxu4 monit[7328]: ‘odroidxu4’ loadavg (1min) of 281.4 matches resource limit [loadavg (1min) > 16.0]
Feb 28 00:18:54 odroidxu4 monit[7328]: ‘odroidxu4’ mem usage of 90.3% matches resource limit [mem usage > 90.0%]
Feb 28 00:18:54 odroidxu4 monit[7328]: ‘odroidxu4’ loadavg (5min) of 233.3 matches resource limit [loadavg (5min) > 8.0]
Feb 28 00:18:54 odroidxu4 monit[7328]: ‘odroidxu4’ loadavg (1min) of 273.2 matches resource limit [loadavg (1min) > 16.0]
Feb 28 00:18:55 odroidxu4 kernel: [ 6461.998680] TCP: out of memory – consider tuning tcp_mem

Could this be a hint of OOM?
According to Netdata, my RAM usage is mostly around 50%, but randomly I get a warning about 90% usage. I don’t know what is spontaneously using the rest of the RAM.
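
On Linux the load average also counts tasks stuck in uninterruptible sleep (usually waiting on disk), so load values in the hundreds may point at stalled I/O rather than CPU work. One way to check is to list the processes in state D while the load is high. This is only a sketch: blocked_procs is a hypothetical helper name, and the usage line assumes the procps version of ps.

```shell
# Hypothetical helper: reads "state pid command" lines on stdin and prints
# pid and command of processes in uninterruptible sleep (state "D").
blocked_procs() {
  awk '$1 ~ /^D/ { print $2, $3 }'
}

# Usage (procps ps assumed):
#   ps -eo state,pid,comm | blocked_procs
```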

Alexey · March 5, 2023, 4:20am

Because your disk seems to be misbehaving due to bad blocks, it has increased latency, which likely leads to more memory usage by the storagenode container. In that case you need to enforce a memory limit on the storagenode container:

docker run -d .... --memory=1g ... storjlabs/storagenode:latest

Choose the limit according to how much memory your OS needs (check what is free after a reboot with no storagenode container running). For example, a Raspberry Pi 3 (1 GiB) can survive with the limit set to 800 MiB.
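
The rule of thumb above can be sketched as a small calculation: take the memory that is available with the storagenode container stopped and subtract some headroom for the OS. suggest_limit_mib is a hypothetical helper name, and the 200 MiB default headroom is an assumption, not a recommended value.

```shell
# Hypothetical helper: reads /proc/meminfo-style text on stdin and prints a
# suggested container memory limit in MiB (MemAvailable minus headroom).
# The headroom argument defaults to 200 MiB, which is an assumed margin.
suggest_limit_mib() {
  headroom_mib="${1:-200}"
  awk -v headroom="$headroom_mib" '
    $1 == "MemAvailable:" { avail = int($2 / 1024) }   # kB -> MiB
    END {
      limit = avail - headroom
      if (limit < 0) limit = 0
      print limit
    }'
}

# Usage, with the storagenode container stopped:
#   suggest_limit_mib 200 < /proc/meminfo
```

The printed number can then be passed to the --memory option with an m suffix, for example --memory=800m.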

gingerbread233 · March 7, 2023, 1:43pm

Thank you, I totally forgot that this option exists.
Is it also possible to set this in Portainer’s “resources” tab, where I have sliders to adjust?

Alexey · March 8, 2023, 4:48am

You may try. I prefer to use the CLI and docker compose, though.