Node stops/hangs regularly

For the past couple of weeks my two nodes have been stopping or hanging regularly, sometimes every 24 hours. Starting or restarting them solves the problem, but I want to get more insight into why they stop/hang.

They are running on a QNAP with the latest firmware and Docker version. Each node runs on its own USB drive.

I stopped Watchtower to avoid any interference from other containers.

Here is an example of a stop this morning:

2023-10-09T14:44:33Z INFO piecestore downloaded {“process”: “storagenode”, “Piece ID”: “UY7BOYWE4OPZXU4OPDKASPJTHT5JSENH7CBSXE5V6IGWWPFI6IZQ”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Action”: “GET”, “Offset”: 0, “Size”: 1280, “Remote Address”: “184.104.224.98:34524”}
2023-10-09T14:44:34Z INFO Download finished. {“Process”: “storagenode-updater”, “From”: “https://github.com/storj/storj/releases/download/v1.88.3/storagenode_linux_amd64.zip”, “To”: “/tmp/storagenode_linux_amd64.2184206584.zip”}
2023-10-09T14:44:34Z INFO Restarting service. {“Process”: “storagenode-updater”, “Service”: “storagenode”}
2023-10-09T14:44:34Z INFO piecestore upload started {“process”: “storagenode”, “Piece ID”: “Y7LUWE6AHXHYUBDUNS7NCNNMIQV2ENEBNWZ2H36YYCUZBKAFGKOA”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Action”: “PUT”, “Available Space”: 5927494286464, “Remote Address”: “216.66.40.82:49314”}
2023-10-09T14:44:34Z INFO piecestore download started {“process”: “storagenode”, “Piece ID”: “UY7BOYWE4OPZXU4OPDKASPJTHT5JSENH7CBSXE5V6IGWWPFI6IZQ”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Action”: “GET”, “Offset”: 0, “Size”: 1280, “Remote Address”: “23.237.191.122:51590”}
2023-10-09T14:44:35Z INFO piecestore upload started {“process”: “storagenode”, “Piece ID”: “ZNZNVXOHC7XDZGHZJVRRGAG5UMYTRIBTNXYP4CZN6YHJRYBA3TRA”, “Satellite ID”: “12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs”, “Action”: “PUT”, “Available Space”: 5927494264448, “Remote Address”: “95.163.17.6:49449”}
2023-10-09T14:44:35Z INFO piecestore downloaded {“process”: “storagenode”, “Piece ID”: “UY7BOYWE4OPZXU4OPDKASPJTHT5JSENH7CBSXE5V6IGWWPFI6IZQ”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Action”: “GET”, “Offset”: 0, “Size”: 1280, “Remote Address”: “23.237.191.122:51590”}
2023-10-09T14:44:35Z INFO Service restarted successfully. {“Process”: “storagenode-updater”, “Service”: “storagenode”}
2023-10-09T14:44:35Z INFO Got a signal from the OS: “interrupt” {“process”: “storagenode”}
2023-10-09T14:44:35Z INFO Current binary version {“Process”: “storagenode-updater”, “Service”: “storagenode-updater”, “Version”: “v1.87.1”}
2023-10-09T14:44:35Z INFO Download started. {“Process”: “storagenode-updater”, “From”: “https://github.com/storj/storj/releases/download/v1.88.3/storagenode-updater_linux_amd64.zip”, “To”: “/tmp/storagenode-updater_linux_amd64.1765104221.zip”}
2023-10-09T14:44:35Z ERROR pieces:trash emptying trash failed {“process”: “storagenode”, “error”: “pieces error: filestore error: context canceled”, “errorVerbose”: “pieces error: filestore error: context canceled\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).EmptyTrash:161\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).EmptyTrash:316\n\tstorj.io/storj/storagenode/pieces.(*Store).EmptyTrash:406\n\tstorj.io/storj/storagenode/pieces.(*TrashChore).Run.func1.1:83\n\tstorj.io/common/sync2.(*Workplace).Start.func1:89”}
2023-10-09T14:44:35Z INFO lazyfilewalker.used-space-filewalker subprocess exited with status {“process”: “storagenode”, “satelliteID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “status”: -1, “error”: “signal: killed”}
2023-10-09T14:44:35Z INFO piecestore upload canceled {“process”: “storagenode”, “Piece ID”: “EBOSRSNE3ARVIBA7B4RNSKICGG2TDQULHUBOFFNNBK244RBHICOA”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Action”: “PUT”, “Size”: 2031616, “Remote Address”: “184.104.224.99:52482”}
2023-10-09T14:44:35Z ERROR pieces failed to lazywalk space used by satellite {“process”: “storagenode”, “error”: “lazyfilewalker: signal: killed”, “errorVerbose”: “lazyfilewalker: signal: killed\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:83\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:105\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:707\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:44\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”}
2023-10-09T14:44:35Z INFO lazyfilewalker.used-space-filewalker starting subprocess {“process”: “storagenode”, “satelliteID”: “12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs”}
2023-10-09T14:44:35Z ERROR lazyfilewalker.used-space-filewalker failed to start subprocess {“process”: “storagenode”, “satelliteID”: “12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs”, “error”: “context canceled”}
2023-10-09T14:44:35Z ERROR pieces failed to lazywalk space used by satellite {“process”: “storagenode”, “error”: “lazyfilewalker: context canceled”, “errorVerbose”: “lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:71\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:105\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:707\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:44\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75”, “Satellite ID”: “12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs”}
2023-10-09T14:44:35Z ERROR piecestore:cache error getting current used space: {“process”: “storagenode”, “error”: “filewalker: context canceled; filewalker: context canceled”, “errorVerbose”: “group:\n— filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:69\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:74\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:44\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75\n— filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:69\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:74\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:44\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75”}
2023-10-09T14:44:35Z INFO piecestore upload canceled {“process”: “storagenode”, “Piece ID”: “VBMK7DD4VJ2O5JFOHVKFEMUNNFKIHMJP6GHUCTRJ2U2QBLSMJOFQ”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Action”: “PUT”, “Size”: 1703936, “Remote Address”: “23.237.191.122:45524”}
2023-10-09T14:44:35Z INFO piecestore upload canceled {“process”: “storagenode”, “Piece ID”: “Y43ZMXZURF2KF7NV4JHKXUB73N6BW25SP7BLTE3CJT7SB4RLOTIQ”, “Satellite ID”: “12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs”, “Action”: “PUT”, “Size”: 1900544, “Remote Address”: “184.104.224.98:55476”}
2023-10-09T14:44:35Z INFO piecestore upload canceled (race lost or node shutdown) {“process”: “storagenode”, “Piece ID”: “Y7LUWE6AHXHYUBDUNS7NCNNMIQV2ENEBNWZ2H36YYCUZBKAFGKOA”}
2023-10-09T14:44:35Z INFO piecestore upload canceled {“process”: “storagenode”, “Piece ID”: “5TKZO26APSYLJKXLJ2MGQFLMYDC2R74HP75IXH3DIJW6M7AOPXDQ”, “Satellite ID”: “121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6”, “Action”: “PUT”, “Size”: 794624, “Remote Address”: “157.211.109.157:45568”}
2023-10-09T14:44:35Z INFO piecestore upload canceled (race lost or node shutdown) {“process”: “storagenode”, “Piece ID”: “ZNZNVXOHC7XDZGHZJVRRGAG5UMYTRIBTNXYP4CZN6YHJRYBA3TRA”}
2023-10-09T14:44:36Z INFO Download finished. {“Process”: “storagenode-updater”, “From”: “https://github.com/storj/storj/releases/download/v1.88.3/storagenode-updater_linux_amd64.zip”, “To”: “/tmp/storagenode-updater_linux_amd64.1765104221.zip”}
2023-10-09T14:44:36Z INFO Restarting service. {“Process”: “storagenode-updater”, “Service”: “storagenode-updater”}
2023-10-09 14:44:36,287 INFO exited: storagenode-updater (exit status 1; not expected)
2023-10-09 14:44:37,303 INFO spawned: ‘storagenode-updater’ with pid 1424
2023-10-09 14:44:37,318 WARN received SIGQUIT indicating exit request
2023-10-09 14:44:37,319 INFO waiting for storagenode, processes-exit-eventlistener, storagenode-updater to die
2023-10-09 14:44:37,320 INFO stopped: storagenode-updater (terminated by SIGTERM)
2023-10-09 14:44:37,541 INFO stopped: storagenode (terminated by SIGTERM)
2023-10-09 14:44:37,542 INFO stopped: processes-exit-eventlistener (terminated by SIGTERM)

Interesting - that SIGINT came two seconds before the final SIGQUIT/SIGTERM. Given how the storagenode-updater logs are interleaved, could it be that the updater decided the node needed updating and triggered the restart?

Do you have anything in your kernel logs indicating that something else decided to send a SIGINT/SIGTERM? Does dmesg report that the OOM killer has been activated?

After the restart, the node is on the current version, v1.88.3. I think yesterday (before this happened) I was on the previous version.

How do I see kernel logs?

There’s a program called dmesg that you can run as root or with sudo to dump the kernel log buffer; use dmesg -w if you want to watch what’s happening live. I don’t know much about your setup, though, so the place to find out what else is going on in your system might be systemd (journalctl -f), inside /var/log, or something similar.
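For example, something along these lines (rough commands; adjust to whatever logging your QNAP firmware actually provides):

    # Watch the kernel ring buffer live (run as root or with sudo)
    dmesg -w

    # Check whether the OOM killer has been terminating processes
    dmesg | grep -iE 'out of memory|oom|killed process'

    # On systems with systemd, follow the full journal instead
    journalctl -f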

You have to find the FATAL entry in the logs.

Which drives / file system?

dmesg: What should I be looking for? This is a big dump.

FATAL: I don’t see any FATAL log record (the only "FATAL" I see is inside a Piece ID, which I don’t think is what you are looking for).

Drives/file system: it is a Seagate Backup+ Hub BK USB drive with an NTFS file system.

This is the root cause. NTFS performs very poorly under Linux, and these drives likely have corrupted filesystems. Please stop and remove the container, safely eject the drive, connect it to Windows, and run chkdsk /f on the drive. It may take several runs before all errors are fixed. Then safely eject it, connect it back to the QNAP, and start the node again; then do the same for the second node.
I would recommend migrating to ext4 if possible (stop and remove the container, back up the data, reformat to ext4, restore the data, then start the container again).
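Roughly like this, assuming the container is named storagenode and the drive shows up as E: on Windows (both are just placeholders for your setup):

    # On the QNAP: stop the node gracefully and remove the container
    docker stop -t 300 storagenode
    docker rm storagenode

    # Safely eject the USB drive, connect it to Windows, and in an
    # administrator Command Prompt repair the filesystem:
    #   chkdsk /f E:
    # Re-run it until no more errors are reported, then eject the drive
    # and reconnect it to the QNAP before starting the node again.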

Thanks! Let me get started on that!

Status update!

  • Node 1: I tried to copy the node data onto an ext4 disk, but forgot to keep the node running during the copy. Stupid me… I lost the node in the meantime.
  • Node 2 (biggest):
    – I migrated the node to an Ubuntu mini PC with a faster CPU and a faster OS disk, moved the databases to a secondary SSD, and it has been running smoothly since. Ingress has been going up.
    – I’m even copying the data to an ext4 disk while keeping the node running (see the sketch below).
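For the copy I’m doing roughly the following (the mount points are just placeholders for my disks):

    # Repeat while the node keeps running, until each pass copies very little
    rsync -aP /mnt/ntfs-old/storagenode/ /mnt/ext4-new/storagenode/

    # Final pass with the node stopped, removing anything deleted at the source
    docker stop -t 300 storagenode
    rsync -aP --delete /mnt/ntfs-old/storagenode/ /mnt/ext4-new/storagenode/
    # Then point the container's mount at the new disk and start it again.

For moving the databases, I believe the relevant setting is storage2.database-dir in config.yaml (or the matching command-line flag), pointed at a folder on the SSD.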

Finding: I’m not sure it is a good idea to run two nodes on one QNAP, given the bottlenecks and instability.

You may still run the second node: when the first is almost full, the second will start to accept more data as soon as the first one becomes full.