Node stops/hangs regularly

For the past couple of weeks my two nodes have been stopping or hanging regularly, sometimes every 24 hours. Starting or restarting them solves the problem, but I want to get more insight into why they stop/hang.

They are running on a QNAP with the latest firmware and Docker version. Each node runs on its own USB drive.

I stopped Watchtower to avoid any interference from other containers.

Here is an example of a stop this morning:

2023-10-09T14:44:33Z INFO piecestore downloaded {“process”: “storagenode”, “Piece ID”: “UY7BOYWE4OPZXU4OPDKASPJTHT5JSENH7CBSXE5V6IGWWPFI6IZQ”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Action”: “GET”, “Offset”: 0, “Size”: 1280, “Remote Address”: “184.104.224.98:34524”}
2023-10-09T14:44:34Z INFO Download finished. {“Process”: “storagenode-updater”, “From”: “https://github.com/storj/storj/releases/download/v1.88.3/storagenode_linux_amd64.zip”, “To”: “/tmp/storagenode_linux_amd64.2184206584.zip”}
2023-10-09T14:44:34Z INFO Restarting service. {“Process”: “storagenode-updater”, “Service”: “storagenode”}
2023-10-09T14:44:34Z INFO piecestore upload started {“process”: “storagenode”, “Piece ID”: “Y7LUWE6AHXHYUBDUNS7NCNNMIQV2ENEBNWZ2H36YYCUZBKAFGKOA”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Action”: “PUT”, “Available Space”: 5927494286464, “Remote Address”: “216.66.40.82:49314”}
2023-10-09T14:44:34Z INFO piecestore download started {“process”: “storagenode”, “Piece ID”: “UY7BOYWE4OPZXU4OPDKASPJTHT5JSENH7CBSXE5V6IGWWPFI6IZQ”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Action”: “GET”, “Offset”: 0, “Size”: 1280, “Remote Address”: “23.237.191.122:51590”}
2023-10-09T14:44:35Z INFO piecestore upload started {“process”: “storagenode”, “Piece ID”: “ZNZNVXOHC7XDZGHZJVRRGAG5UMYTRIBTNXYP4CZN6YHJRYBA3TRA”, “Satellite ID”: “12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs”, “Action”: “PUT”, “Available Space”: 5927494264448, “Remote Address”: “95.163.17.6:49449”}
2023-10-09T14:44:35Z INFO piecestore downloaded {“process”: “storagenode”, “Piece ID”: “UY7BOYWE4OPZXU4OPDKASPJTHT5JSENH7CBSXE5V6IGWWPFI6IZQ”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Action”: “GET”, “Offset”: 0, “Size”: 1280, “Remote Address”: “23.237.191.122:51590”}
2023-10-09T14:44:35Z INFO Service restarted successfully. {“Process”: “storagenode-updater”, “Service”: “storagenode”}
2023-10-09T14:44:35Z INFO Got a signal from the OS: “interrupt” {“process”: “storagenode”}
2023-10-09T14:44:35Z INFO Current binary version {“Process”: “storagenode-updater”, “Service”: “storagenode-updater”, “Version”: “v1.87.1”}
2023-10-09T14:44:35Z INFO Download started. {“Process”: “storagenode-updater”, “From”: “https://github.com/storj/storj/releases/download/v1.88.3/storagenode-updater_linux_amd64.zip”, “To”: “/tmp/storagenode-updater_linux_amd64.1765104221.zip”}
2023-10-09T14:44:35Z ERROR pieces:trash emptying trash failed {“process”: “storagenode”, “error”: “pieces error: filestore error: context canceled”, “errorVerbose”: “pieces error: filestore error: context canceled\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).EmptyTrash:161\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).EmptyTrash:316\n\tstorj.io/storj/storagenode/pieces.(*Store).EmptyTrash:406\n\tstorj.io/storj/storagenode/pieces.(*TrashChore).Run.func1.1:83\n\tstorj.io/common/sync2.(*Workplace).Start.func1:89”}
2023-10-09T14:44:35Z INFO lazyfilewalker.used-space-filewalker subprocess exited with status {“process”: “storagenode”, “satelliteID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “status”: -1, “error”: “signal: killed”}
2023-10-09T14:44:35Z INFO piecestore upload canceled {“process”: “storagenode”, “Piece ID”: “EBOSRSNE3ARVIBA7B4RNSKICGG2TDQULHUBOFFNNBK244RBHICOA”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Action”: “PUT”, “Size”: 2031616, “Remote Address”: “184.104.224.99:52482”}
2023-10-09T14:44:35Z ERROR pieces failed to lazywalk space used by satellite {“process”: “storagenode”, “error”: “lazyfilewalker: signal: killed”, “errorVerbose”: “lazyfilewalker: signal: killed\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:83\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:105\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:707\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:44\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”}
2023-10-09T14:44:35Z INFO lazyfilewalker.used-space-filewalker starting subprocess {“process”: “storagenode”, “satelliteID”: “12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs”}
2023-10-09T14:44:35Z ERROR lazyfilewalker.used-space-filewalker failed to start subprocess {“process”: “storagenode”, “satelliteID”: “12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs”, “error”: “context canceled”}
2023-10-09T14:44:35Z ERROR pieces failed to lazywalk space used by satellite {“process”: “storagenode”, “error”: “lazyfilewalker: context canceled”, “errorVerbose”: “lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:71\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:105\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:707\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:44\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75”, “Satellite ID”: “12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs”}
2023-10-09T14:44:35Z ERROR piecestore:cache error getting current used space: {“process”: “storagenode”, “error”: “filewalker: context canceled; filewalker: context canceled”, “errorVerbose”: “group:\n— filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:69\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:74\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:44\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75\n— filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:69\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:74\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:44\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75”}
2023-10-09T14:44:35Z INFO piecestore upload canceled {“process”: “storagenode”, “Piece ID”: “VBMK7DD4VJ2O5JFOHVKFEMUNNFKIHMJP6GHUCTRJ2U2QBLSMJOFQ”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Action”: “PUT”, “Size”: 1703936, “Remote Address”: “23.237.191.122:45524”}
2023-10-09T14:44:35Z INFO piecestore upload canceled {“process”: “storagenode”, “Piece ID”: “Y43ZMXZURF2KF7NV4JHKXUB73N6BW25SP7BLTE3CJT7SB4RLOTIQ”, “Satellite ID”: “12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs”, “Action”: “PUT”, “Size”: 1900544, “Remote Address”: “184.104.224.98:55476”}
2023-10-09T14:44:35Z INFO piecestore upload canceled (race lost or node shutdown) {“process”: “storagenode”, “Piece ID”: “Y7LUWE6AHXHYUBDUNS7NCNNMIQV2ENEBNWZ2H36YYCUZBKAFGKOA”}
2023-10-09T14:44:35Z INFO piecestore upload canceled {“process”: “storagenode”, “Piece ID”: “5TKZO26APSYLJKXLJ2MGQFLMYDC2R74HP75IXH3DIJW6M7AOPXDQ”, “Satellite ID”: “121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6”, “Action”: “PUT”, “Size”: 794624, “Remote Address”: “157.211.109.157:45568”}
2023-10-09T14:44:35Z INFO piecestore upload canceled (race lost or node shutdown) {“process”: “storagenode”, “Piece ID”: “ZNZNVXOHC7XDZGHZJVRRGAG5UMYTRIBTNXYP4CZN6YHJRYBA3TRA”}
2023-10-09T14:44:36Z INFO Download finished. {“Process”: “storagenode-updater”, “From”: “https://github.com/storj/storj/releases/download/v1.88.3/storagenode-updater_linux_amd64.zip”, “To”: “/tmp/storagenode-updater_linux_amd64.1765104221.zip”}
2023-10-09T14:44:36Z INFO Restarting service. {“Process”: “storagenode-updater”, “Service”: “storagenode-updater”}
2023-10-09 14:44:36,287 INFO exited: storagenode-updater (exit status 1; not expected)
2023-10-09 14:44:37,303 INFO spawned: ‘storagenode-updater’ with pid 1424
2023-10-09 14:44:37,318 WARN received SIGQUIT indicating exit request
2023-10-09 14:44:37,319 INFO waiting for storagenode, processes-exit-eventlistener, storagenode-updater to die
2023-10-09 14:44:37,320 INFO stopped: storagenode-updater (terminated by SIGTERM)
2023-10-09 14:44:37,541 INFO stopped: storagenode (terminated by SIGTERM)
2023-10-09 14:44:37,542 INFO stopped: processes-exit-eventlistener (terminated by SIGTERM)

Interesting - that SIGINT came two seconds before the final SIGQUIT/SIGTERM. Given how the storagenode-updater logs are interleaved, could it be that the updater decided the node needed updating and triggered the restart?

Do you have anything in your kernel logs indicating that something else decided to send a SIGINT/SIGTERM? Does dmesg report that the OOM killer has been activated?

After the restart, the node is on the current version, v1.88.3. I think yesterday (before this happened) I was on the previous version.

How do I see kernel logs?

There’s a program called dmesg that you can run as root or with sudo to dump the kernel log buffer; use dmesg -w if you want to watch what’s happening live. I don’t know much about your setup, though, so the place to find out what else is going on in your system might be systemd (journalctl -f), inside /var/log, or something similar.
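For example, something along these lines (rough commands; adjust to whatever logging your QNAP firmware actually provides):

    # Watch the kernel ring buffer live (run as root or with sudo)
    dmesg -w

    # Check whether the OOM killer has been terminating processes
    dmesg | grep -iE 'out of memory|oom|killed process'

    # On systems with systemd, follow the full journal instead
    journalctl -f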

You have to find the FATAL entry in the logs.

Which drives / file system?

dmesg: What should I be looking for? This is a big dump.

FATAL: I don’t see any FATAL log record (the only "FATAL" I see is inside a Piece ID, which I don’t think is what you are looking for).

Drives/file system: it is a Seagate Backup+ Hub BK USB drive with an NTFS file system.

This is the root cause. NTFS performs very poorly under Linux, and these drives likely have corrupted filesystems. Please stop and remove the container, safely eject the drive, connect it to Windows, and run chkdsk /f on the drive. It may take several runs before all errors are fixed. Then safely eject it, connect it back to the QNAP, and start the node again; then do the same for the second node.
I would recommend migrating to ext4 if possible (stop and remove the container, back up the data, reformat to ext4, restore the data, then start the container again).
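Roughly like this, assuming the container is named storagenode and the drive shows up as E: on Windows (both are just placeholders for your setup):

    # On the QNAP: stop the node gracefully and remove the container
    docker stop -t 300 storagenode
    docker rm storagenode

    # Safely eject the USB drive, connect it to Windows, and in an
    # administrator Command Prompt repair the filesystem:
    #   chkdsk /f E:
    # Re-run it until no more errors are reported, then eject the drive
    # and reconnect it to the QNAP before starting the node again.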

Thanks! Let me get started on that!

Status update!

  • Node 1: I tried to copy the node data onto an ext4 disk, but forgot to keep the node running during the copy. Stupid me… I lost the node in the meantime.
  • Node 2 (biggest):
    – I migrated the node to an Ubuntu mini PC with a faster CPU and a faster OS disk, moved the databases to a secondary SSD, and it has been running smoothly since. Ingress has been going up.
    – I’m even copying the data to an ext4 disk while keeping the node running (see the sketch below).
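For the copy I’m doing roughly the following (the mount points are just placeholders for my disks):

    # Repeat while the node keeps running, until each pass copies very little
    rsync -aP /mnt/ntfs-old/storagenode/ /mnt/ext4-new/storagenode/

    # Final pass with the node stopped, removing anything deleted at the source
    docker stop -t 300 storagenode
    rsync -aP --delete /mnt/ntfs-old/storagenode/ /mnt/ext4-new/storagenode/
    # Then point the container's mount at the new disk and start it again.

For moving the databases, I believe the relevant setting is storage2.database-dir in config.yaml (or the matching command-line flag), pointed at a folder on the SSD.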

Finding: I’m not sure it is a good idea to run two nodes on one QNAP, given the bottlenecks and instability.

You may still run the second node: when the first is almost full, the second will start to accept more data as soon as the first one becomes full.