Node in migration to hashstore failed to restart after power outage

There was a power outage and the UPS shut down the NAS (Synology/Docker), as it is supposed to. The NAS runs 2 nodes on 2 drives.
After the power was back on, the node with active migration failed to restart.
One node restarted as usual - already migrated to hashstore.
One node failed to restart - migration active and not finished.
docker ps -a showed the container had exited x hours ago.
I restarted the node and it works as normal, with no errors. These are the log entries from the shutdown:

2025-11-10T14:19:30Z    INFO    piecemigrate:chore      processed a bunch of pieces     {"Process": "storagenode", "successes": 11100000, "size": 2727849211904}
2025-11-10 14:21:15,248 WARN received SIGTERM indicating exit request
2025-11-10 14:21:15,277 INFO waiting for processes-exit-eventlistener, storagenode, storagenode-updater to die
2025-11-10T14:21:15Z    INFO    Got a signal from the OS: "terminated"  {"Process": "storagenode-updater"}
2025-11-10 14:21:15,735 INFO stopped: storagenode-updater (exit status 0)
2025-11-10T14:21:16Z    INFO    Got a signal from the OS: "terminated"  {"Process": "storagenode"}
2025-11-10T14:21:16Z    ERROR   piecemigrate:chore      failed to enqueue for migration {"Process": "storagenode", "error": "couldn't list new pieces to migrate: filewalker: context canceled", "errorVerbose": "couldn't list new pieces to migrate: filewalker: context canceled\n\tstorj.io/storj/storagenode/piecemigrate.(*Chore).enqueueSatellite:238\n\tstorj.io/storj/storagenode/piecemigrate.(*Chore).runOnce:199\n\tstorj.io/common/sync2.(*Cycle).Run:102\n\tstorj.io/storj/storagenode/piecemigrate.(*Chore).Run.func1:181\n\tstorj.io/common/errs2.(*Group).Go.func1:23", "sat": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2025-11-10T14:21:16Z    ERROR   piecemigrate:chore      failed to enqueue for migration {"Process": "storagenode", "error": "couldn't list new pieces to migrate: filewalker: context canceled", "errorVerbose": "couldn't list new pieces to migrate: filewalker: context canceled\n\tstorj.io/storj/storagenode/piecemigrate.(*Chore).enqueueSatellite:238\n\tstorj.io/storj/storagenode/piecemigrate.(*Chore).runOnce:199\n\tstorj.io/common/sync2.(*Cycle).Run:102\n\tstorj.io/storj/storagenode/piecemigrate.(*Chore).Run.func1:181\n\tstorj.io/common/errs2.(*Group).Go.func1:23", "sat": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2025-11-10T14:21:16Z    INFO    piecemigrate:chore      all enqueued for migration; will sleep before next pooling      {"Process": "storagenode", "active": {"121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6": true, "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S": true, "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs": true}, "interval": "1h0m0s"}
2025-11-10T14:21:16Z    INFO    piecemigrate:chore      couldn't migrate        {"Process": "storagenode", "error": "while copying the piece: context canceled", "errorVerbose": "while copying the piece: context canceled\n\tstorj.io/storj/storagenode/piecemigrate.(*Chore).copyPiece:396\n\tstorj.io/storj/storagenode/piecemigrate.(*Chore).migrateOne:359\n\tstorj.io/storj/storagenode/piecemigrate.(*Chore).processQueue:277\n\tstorj.io/storj/storagenode/piecemigrate.(*Chore).Run.func2:184\n\tstorj.io/common/errs2.(*Group).Go.func1:23", "sat": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "id": "LMEEAPASU7ESP3ZY6RDMRDPOSOORWNU6KUI4BA3FDIQBQBLOK7QA"}
2025-11-10 14:21:18,509 INFO waiting for processes-exit-eventlistener, storagenode to die
2025-11-10 14:21:21,513 INFO waiting for processes-exit-eventlistener, storagenode to die
2025-11-10 14:21:24,518 INFO waiting for processes-exit-eventlistener, storagenode to die
2025-11-10 14:21:26,520 WARN killing 'storagenode' (42) with SIGKILL
2025-11-10 14:21:26,876 WARN stopped: storagenode (terminated by SIGKILL)
2025-11-10 14:21:26,876 WARN stopped: processes-exit-eventlistener (terminated by SIGTERM)

Any ideas? What could have prevented it from restarting? Has anyone experienced this during migration?
Why are there time stamps with and without T between date and time?
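For reference, the exit code and finish time Docker recorded for the stopped container can be read with docker inspect. A minimal sketch, assuming the container is named storagenode as in the official docs:

# why did the container last exit, and is a restart policy set?
docker inspect -f '{{.State.ExitCode}} {{.State.FinishedAt}} {{.State.OOMKilled}}' storagenode
docker inspect -f '{{.HostConfig.RestartPolicy.Name}}' storagenode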

Have you done a file system check?
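If the volume is btrfs (the DSM default), a checksum scrub can be run over SSH while the volume stays mounted. A rough sketch; the /volume1 mount point is an assumption:

# scrub the btrfs volume and report any checksum errors (assumes the data lives on /volume1)
sudo btrfs scrub start /volume1
sudo btrfs scrub status /volume1

DSM's Storage Manager also offers Data Scrubbing from the GUI.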

Looks like it was slow to shut down, so it was killed.

Maybe docker didn’t start it because the shutdown was not clean - what is your restart setting in docker - “unless-stopped” or “always”?
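The current policy can be read back from the container itself. A quick sketch, assuming the container is named storagenode:

# show the restart policy Docker has recorded for the container
docker inspect -f '{{.HostConfig.RestartPolicy.Name}}' storagenode

Either policy should bring the container back after a host reboot, as long as it was not explicitly stopped before the daemon went down.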

The timestamps with a T are from the storagenode and/or storagenode-updater processes; the ones without a T are from the base image, most likely from the supervisor.

Nope. I don't know how to do that, or whether it works on Syno. But it is working after the restart, no errors. It's not the drive.
The restart params are as recommended by the official docs.
Ok, it is slow to stop, but when I stop it manually, it stops in less than 10 seconds. So the 300 seconds set by the docker run command should be enough. The timestamps show that this parameter is ignored. Why is that?
The one thing I can think of is that the UPS service or the OS kills everything, ignoring the wait period of individual programs or of Docker containers, when it receives a shutdown command from the UPS. It's like a panic mode.
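A quick way to see what Docker actually recorded for those two settings (a sketch; it assumes the container is named storagenode and that --stop-timeout was passed at docker run):

# StopTimeout prints <nil> if --stop-timeout was never set on this container
docker inspect -f 'stop-timeout: {{.Config.StopTimeout}} restart: {{.HostConfig.RestartPolicy.Name}}' storagenode

Worth noting that the SIGKILL warning in the log above comes from the supervisor inside the container (the lines without the T), so the timeout that fired may be the supervisor's own wait rather than the 300 seconds given to Docker; that is only a guess from the log format.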