Corrupt node after power outage

Hi :slight_smile:
had a really bad power outage and 2 nodes have a corrupt filesystem but are at least running “normally”. One stopped working completely after a clean reboot. The filesystem seems fine, but the DBs were broken. I moved them all to another folder and recreated them. With the new files I still get this error and my node restarts after a few seconds. What else should I do?

  INFO    Got a signal from the OS: "terminated"  {"Process": "storagenode-updater"}
2024-09-02 20:26:33,335 INFO stopped: storagenode-updater (exit status 0)
2024-09-02T20:26:33+02:00       ERROR   failure during run      {"Process": "storagenode", "error": "Error during preflight check for storagenode databases: preflight: database: \"piece_expiration\": failed inserting test value: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).preflight:513\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).Preflight:438\n\tmain.cmdRun:115\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271", "errorVerbose": "Error during preflight check for storagenode databases: preflight: database: \"piece_expiration\": failed inserting test value: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).preflight:513\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).Preflight:438\n\tmain.cmdRun:115\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271\n\tmain.cmdRun:117\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271"}
Error: Error during preflight check for storagenode databases: preflight: database: "piece_expiration": failed inserting test value: context canceled
        storj.io/storj/storagenode/storagenodedb.(*DB).preflight:513
        storj.io/storj/storagenode/storagenodedb.(*DB).Preflight:438
        main.cmdRun:115
        main.newRunCmd.func1:33
        storj.io/common/process.cleanup.func1.4:392
        storj.io/common/process.cleanup.func1:410
        github.com/spf13/cobra.(*Command).execute:983
        github.com/spf13/cobra.(*Command).ExecuteC:1115
        github.com/spf13/cobra.(*Command).Execute:1039
        storj.io/common/process.ExecWithCustomOptions:112
        main.main:34
        runtime.main:271
2024-09-02 20:26:33,688 INFO stopped: storagenode (exit status 1)
2024-09-02 20:26:33,697 INFO stopped: processes-exit-eventlistener (terminated by SIGTERM)

You need to fix the filesystem (you will likely need to run the fsck command several times until it no longer reports any warnings or errors), then you need to check and fix the databases (a sketch follows below):

or re-create the corrupted ones (losing the history and stats):
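A minimal sketch of both options, assuming an ext4 data drive on /dev/sdX1, a docker container named storagenode, and databases in /mnt/storj/storage (adjust the device, container name, and paths to your setup):

# check the filesystem with the node stopped and the filesystem unmounted; repeat until clean
sudo fsck -f /dev/sdX1
# check a database; "ok" means the file itself is intact
sqlite3 /mnt/storj/storage/piece_expiration.db "PRAGMA integrity_check;"
# or move the databases away and let the node re-create empty ones on start
docker stop -t 300 storagenode
mkdir -p /mnt/storj/db-backup && mv /mnt/storj/storage/*.db /mnt/storj/db-backup/
docker start storagenode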

The drive is not the one with the filesystem errors. Sorry if that was not clear :smiley:

I followed exactly that guide to recreate them, but without moving the old ones back. Stats are not important for me.

piece_expiration.db: rename it and leave it renamed, then try deleting all the others; they will be replaced upon the next restart. “Context canceled” looks to me like it gave up, be it a timeout, a hard fault, or what have you. Possibly a bad cluster, so scan the entire disk afterwards, i.e. used AND free space.
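One way to do such a full scan on Linux, assuming the data drive is /dev/sdX and can be taken offline for a read-only pass (both commands only read, they do not modify data):

# read-only surface scan of the whole disk, used and free space alike
sudo badblocks -sv /dev/sdX
# or a SMART extended self-test; check the result later with: sudo smartctl -a /dev/sdX
sudo smartctl -t long /dev/sdX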

1/3 cent

Sounds like starting from scratch is easier?

Just move all *.db files to a backup dir. Then manually update your storagenode.exe to the newest version, then start the service.
Try that

If you believe so. But if the node is old enough, it could be useful to fix the issue instead.
If the node cannot open a database file, it could mean that the disk has issues; you may try to move the databases to a different drive:
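A rough sketch of that move for a docker node, with an assumed SSD path of /mnt/ssd/storj-dbs: add an extra bind mount for the databases, point the node at it in config.yaml, and move the existing *.db files there while the node is stopped:

# extra option for the docker run command
--mount type=bind,source=/mnt/ssd/storj-dbs,destination=/app/dbs
# and in config.yaml
storage2.database-dir: dbs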

Yes, the node is really old, but the brand-new DBs are not working either, and moving them should change nothing.
I scanned the drive for filesystem errors and found nothing.

I don't know what to do :see_no_evil:

Please search for Unrecoverable and/or FATAL errors in the logs; there is still a chance that the root cause is not the inability to open a database, otherwise moving them actually could help. The “context canceled” error usually means a timeout, i.e. your disks cannot respond as quickly as they should.
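For a docker node that search could look roughly like this (substitute your container name):

docker logs storagenode 2>&1 | grep -E "Unrecoverable|FATAL"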

I created a folder on my SSD and let Docker create all the DBs anew to make sure they are clean. Still the same error, and I think I now know why…

The storage-dir-verification file was missing.
Glad I made a backup of that file. Now I am back online :smiley:
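For reference, that file sits in the root of the storage location; a quick way to confirm it is present (path assumed):

ls -l /mnt/storj/storage/storage-dir-verification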

Now that all DBs are new, I wonder if the dashboard will show the real used space again over time?

A new error after some hours. I read that this is normal for a new database. Should I change something or just let it run?

2024-09-04T18:20:26+02:00       ERROR   blobscache      satPiecesContentSize < 0        {"Process": "storagenode", "satPiecesContentSize": -3328}

Yes, when the used-space-filewalker finishes the scan without errors.

So, it is not new, and the solution is the same: you need all used-space-filewalkers to finish without errors.
You may track them:

docker logs storagenode 2>&1 | grep "\sused-space" | grep -E "started|completed"

Thank you!
I don't get any results with the command:

docker logs storagenode_14TB_WD_Storj_2 2>&1 | grep "\sused-space" | grep -E "started|completed"

Does this mean it's not running?

You may check whether you have any errors related to a filewalker:

docker logs storagenode_14TB_WD_Storj_2 2>&1 | grep "error" | grep "filewalker"

The problem with the docker local logs driver is that it keeps only 5 files of 20MiB each, so only 100MiB of logs, and it has likely lost these lines already. You may increase the number of log files and/or their size, or redirect the logs to a file to have a longer history. In the latter case you also want to configure logrotate so the log file does not grow indefinitely:
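As a sketch, the docker log limits can be raised with log options on the docker run command, and a redirected log file can be rotated with a small logrotate rule (sizes, paths, and schedule here are only examples):

# larger in-docker log history
--log-driver local --log-opt max-size=50m --log-opt max-file=10

# /etc/logrotate.d/storagenode, if the log is redirected to a file via log.output
/mnt/storj/storagenode.log {
    weekly
    rotate 8
    compress
    copytruncate
}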

An alternative method is to use the /mon/ps endpoint on the debug port, or to use a simple naïve method of listing the blobs folders by access date (if you have access times enabled), or this approach:
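For the debug endpoint, a rough example, assuming the debug port has been pinned to 5999 with the debug.addr option (by default it is random):

curl -s http://localhost:5999/mon/ps | grep -i filewalker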

Restarted the node to trigger some errors:

2024-09-06T09:51:35+02:00       INFO    lazyfilewalker.used-space-filewalker    subprocess exited with status   {"Process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "status": -1, "error": "signal: killed"}
2024-09-06T09:51:35+02:00       ERROR   pieces  used-space-filewalker failed    {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Lazy File Walker": true, "error": "lazyfilewalker: signal: killed", "errorVerbose": "lazyfilewalker: signal: killed\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:85\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:133\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:722\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:81\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-09-06T09:51:35+02:00       ERROR   pieces  used-space-filewalker failed    {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Lazy File Walker": false, "error": "filewalker: context canceled", "errorVerbose": "filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:731\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:81\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-09-06T09:51:35+02:00       ERROR   piecestore:cache        encountered error while computing space used by satellite       {"Process": "storagenode", "error": "filewalker: context canceled", "errorVerbose": "filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:731\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:81\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "SatelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-09-06T09:51:35+02:00       ERROR   lazyfilewalker.used-space-filewalker    failed to start subprocess      {"Process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "error": "context canceled"}
2024-09-06T09:51:35+02:00       ERROR   pieces  used-space-filewalker failed    {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Lazy File Walker": true, "error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:133\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:722\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:81\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-09-06T09:51:35+02:00       ERROR   pieces  used-space-filewalker failed    {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Lazy File Walker": false, "error": "filewalker: context canceled", "errorVerbose": "filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:731\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:81\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-09-06T09:51:35+02:00       ERROR   piecestore:cache        encountered error while computing space used by satellite       {"Process": "storagenode", "error": "filewalker: context canceled", "errorVerbose": "filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:731\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:81\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "SatelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-09-06T09:51:35+02:00       INFO    lazyfilewalker.trash-cleanup-filewalker subprocess exited with status   {"Process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "status": -1, "error": "signal: killed"}
2024-09-06T09:51:35+02:00       ERROR   pieces:trash    emptying trash failed   {"Process": "storagenode", "error": "pieces error: lazyfilewalker: signal: killed", "errorVerbose": "pieces error: lazyfilewalker: signal: killed\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:85\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkCleanupTrash:195\n\tstorj.io/storj/storagenode/pieces.(*Store).EmptyTrash:436\n\tstorj.io/storj/storagenode/pieces.(*TrashChore).Run.func1.1:84\n\tstorj.io/common/sync2.(*Workplace).Start.func1:89"}
2024-09-06T09:51:35+02:00       INFO    lazyfilewalker.gc-filewalker    subprocess exited with status   {"Process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "status": -1, "error": "signal: killed"}
2024-09-06T09:51:35+02:00       ERROR   pieces  lazyfilewalker failed   {"Process": "storagenode", "error": "lazyfilewalker: signal: killed", "errorVerbose": "lazyfilewalker: signal: killed\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:85\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkSatellitePiecesToTrash:163\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePiecesToTrash:575\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces:380\n\tstorj.io/storj/storagenode/retain.(*Service).Run.func2:265\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-09-06T09:51:35+02:00       ERROR   retain  retain pieces failed    {"Process": "storagenode", "cachePath": "config/retain", "error": "retain: filewalker: context canceled", "errorVerbose": "retain: filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePiecesToTrash:181\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePiecesToTrash:582\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces:380\n\tstorj.io/storj/storagenode/retain.(*Service).Run.func2:265\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}

You need to search for an earlier error which invoked the node's kill; all these messages are a consequence of that kill somewhere earlier in the log. Please search for an Unrecoverable error in the logs.
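With a docker node, grepping with some leading context can show what happened just before the kill, for example:

docker logs storagenode_14TB_WD_Storj_2 2>&1 | grep -B 10 -E "Unrecoverable|FATAL"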