Hi,
I had a really bad power outage and two nodes ended up with a corrupt filesystem, but they are at least running "normally". One node stopped working completely after a clean reboot. Its filesystem seems fine, but the DBs were broken. I moved them all to another folder and recreated them. With the new files I still get this error and my node restarts after a few seconds. What else should I do?
INFO Got a signal from the OS: "terminated" {"Process": "storagenode-updater"}
2024-09-02 20:26:33,335 INFO stopped: storagenode-updater (exit status 0)
2024-09-02T20:26:33+02:00 ERROR failure during run {"Process": "storagenode", "error": "Error during preflight check for storagenode databases: preflight: database: \"piece_expiration\": failed inserting test value: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).preflight:513\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).Preflight:438\n\tmain.cmdRun:115\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271", "errorVerbose": "Error during preflight check for storagenode databases: preflight: database: \"piece_expiration\": failed inserting test value: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).preflight:513\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).Preflight:438\n\tmain.cmdRun:115\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271\n\tmain.cmdRun:117\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271"}
Error: Error during preflight check for storagenode databases: preflight: database: "piece_expiration": failed inserting test value: context canceled
storj.io/storj/storagenode/storagenodedb.(*DB).preflight:513
storj.io/storj/storagenode/storagenodedb.(*DB).Preflight:438
main.cmdRun:115
main.newRunCmd.func1:33
storj.io/common/process.cleanup.func1.4:392
storj.io/common/process.cleanup.func1:410
github.com/spf13/cobra.(*Command).execute:983
github.com/spf13/cobra.(*Command).ExecuteC:1115
github.com/spf13/cobra.(*Command).Execute:1039
storj.io/common/process.ExecWithCustomOptions:112
main.main:34
runtime.main:271
2024-09-02 20:26:33,688 INFO stopped: storagenode (exit status 1)
2024-09-02 20:26:33,697 INFO stopped: processes-exit-eventlistener (terminated by SIGTERM)
You need to fix the filesystem (you will likely need to run the fsck command several times, until it no longer reports any warnings or errors), then you need to check and fix the databases (see the check sketch below):
or re-create the corrupted ones (losing the history and stats):
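For the check step, a minimal sketch, assuming the node is stopped and the databases live in /mnt/storagenode/storage (the path is an assumption, use your own storage location):

```bash
# stop the node first, then check every database for corruption
for db in /mnt/storagenode/storage/*.db; do
    echo "=== $db ==="
    sqlite3 "$db" "PRAGMA integrity_check;"
done
```

Any database that does not report "ok" needs to be repaired or re-created.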
piece_expiration.db: rename it and leave it renamed, then try deleting all the others; replacements are created on the next restart (see the sketch below). "Context canceled" looks to me like it gave up, be it a timeout, a hard fault, or whatever. Possibly a bad cluster, so scan the entire disk afterwards, i.e. used AND free space.
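Roughly like this, assuming a docker setup; the container name and path are assumptions, adjust them to your setup:

```bash
docker stop -t 300 storagenode
mv /mnt/storagenode/storage/piece_expiration.db /mnt/storagenode/storage/piece_expiration.db.bad
docker start storagenode   # an empty piece_expiration.db is recreated on startup
```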
If you believe so. But if the node is old enough, it may be worth fixing the issue instead.
If the node cannot open a database file, it could mean that the disk has issues; you may try to move the databases to a different drive:
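The relevant option is storage2.database-dir. A rough sketch, assuming a docker node and that the existing *.db files are copied to the new location first (all paths and the container name are assumptions):

```bash
docker stop -t 300 storagenode
mkdir -p /mnt/ssd/storagenode-dbs
cp -p /mnt/storagenode/storage/*.db /mnt/ssd/storagenode-dbs/
# point the node at the new location in config.yaml:
#   storage2.database-dir: /app/dbs
# and mount it when you re-create the container:
#   --mount type=bind,source=/mnt/ssd/storagenode-dbs,destination=/app/dbs
```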
Yes, the node is really old, but brand-new DBs are not working either, and moving them shouldn't change anything.
I scanned the drive for filesystem errors and found nothing.
Please search for Unrecoverable and/or FATAL errors in the logs; there is still a chance that the root cause is not the inability to open a database, otherwise moving them could actually help. "Context canceled" usually means a timeout, i.e. your disks cannot respond as quickly as they should.
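For example, something like this, assuming the container is named storagenode:

```bash
docker logs storagenode 2>&1 | grep -E "Unrecoverable|FATAL" | tail -n 20
```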
The problem with the docker local logs driver is that it keeps only 5 files of 20MiB each, so only 100MiB of logs, and it has likely already lost these lines. You may increase the number of log files and/or their size, or redirect the logs to a file to have a longer history. In the latter case you also want to configure logrotate so the file does not grow indefinitely:
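For the docker side you can pass larger limits to docker run, e.g. --log-driver local --log-opt max-size=100m --log-opt max-file=10, or set log.output in config.yaml to write the node's log to a file instead. For the file variant, a logrotate rule could look roughly like this (the log path is an assumption, match it to your log.output setting):

```bash
sudo tee /etc/logrotate.d/storagenode >/dev/null <<'EOF'
/mnt/storagenode/node.log {
    weekly
    rotate 8
    compress
    missingok
    notifempty
    copytruncate
}
EOF
```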
An alternative method is to use the /mon/ps endpoint on the debug port, or the simple naïve method of listing blobs folders by access date (if you have access time enabled), or this approach:
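If you go for the debug endpoint, a rough sketch, assuming you pinned the debug port with debug.addr in config.yaml (by default the node picks a random port):

```bash
# assumes config.yaml contains: debug.addr: "127.0.0.1:5999"
curl -s http://127.0.0.1:5999/mon/ps
```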
You need to search for an earlier error, the one that triggered the node's kill; all these messages are a consequence of the kill command issued somewhere earlier. Please search for Unrecoverable errors in the logs.