Hi everybody, I’m getting an error that seems pretty uncommon, and it’s preventing my node from starting. The node is a couple years old so it would sure be nice to be able to salvage it.
How I got in this mess:
My node’s storage is hosted on NFS (two different VM’s on the same machine). Yes I know that NFS is unsupported, although it had been working fine for a year or two. The database files are stored on local SSD in under the home folder.
Anyway, the NFS server had an “oops” and stopped providing any NFS shares (like… a NFS log file filled up? I hadn’t seen the error before). Upon restarting the NFS server, the other 3 nodes I have that use to host their various it started working okay. THIS node, however can’t start.
Here is the relevant section of the storagenode logs where the errors start when I’m trying to start the server.
storagenode | 2024-06-14T09:08:23-07:00 INFO Node 1dNM3fpVxwsPWDJteHboKP3sPoceeW8vncnZXTNiZHYWsyHq8b started {"Pro cess": "storagenode"}
storagenode | 2024-06-14T09:08:23-07:00 INFO Public server started on [::]:28967 {"Process": "storagenode"}
storagenode | 2024-06-14T09:08:23-07:00 INFO Private server started on 127.0.0.1:7778 {"Process": "storagen ode"}
storagenode | 2024-06-14T09:08:23-07:00 INFO trust Scheduling next refresh {"Process": "storagenode", "after": " 5h1m25.160282917s"}
storagenode | 2024-06-14T09:08:23-07:00 INFO bandwidth Persisting bandwidth usage cache to db {"Process": " storagenode"}
storagenode | 2024-06-14T09:08:23-07:00 DEBUG pieces:trash starting to empty trash {"Process": "storagenode"}
storagenode | 2024-06-14T09:08:23-07:00 INFO pieces:trash emptying trash started {"Process": "storagenode", "S atellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
storagenode | 2024-06-14T09:08:23-07:00 DEBUG contact:chore Starting cycle {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
storagenode | 2024-06-14T09:08:23-07:00 DEBUG contact:chore Starting cycle {"Process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
storagenode | 2024-06-14T09:08:23-07:00 DEBUG contact:chore Starting cycle {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
storagenode | 2024-06-14T09:08:23-07:00 DEBUG contact:chore Starting cycle {"Process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
storagenode | 2024-06-14T09:08:23-07:00 DEBUG version Allowed minimum version from control server. {"Process": " storagenode", "Minimum Version": "1.104.5"}
storagenode | 2024-06-14T09:08:23-07:00 DEBUG version Running on allowed version. {"Process": "storagenode", "V ersion": "1.104.5"}
storagenode | 2024-06-14T09:08:23-07:00 DEBUG contact:endpoint pinged {"Process": "storagenode", "by": "1wF TAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "srcAddr": "35.235.123.171:5491"}
storagenode | 2024-06-14T09:08:23-07:00 INFO pieces:trash emptying trash finished {"Process": "storagenode", "S atellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "elapsed": "162.434795ms"}
storagenode | 2024-06-14T09:08:23-07:00 INFO pieces:trash emptying trash started {"Process": "storagenode", "S atellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
storagenode | 2024-06-14T09:08:23-07:00 DEBUG contact:endpoint pinged {"Process": "storagenode", "by": "1wF TAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "srcAddr": "35.235.123.171:29204"}
storagenode | 2024-06-14T09:08:23-07:00 ERROR piecestore:cache error getting current used space: {"Pro cess": "storagenode", "error": "pieces error: failed to enumerate satellites: node ID: not enough bytes to make a node id; ha ve 1, need 32", "errorVerbose": "pieces error: failed to enumerate satellites: node ID: not enough bytes to make a node id; h ave 1, need 32\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:692\n\tstorj.io/storj/storagenode/p ieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/s torj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
storagenode | 2024-06-14T09:08:23-07:00 ERROR services unexpected shutdown of a runner {"Process": "storagen ode", "name": "piecestore:cache", "error": "pieces error: failed to enumerate satellites: node ID: not enough bytes to make a node id; have 1, need 32", "errorVerbose": "pieces error: failed to enumerate satellites: node ID: not enough bytes to make a node id; have 1, need 32\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:692\n\tstorj.io/storj/s toragenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n \tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
storagenode | 2024-06-14T09:08:23-07:00 ERROR contact:service ping satellite failed {"Process": "storagenode", "S atellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "attempts": 1, "error": "ping satellite: rpc: tcp connec tor failed: rpc: dial tcp 34.159.134.91:7777: operation was canceled", "errorVerbose": "ping satellite: rpc: tcp connector fa iled: rpc: dial tcp 34.159.134.91:7777: operation was canceled\n\tstorj.io/common/rpc.HybridConnector.DialContext.func1:190"}
storagenode | 2024-06-14T09:08:23-07:00 INFO contact:service context cancelled {"Process": "storagenode", "S atellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
storagenode | 2024-06-14T09:08:23-07:00 ERROR contact:service ping satellite failed {"Process": "storagenode", "S atellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "attempts": 1, "error": "ping satellite: rpc: tcp connec tor failed: rpc: dial tcp 34.126.92.94:7777: operation was canceled", "errorVerbose": "ping satellite: rpc: tcp connector fai led: rpc: dial tcp 34.126.92.94:7777: operation was canceled\n\tstorj.io/common/rpc.HybridConnector.DialContext.func1:190"}
storagenode | 2024-06-14T09:08:23-07:00 INFO contact:service context cancelled {"Process": "storagenode", "S atellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
storagenode | 2024-06-14T09:08:23-07:00 ERROR nodestats:cache Get pricing-model/join date failed {"Process": " storagenode", "error": "context canceled"}
storagenode | 2024-06-14T09:08:23-07:00 ERROR contact:service ping satellite failed {"Process": "storagenode", "S atellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "attempts": 1, "error": "ping satellite: context cancele d", "errorVerbose": "ping satellite: context canceled\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:203\ n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:157\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycle s.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:99\n\tstorj.io/common/sync2.(*Cycle).Start.func1:77\n\tgolang.org/x/sync/err group.(*Group).Go.func1:78"}
storagenode | 2024-06-14T09:08:23-07:00 INFO contact:service context cancelled {"Process": "storagenode", "S atellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
And then it just ends up restarting endlessly in a loop with the error.
What I tried:
Making sure I have enough free space on my storage drive and database drive. Making sure network connectivity is good. I also tried just deleting all the .db files like the steps here:
How to fix database: file is not a database error – Storj
Does anybody have any clever ideas on how to salvage the node? All of the data seems to still be safe. A zfs scrub should finish in… 2 days… sigh… to confirm it.