After failure, can't start node with error " not enough bytes to make a node id"

EasyRhino · June 14, 2024, 4:23pm

Hi everybody, I’m getting an error that seems pretty uncommon, and it’s preventing my node from starting. The node is a couple years old so it would sure be nice to be able to salvage it.

How I got in this mess:

My node’s storage is hosted on NFS (two different VM’s on the same machine). Yes I know that NFS is unsupported, although it had been working fine for a year or two. The database files are stored on local SSD in under the home folder.

Anyway, the NFS server had an “oops” and stopped providing any NFS shares (like… a NFS log file filled up? I hadn’t seen the error before). Upon restarting the NFS server, the other 3 nodes I have that use to host their various it started working okay. THIS node, however can’t start.

Here is the relevant section of the storagenode logs where the errors start when I’m trying to start the server.

storagenode   | 2024-06-14T09:08:23-07:00       INFO    Node 1dNM3fpVxwsPWDJteHboKP3sPoceeW8vncnZXTNiZHYWsyHq8b started {"Pro     cess": "storagenode"}
storagenode   | 2024-06-14T09:08:23-07:00       INFO    Public server started on [::]:28967     {"Process": "storagenode"}
storagenode   | 2024-06-14T09:08:23-07:00       INFO    Private server started on 127.0.0.1:7778        {"Process": "storagen     ode"}
storagenode   | 2024-06-14T09:08:23-07:00       INFO    trust   Scheduling next refresh {"Process": "storagenode", "after": "     5h1m25.160282917s"}
storagenode   | 2024-06-14T09:08:23-07:00       INFO    bandwidth       Persisting bandwidth usage cache to db  {"Process": "     storagenode"}
storagenode   | 2024-06-14T09:08:23-07:00       DEBUG   pieces:trash    starting to empty trash {"Process": "storagenode"}
storagenode   | 2024-06-14T09:08:23-07:00       INFO    pieces:trash    emptying trash started  {"Process": "storagenode", "S     atellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
storagenode   | 2024-06-14T09:08:23-07:00       DEBUG   contact:chore   Starting cycle  {"Process": "storagenode", "Satellite      ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
storagenode   | 2024-06-14T09:08:23-07:00       DEBUG   contact:chore   Starting cycle  {"Process": "storagenode", "Satellite      ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
storagenode   | 2024-06-14T09:08:23-07:00       DEBUG   contact:chore   Starting cycle  {"Process": "storagenode", "Satellite      ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
storagenode   | 2024-06-14T09:08:23-07:00       DEBUG   contact:chore   Starting cycle  {"Process": "storagenode", "Satellite      ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
storagenode   | 2024-06-14T09:08:23-07:00       DEBUG   version Allowed minimum version from control server.    {"Process": "     storagenode", "Minimum Version": "1.104.5"}
storagenode   | 2024-06-14T09:08:23-07:00       DEBUG   version Running on allowed version.     {"Process": "storagenode", "V     ersion": "1.104.5"}
storagenode   | 2024-06-14T09:08:23-07:00       DEBUG   contact:endpoint        pinged  {"Process": "storagenode", "by": "1wF     TAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "srcAddr": "35.235.123.171:5491"}
storagenode   | 2024-06-14T09:08:23-07:00       INFO    pieces:trash    emptying trash finished {"Process": "storagenode", "S     atellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "elapsed": "162.434795ms"}
storagenode   | 2024-06-14T09:08:23-07:00       INFO    pieces:trash    emptying trash started  {"Process": "storagenode", "S     atellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
storagenode   | 2024-06-14T09:08:23-07:00       DEBUG   contact:endpoint        pinged  {"Process": "storagenode", "by": "1wF     TAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "srcAddr": "35.235.123.171:29204"}
storagenode   | 2024-06-14T09:08:23-07:00       ERROR   piecestore:cache        error getting current used space:       {"Pro     cess": "storagenode", "error": "pieces error: failed to enumerate satellites: node ID: not enough bytes to make a node id; ha     ve 1, need 32", "errorVerbose": "pieces error: failed to enumerate satellites: node ID: not enough bytes to make a node id; h     ave 1, need 32\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:692\n\tstorj.io/storj/storagenode/p     ieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/s     torj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
storagenode   | 2024-06-14T09:08:23-07:00       ERROR   services        unexpected shutdown of a runner {"Process": "storagen     ode", "name": "piecestore:cache", "error": "pieces error: failed to enumerate satellites: node ID: not enough bytes to make a      node id; have 1, need 32", "errorVerbose": "pieces error: failed to enumerate satellites: node ID: not enough bytes to make      a node id; have 1, need 32\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:692\n\tstorj.io/storj/s     toragenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n     \tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
storagenode   | 2024-06-14T09:08:23-07:00       ERROR   contact:service ping satellite failed   {"Process": "storagenode", "S     atellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "attempts": 1, "error": "ping satellite: rpc: tcp connec     tor failed: rpc: dial tcp 34.159.134.91:7777: operation was canceled", "errorVerbose": "ping satellite: rpc: tcp connector fa     iled: rpc: dial tcp 34.159.134.91:7777: operation was canceled\n\tstorj.io/common/rpc.HybridConnector.DialContext.func1:190"}
storagenode   | 2024-06-14T09:08:23-07:00       INFO    contact:service context cancelled       {"Process": "storagenode", "S     atellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
storagenode   | 2024-06-14T09:08:23-07:00       ERROR   contact:service ping satellite failed   {"Process": "storagenode", "S     atellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "attempts": 1, "error": "ping satellite: rpc: tcp connec     tor failed: rpc: dial tcp 34.126.92.94:7777: operation was canceled", "errorVerbose": "ping satellite: rpc: tcp connector fai     led: rpc: dial tcp 34.126.92.94:7777: operation was canceled\n\tstorj.io/common/rpc.HybridConnector.DialContext.func1:190"}
storagenode   | 2024-06-14T09:08:23-07:00       INFO    contact:service context cancelled       {"Process": "storagenode", "S     atellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
storagenode   | 2024-06-14T09:08:23-07:00       ERROR   nodestats:cache Get pricing-model/join date failed      {"Process": "     storagenode", "error": "context canceled"}
storagenode   | 2024-06-14T09:08:23-07:00       ERROR   contact:service ping satellite failed   {"Process": "storagenode", "S     atellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "attempts": 1, "error": "ping satellite: context cancele     d", "errorVerbose": "ping satellite: context canceled\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:203\     n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:157\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycle     s.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:99\n\tstorj.io/common/sync2.(*Cycle).Start.func1:77\n\tgolang.org/x/sync/err     group.(*Group).Go.func1:78"}
storagenode   | 2024-06-14T09:08:23-07:00       INFO    contact:service context cancelled       {"Process": "storagenode", "S     atellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}

And then it just ends up restarting endlessly in a loop with the error.

What I tried:

Making sure I have enough free space on my storage drive and database drive. Making sure network connectivity is good. I also tried just deleting all the .db files like the steps here:

How to fix database: file is not a database error – Storj

Does anybody have any clever ideas on how to salvage the node? All of the data seems to still be safe. A zfs scrub should finish in… 2 days… sigh… to confirm it.

Toyoo · June 14, 2024, 10:58pm

EasyRhino:

storagenode   | 2024-06-14T09:08:23-07:00       ERROR   services        unexpected shutdown of a runner {"Process": "storagen     ode", "name": "piecestore:cache", "error": "pieces error: failed to enumerate satellites: node ID: not enough bytes to make a      node id; have 1, need 32", "errorVerbose": "pieces error: failed to enumerate satellites: node ID: not enough bytes to make      a node id; have 1, need 32\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:692\n\tstorj.io/storj/s     toragenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n     \tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}

Translating: You have a file or directory in either the config/storage/blobs or config/storage/trash subdirectories whose name does not have the right number of characters to be an ID of a satellite.

Sounds like a fun issue. I hope you have not lost your data

EasyRhino · June 14, 2024, 11:57pm

Toyoo:

EasyRhino:

storagenode   | 2024-06-14T09:08:23-07:00       ERROR   services        unexpected shutdown of a runner {"Process": "storagen     ode", "name": "piecestore:cache", "error": "pieces error: failed to enumerate satellites: node ID: not enough bytes to make a      node id; have 1, need 32", "errorVerbose": "pieces error: failed to enumerate satellites: node ID: not enough bytes to make      a node id; have 1, need 32\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:692\n\tstorj.io/storj/s     toragenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n     \tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}

Translating: You have a file or directory in either the config/storage/blobs or config/storage/trash subdirectories whose name does not have the right number of characters to be an ID of a satellite.

OMG, how many little heart emojis can I give you, and how big is your brain? Answer: it is pretty big.

Here was the directory contents of my blobs folder:

6r2fgwqz3manwt4aogq343bfkh2n5vvg4ohqqgggrrunaaaaaaaa
arej6usf33ki2kukzd5v6xgry2tdr56g45pp3aao6llsaaaaaaaa
hi
pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa
qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa
ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa
v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa

the file called “hi” was there because I did a touch command to make sure I had write permissions correct to the storage drive. I had no idea I was sabotaging my node.

So I removed that ‘hi’ file and it looks like it’s back and running normally.

(I’m also in the situation where none of my nodes are reporting free space remotely close to actual space, so to avoid disk full situations I had to reduce the allocated space a lot and give it a few days for file walker and maybe storj developments to catch up)