Hashstore error preventing node restart

Looks like the machine behind my nodes had a power failure recently. 4/5 nodes came up fine. The last one is failing to start: it logs successfully starting hashstore for two satellites, then logs this and aborts:

failure during run      {"Process": "storagenode", "error": "Failed to create storage node peer: hashstore: logSlots calculation mismatch: size=34603008 logSlots=19\n\tstorj.io/storj/storagenode/hashstore.OpenHashtbl:116\n\tstorj.io/storj/storagenode/hashstore.OpenTable:121\n\tstorj.io/storj/storagenode/hashstore.NewStore:258\n\tstorj.io/storj/storagenode/hashstore.New:93\n\tstorj.io/storj/storagenode/piecestore.(*HashStoreBackend).getDB:248\n\tstorj.io/storj/storagenode/piecestore.NewHashStoreBackend:114\n\tstorj.io/storj/storagenode.New:598\n\tmain.cmdRun:84\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:985\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1117\n\tgithub.com/spf13/cobra.(*Command).Execute:1041\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:272", "errorVerbose": "Failed to create storage node peer: hashstore: logSlots calculation mismatch: size=34603008 logSlots=19\n\tstorj.io/storj/storagenode/hashstore.OpenHashtbl:116\n\tstorj.io/storj/storagenode/hashstore.OpenTable:121\n\tstorj.io/storj/storagenode/hashstore.NewStore:258\n\tstorj.io/storj/storagenode/hashstore.New:93\n\tstorj.io/storj/storagenode/piecestore.(*HashStoreBackend).getDB:248\n\tstorj.io/storj/storagenode/piecestore.NewHashStoreBackend:114\n\tstorj.io/storj/storagenode.New:598\n\tmain.cmdRun:84\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:985\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1117\n\tgithub.com/spf13/cobra.(*Command).Execute:1041\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:272\n\tmain.cmdRun:86\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:985\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1117\n\tgithub.com/spf13/cobra.(*Command).Execute:1041\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:272"}

I use ZFS for the filesystem under all of my nodes, but ironically this is the only one that's redundant - on the others I use ZFS just for consistency in system config, with one drive at a time. From ZFS's perspective, there are no data errors.

This node appears to be running v1.126.2.

Hello @bryanpendleton,
Welcome to the forum!

It seems your hashstore has been corrupted.
I will share this with the team.

Since this doesn't seem to be a problem with all of the node's data, just the one satellite, yet the error causes the node to refuse to start at all, is there any suggestion for how to bring the node up to serve the satellites whose data isn't corrupted?

Put that one satellite on the exclusion list in the config and restart. Wait for an official response though; I don't know whether it will disable that satellite permanently or temporarily.

For example, to disable the Saltlake satellite, put this in the config, or edit the existing line:

# list of trust exclusions
storage2.trust.exclusions: "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE@saltlake.tardigrade.io:7777"
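If you ever needed to exclude more than one satellite, the flag is described as a list, so a comma-separated value would presumably work - note this is an assumption, so check the comments in your own config.yaml; the second entry below is a hypothetical placeholder, not a real satellite:

```yaml
# list of trust exclusions (assumed comma-separated; second entry is a placeholder)
storage2.trust.exclusions: "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE@saltlake.tardigrade.io:7777,<satellite-id>@<satellite-host>:7777"
```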

Any updates here? I’m not even sure which satellite is the one that’s failing.

Unfortunately not. I also do not know whether there is a workaround.

Does this mean that a node using hashstore can die because of an error in a single file???

That’s what tech preview means.


Unfortunately yes.

However, a ZFS system with CoW (copy-on-write) should lower the risk quite significantly - and perhaps that is why only one node, with one satellite, is experiencing this issue. Imagine how the situation for @bryanpendleton would be if this were EXT4 partitions with Hashstore :scream:

It has been raised as a concern many times, and extensively discussed in the hashstore main thread here: [Tech Preview] Hashstore backend for storage nodes

Most recommendations so far point in the direction that a UPS is a very smart addition to your setup if you opt in to hashstore.

It has also been discussed that a repair/rebuild tool may see the light of day in the future (this has not yet been announced by Storj).

It has been merged. You can build it with the latest Go:

git clone https://github.com/storj/storj.git && cd storj
go install ./cmd/write-hashtbl

Then there should be a write-hashtbl binary in your Go bin folder (shown as ~/bin here; by default go install uses ~/go/bin, or $GOBIN if set).
You can view the help:

~/bin/write-hashtbl --help
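If the binary doesn't show up under ~/bin, go install may have put it elsewhere. As a rough sketch, assuming the usual Go defaults ($GOBIN if set, otherwise $GOPATH/bin, otherwise ~/go/bin), you can resolve the directory like this:

```shell
# Resolve the directory `go install` uses for binaries, following the
# usual Go defaults: $GOBIN if set, else $GOPATH/bin, else $HOME/go/bin.
GO_BIN_DIR="${GOBIN:-${GOPATH:-$HOME/go}/bin}"
echo "$GO_BIN_DIR"
ls "$GO_BIN_DIR"/write-hashtbl 2>/dev/null || echo "write-hashtbl not found in $GO_BIN_DIR"
```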

The easiest way is to just use the default flags and pass it one of the store directories, like so:

~/bin/write-hashtbl /mnt/storj/storagenode/storage/hashstore/12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs/s0

and do the same with /s1

You may also use Docker to build a local image with the binary, so that you do not have to install all the developer tools, as described there (you need to replace the command, of course, to build this tool rather than the benchmarks):

Please note that the tool places the generated hashtables in the current directory, so you would need to move them to the proper folder afterwards - or you can start the command from the proper directory in the first place, i.e.

cd /mnt/storj/storagenode/storage/hashstore/12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs/s0
mv meta meta.bak
mkdir meta
cd meta
~/bin/write-hashtbl /mnt/storj/storagenode/storage/hashstore/12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs/s0
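Putting the steps above together for both store directories, here is a hedged dry-run sketch: it only prints the commands rather than running them. The satellite ID and mount point are taken from the example above and will differ on your node, so review the printed output, stop the node, and then run the steps yourself.

```shell
# Dry-run: print the in-place rebuild steps for both store directories (s0, s1).
# Assumes write-hashtbl was installed to ~/bin as in the build step above,
# and that SAT matches your node's actual hashstore path.
SAT=/mnt/storj/storagenode/storage/hashstore/12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs
for store in s0 s1; do
  printf '%s\n' \
    "cd $SAT/$store" \
    "mv meta meta.bak    # keep the old (possibly corrupted) tables as a backup" \
    "mkdir meta" \
    "cd meta" \
    "~/bin/write-hashtbl $SAT/$store"
done
```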