Hashstore error preventing node restart

Looks like the machine behind my nodes had a power failure recently. Four of the five nodes came up fine. The last one fails to start: it logs that hashstore started successfully for two satellites, then logs this and aborts:

failure during run      {"Process": "storagenode", "error": "Failed to create storage node peer: hashstore: logSlots calculation mismatch: size=34603008 logSlots=19\n\tstorj.io/storj/storagenode/hashstore.OpenHashtbl:116\n\tstorj.io/storj/storagenode/hashstore.OpenTable:121\n\tstorj.io/storj/storagenode/hashstore.NewStore:258\n\tstorj.io/storj/storagenode/hashstore.New:93\n\tstorj.io/storj/storagenode/piecestore.(*HashStoreBackend).getDB:248\n\tstorj.io/storj/storagenode/piecestore.NewHashStoreBackend:114\n\tstorj.io/storj/storagenode.New:598\n\tmain.cmdRun:84\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:985\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1117\n\tgithub.com/spf13/cobra.(*Command).Execute:1041\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:272", "errorVerbose": "Failed to create storage node peer: hashstore: logSlots calculation mismatch: size=34603008 logSlots=19\n\tstorj.io/storj/storagenode/hashstore.OpenHashtbl:116\n\tstorj.io/storj/storagenode/hashstore.OpenTable:121\n\tstorj.io/storj/storagenode/hashstore.NewStore:258\n\tstorj.io/storj/storagenode/hashstore.New:93\n\tstorj.io/storj/storagenode/piecestore.(*HashStoreBackend).getDB:248\n\tstorj.io/storj/storagenode/piecestore.NewHashStoreBackend:114\n\tstorj.io/storj/storagenode.New:598\n\tmain.cmdRun:84\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:985\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1117\n\tgithub.com/spf13/cobra.(*Command).Execute:1041\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:272\n\tmain.cmdRun:86\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:985\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1117\n\tgithub.com/spf13/cobra.(*Command).Execute:1041\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:272"}
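
A rough way to read this message, as a sketch only: the check appears to compare the file size against the size implied by a power-of-two slot count. The constants below (a 4096-byte header and 64-byte records) and both function names are assumptions for illustration and are not stated anywhere in this thread; only size=34603008 and logSlots=19 come from the log.

```shell
#!/bin/sh
# Sketch: what size would a table with 2^logSlots records have?
# HEADER_SIZE and RECORD_SIZE are ASSUMED values, not confirmed by the thread.
HEADER_SIZE=4096
RECORD_SIZE=64

# expected_size LOGSLOTS -> prints header + 2^LOGSLOTS * record size
expected_size() {
    echo $(( HEADER_SIZE + (1 << $1) * RECORD_SIZE ))
}

# size_matches SIZE LOGSLOTS -> exit status 0 if SIZE equals the expected size
size_matches() {
    [ "$1" -eq "$(expected_size "$2")" ]
}
```

With those assumed constants, `expected_size 19` prints 33558528, while the log reports 34603008, so `size_matches 34603008 19` fails. That would fit a table file left at an unexpected length by the power failure, though nothing in the thread confirms the cause.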

I use ZFS as the filesystem under all of my nodes, but ironically this is the only one with redundancy. On the others I use ZFS only for consistency in system config, with a single drive each. From ZFS’ perspective, there are no data errors.

This node appears to be running v1.126.2.

Hello @bryanpendleton,
Welcome to the forum!

It seems your hashstore has been corrupted.
I will share this with the team.

Since this doesn’t seem to be a problem with all of the node’s data but just the one satellite, yet the error is causing the node to refuse to start up at all, is there any suggestion for how to bring the node up to serve the satellites it doesn’t have corrupted data for?

Put the one satellite on the exclusion list in the config and restart. Wait for an official response though. I don’t know if it will disable that satellite permanently or temporarily.

For example, to disable the Saltlake satellite, add this to the config, or edit the existing line:

# list of trust exclusions
storage2.trust.exclusions: "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE@saltlake.tardigrade.io:7777"

Any updates here? I’m not even sure which satellite is the one that’s failing.
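
One possible way to narrow it down, as a sketch: the error reports size=34603008, so a file of exactly that size under the hashstore directory is a likely candidate, and its path contains the satellite ID directory. The function name and the example path are assumptions; the path follows the layout used in the commands later in this thread.

```shell
#!/bin/sh
# find_by_size DIR BYTES -> print every regular file under DIR that is
# exactly BYTES bytes long (the "c" suffix tells find to count bytes).
find_by_size() {
    find "$1" -type f -size "${2}c"
}

# Example (path is an assumption, adjust it to your node):
#   find_by_size /mnt/storj/storagenode/storage/hashstore 34603008
```

A match inside a meta directory would point at the store (satellite ID plus s0 or s1) to look at. If nothing matches, this approach does not help and the satellite has to be identified some other way.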

Unfortunately not. I also do not know whether there is a workaround.

Does this mean that a node using hashstore can die because of an error in a single file???

That’s what tech preview means.

Unfortunately yes.

However, a ZFS system with CoW (copy-on-write) should lower the risk quite significantly, and perhaps that is why only one node with one satellite is experiencing this issue. Imagine what the situation for @bryanpendleton would be if these were ext4 partitions with hashstore :scream:

It has been raised as a concern many times, and extensively discussed in the main hashstore thread, "[Tech Preview] Hashstore backend for storage nodes".

Most recommendations so far are that a UPS is a very smart addition to your setup if you opt in to hashstore.

It has also been discussed that a repair/rebuild tool might see the light of day in the future (this has not yet been announced by Storj).

It has been merged. You can build it with the latest Go:

git clone https://github.com/storj/storj.git && cd storj
go install ./cmd/write-hashtbl

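Then there should be a write-hashtbl binary in Go’s install directory ($GOBIN if set, otherwise $GOPATH/bin, which defaults to ~/go/bin). The commands below assume ~/bin, so adjust the path to match your setup.
You can view the help: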

~/bin/write-hashtbl --help

The easiest way is to use the default flags and pass it one of the store directories, like so:

~/bin/write-hashtbl /mnt/storj/storagenode/storage/hashstore/12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs/s0

and do the same for /s1.

You can also use Docker to build a local image containing the binary, which avoids installing all the developer tools, as described there (you need to replace the command, of course, so that it builds this tool rather than the benchmarks):

Please note that the tool places the generated hashtables in the current directory, so you either need to move them to the proper folder afterwards or run the command from the proper directory, e.g.:

cd /mnt/storj/storagenode/storage/hashstore/12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs/s0
mv meta meta.bak
mkdir meta
cd meta
~/bin/write-hashtbl /mnt/storj/storagenode/storage/hashstore/12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs/s0
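
The steps above could be wrapped in a small helper so that s0 and s1 get the same treatment. This is only a sketch: the function name `rebuild_meta` and the `WRITE_HASHTBL` variable are made up for illustration, the refusal to overwrite an existing meta.bak is an added safety check that is not part of the original instructions, and it assumes the node is stopped while it runs.

```shell
#!/bin/sh
# Path to the tool; override via the environment if it lives elsewhere.
WRITE_HASHTBL="${WRITE_HASHTBL:-$HOME/bin/write-hashtbl}"

# rebuild_meta STORE_DIR   (STORE_DIR must be an absolute path)
# Moves STORE_DIR/meta aside to meta.bak, creates a fresh meta directory and
# runs the tool from inside it, so the generated tables land in place.
rebuild_meta() {
    store=$1
    if [ ! -d "$store/meta" ]; then
        echo "no meta directory in $store" >&2
        return 1
    fi
    # Added safety check: never clobber an earlier backup.
    if [ -e "$store/meta.bak" ]; then
        echo "$store/meta.bak already exists, not overwriting it" >&2
        return 1
    fi
    mv "$store/meta" "$store/meta.bak" || return 1
    mkdir "$store/meta" || return 1
    # Subshell so the caller's working directory is left unchanged.
    ( cd "$store/meta" && "$WRITE_HASHTBL" "$store" )
}

# Example, using the satellite directory from the post above:
#   rebuild_meta /mnt/storj/storagenode/storage/hashstore/12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs/s0
#   rebuild_meta /mnt/storj/storagenode/storage/hashstore/12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs/s1
```

Keeping meta.bak around until the node has started and run cleanly leaves a way back if the rebuild turns out not to help.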

For anyone following along, I did the rebuild steps a month ago and my node was able to come back online and has been working since.

How much data was in it and how long did it take to build?

I would like to know more about the procedure. Are there instructions and a ready-made solution for Windows?

Nobody has submitted such a PR so far. I would like to invite you to be the first one :slight_smile:

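Right now you can build this tool by installing Go for Windows and then using the same build commands as in the post above (git clone, then go install ./cmd/write-hashtbl). Then run it as described in the same post: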

cd X:/storagenode/storage/hashstore/12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs/s0
mv meta meta.bak
mkdir meta
cd meta
~/bin/write-hashtbl X:/storagenode/storage/hashstore/12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs/s0

But it may be easier to use the Docker approach mentioned earlier. It doesn’t require installing all the developer tools to build this utility:

  1. Create a Dockerfile:
FROM golang as build
RUN git clone https://github.com/storj/storj.git && \
    cd storj && \
    go install ./cmd/write-hashtbl

FROM ubuntu
WORKDIR /meta
COPY --from=build go/bin/write-hashtbl /usr/bin/
  2. Build the image:
docker build . -t storj-write-hashtbl
  3. Now restore s0 (PowerShell):
cd X:/storagenode/storage/hashstore/12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs/s0
mv meta meta.bak
mkdir meta

docker run -it --rm -v ${PWD}:/hashstore -v ${PWD}/meta:/meta storj-write-hashtbl write-hashtbl /hashstore
  4. Do the same for s1 (PowerShell):
cd X:/storagenode/storage/hashstore/12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs/s1
mv meta meta.bak
mkdir meta

docker run -it --rm -v ${PWD}:/hashstore -v ${PWD}/meta:/meta storj-write-hashtbl write-hashtbl /hashstore

I think the satellite that was broken had a couple of hundred GiB, but I’m not sure. The rebuild took quite a few hours, but less than a day IIRC.
