Hashstore: bad file descriptor

Hello!

This error appeared this morning on my node; it looks like there is an issue with the hashstore. Does anyone know if there is a workaround to solve the issue? (current version: v1.120.4)

2025-01-18 12:28:22,595 INFO spawned: 'storagenode' with pid 78
2025-01-18T12:28:22Z ERROR failure during run {"Process": "storagenode", "error": "Failed to create storage node peer: hashstore: unable to flock: hashstore: bad file descriptor\n\tstorj.io/storj/storagenode/hashstore.NewStore:105\n\tstorj.io/storj/storagenode/hashstore.New:85\n\tstorj.io/storj/storagenode/piecestore.(*HashStoreBackend).getDB:271\n\tstorj.io/storj/storagenode/piecestore.NewHashStoreBackend:105\n\tstorj.io/storj/storagenode.New:618\n\tmain.cmdRun:84\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:272", "errorVerbose": "Failed to create storage node peer: hashstore: unable to flock: hashstore: bad file descriptor\n\tstorj.io/storj/storagenode/hashstore.NewStore:105\n\tstorj.io/storj/storagenode/hashstore.New:85\n\tstorj.io/storj/storagenode/piecestore.(*HashStoreBackend).getDB:271\n\tstorj.io/storj/storagenode/piecestore.NewHashStoreBackend:105\n\tstorj.io/storj/storagenode.New:618\n\tmain.cmdRun:84\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:272\n\tmain.cmdRun:86\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:272"}
Error: Failed to create storage node peer: hashstore: unable to flock: hashstore: bad file descriptor
storj.io/storj/storagenode/hashstore.NewStore:105
storj.io/storj/storagenode/hashstore.New:85
storj.io/storj/storagenode/piecestore.(*HashStoreBackend).getDB:271
storj.io/storj/storagenode/piecestore.NewHashStoreBackend:105
storj.io/storj/storagenode.New:618
main.cmdRun:84
main.newRunCmd.func1:33
storj.io/common/process.cleanup.func1.4:392
storj.io/common/process.cleanup.func1:410
github.com/spf13/cobra.(*Command).execute:983
github.com/spf13/cobra.(*Command).ExecuteC:1115
github.com/spf13/cobra.(*Command).Execute:1039
storj.io/common/process.ExecWithCustomOptions:112
main.main:34
runtime.main:272

I have also been seeing that error on a couple of nodes since yesterday!

Hello @Gamgnesium,

Thank you for alerting us! I’ve passed this on to the team and it will be investigated.
@EasyRhino thank you for confirming you have seen the same error.
A team member will provide more information after the investigation.

This is perplexing. The most common reason for that error is that the file descriptor isn’t actually open, but looking at the code it seems that’s entirely impossible (it is opened immediately beforehand and the error is properly checked). flock() also doesn’t care if the descriptor is open for read or write, or if the file descriptor refers to a directory (on any of the platforms we support, afaik), so it’s not that.
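
If anyone wants to sanity-check that last point themselves: the util-linux flock(1) wrapper drives the same flock(2) call, and it will happily take an exclusive lock on a directory (the path below is just a scratch example):

mkdir -p /tmp/flock-test
flock -n /tmp/flock-test -c 'echo locked a directory just fine'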

Most importantly: what platform are you running on? The code is a little different for Windows and the semantics can vary slightly between macOS/BSD and Linux, so that might be relevant.

Also: can you check those files (I think they should be at storage/hashstore/*/s[01]/meta/lock) and make sure they look like normal files (i.e. not sockets or device nodes or symlinks or fifos, etc.)?

For me, I have four nodes affected.

3 are Linux/x86 and 1 is Linux/ARM (all Docker).
One x86 node and one ARM node are using the badger cache; the other two aren't.
Every one has its DBs on an SSD.
My remaining working nodes are all running version 1.119.15; the dead ones are running 1.120.4.

Hello,

I am running storj with docker (Version: 27.5.0) on Ubuntu 24.04 (x86):

root@storj:/opt/storj# cat /etc/os-release
PRETTY_NAME="Ubuntu 24.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.1 LTS (Noble Numbat)"
VERSION_CODENAME=noble

For the file type checks, it looks ok:

root@storj:/opt/storj# find /opt/storj/data/storage/hashstore/*/s[01]/meta/lock -type f
/opt/storj/data/storage/hashstore/12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S/s0/meta/lock
root@storj:/opt/storj# file /opt/storj/data/storage/hashstore/12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S/s0/meta/lock
/opt/storj/data/storage/hashstore/12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S/s0/meta/lock: empty

Best regards,

Oh yeah, here's what my lock file looks like on one of my broken nodes:

:/srv/storj14002/storage/hashstore/12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S/s0/meta$ ls -al
total 2
drwxr-sr-x 2 nobody users 3 Jan 14 22:29 .
drwxr-sr-x 3 nobody users 3 Jan 14 22:29 ..
-rw-r--r-- 1 nobody users 0 Jan 14 22:29 lock

Oh, and I run all the Docker containers as sudo, so "sudo docker compose up".

Running one hashstore directly as a binary on amd64 Linux here, and I haven't seen this issue yet. Might it be related to Docker?

I'm not a Linux expert, but shouldn't there be an x permission?

No x here either:

total 257M
drwxr-xr-x   2 storj storj 4.0K Jan 18 09:26 .
drwxr-xr-x 259 storj storj 4.0K Jan  1 06:10 ..
-rw-------   1 storj storj 257M Jan 18 09:26 hashtbl-0000000000000078
-rw-r--r--   1 storj storj    0 Dec  6 02:59 lock

No, it doesn’t need an x permission.

Are these lock files, by any chance, on a networked file system like NFS? Or some other unusual type of fs?

Mine are indeed EXACTLY on NFS! All of 'em.

My database is on a local SSD, but the bulk storage folder (where this lock file is) is on an NFS share.

Aha! That makes sense then. NFS is, as Roberto said, not supported or recommended, and the network drive latency could really affect your performance.

If you are determined to use NFS despite these warnings, well, flock() type locks do in fact work over NFS if it’s configured properly. The internet should be able to tell you how to diagnose and make sure all the necessary services are running and able to address each other.
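
One quick sanity check, if you want to see whether flock works at all on that mount, is to run the util-linux flock(1) tool against a scratch file on the NFS share (the path below is just an example; use your own storage location):

touch /srv/storj14002/storage/flock-test
flock -n /srv/storj14002/storage/flock-test -c 'echo flock works on this mount' || echo 'flock failed here too'
rm /srv/storj14002/storage/flock-test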

Well, thanks for the response. My initial Google searches on how to configure NFS for file locking were unsuccessful.

I think my first question is: is it possible to relocate the hashstore directory to a local drive (similar to moving the databases to SSD)? Via config.yaml or some other option?

Not with an option, but you can use symlinks or the --mount type=bind option in your docker run command.
See

However, it's not yet known how safe that is, since the hashtable regeneration is not implemented yet.
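
Just as an illustration (the paths are placeholders and the rest of the usual flags, ports, -e WALLET and so on, are omitted), an extra bind mount over the hashstore folder in a docker run command could look something like this:

docker run -d --name storagenode \
  --mount type=bind,source=/srv/storj14002,destination=/app/config \
  --mount type=bind,source=/home/mark/.local/share/storj14002/identity/storagenode,destination=/app/identity \
  --mount type=bind,source=/mnt/local-ssd/storj14002/hashstore,destination=/app/config/storage/hashstore \
  storjlabs/storagenode:latest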

https://man7.org/linux/man-pages/man5/nfs.5.html#:~:text=local_lock%3Dmechanism
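
That anchor points at the local_lock= mount option. As an example only (server name and paths are placeholders, so adapt and test it before relying on it), a client-side mount that keeps flock-style locks local to the NFS client would look roughly like:

sudo mount -t nfs -o local_lock=flock nas.example.lan:/export/storj /srv/storj14002

# or the equivalent /etc/fstab line:
nas.example.lan:/export/storj  /srv/storj14002  nfs  defaults,local_lock=flock  0  0

Note that with local_lock the locks only protect against other processes on the same client, which is fine as long as only one node host touches that share.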

Thank you Alexey and thepaul. Here’s where I’m at.

I'm not smart enough to get Storj happy with locking the files on my NFS server. Maybe it's a client problem, maybe it's a problem with my TrueNAS settings, maybe there's no normal NFS locking option that works with the flock that Storj is attempting. I dunno.

However, I was able to get the nodes running… well enough? By defining a local drive as a volume in my docker compose file. Either of these syntaxes worked:

    volumes:
      - /srv/storj14002:/app/config
      - /home/mark/.local/share/storj14002/identity/storagenode:/app/identity
      - /home/mark/storj14002/storage:/app/dbs
      - /home/mark/storj14002/hashstore:/app/config/storage/hashstore

or

    volumes:
      - /home/ubuntu/.local/share/storj/identity/storj2/storagenode:/app/identi>
      - type: bind
        source: /srv/storj2
        target: /app/config
      - type: bind
        source: /home/ubuntu/storj2/storage
        target: /app/dbs
      - type: bind
        source: /home/ubuntu/storj2/badger
        target: /app/config/storage/filestatcache
      - type: bind
        source: /home/ubuntu/storj2/hashstore
        target: /app/config/storage/hashstore

And the nodes seem to be up and running and uploading and downloading okay. I mean, I guess, the log isn't full of errors or anything.

HOWEVER, what’s interesting is the resulting setup:

The local SSD "home folder" mapped drive only has a single "meta" folder and a basic .migrate file and nothing else in it.

The mounted NFS storage drive that has everything else also has a hashstore folder, and in here are some .bloomfilter files that seem to have actual data in them.

So, in other words, the only thing that's locally mounted seems to be the lock file, and the actual hashstore stuff is still on the forbidden NFS drive.

This may mean that storj trying to flock the meta file is unnecessary in the first place.

My node randomly started having this issue earlier today after running for nearly 6 months. I run storagenode natively on Linux and have my storage on the forbidden NFS mount over a dedicated 10 Gig fiber connection. I relocated the hashtable to a local drive and symlinked it, and now all is well.
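
For anyone else going the same route, it was roughly the following (node stopped first; the service name and paths are placeholders for whatever your setup uses):

sudo systemctl stop storagenode
mv /mnt/nfs-storage/storage/hashstore /mnt/local-ssd/hashstore
ln -s /mnt/local-ssd/hashstore /mnt/nfs-storage/storage/hashstore
sudo systemctl start storagenode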

It's not forbidden, it's just not supported :slight_smile:
So you need to do some research on how to make the NFS client support flock.

By the way, I tried hashstore with a CIFS mount in storj-up and it's working so far. Of course, there is not much data there since it's not prod, but I have not been able to reproduce this issue so far.

For SMB I was forced to use the nobrl option to make SQLite happy; this also seems to have allowed me to avoid the flock issue.
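
For reference, the kind of mount I mean (server, share, and credentials file are just examples):

sudo mount -t cifs -o credentials=/root/.smbcredentials,uid=1000,gid=1000,nobrl //nas.example.lan/storj /mnt/storj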

However, I saw a PR that should solve this issue for NFS too: https://review.dev.storj.io/c/storj/storj/+/16046
