Node disk came to RO state, node offline

Hi
Today I received a mail notification that the node is offline.
I tried to inspect the log with the following:

sudo docker logs storagenode4 2>&1 | grep "ERROR" | tail

And I’m getting the following output:

      --log.custom-level string           custom level overrides for specific loggers in the format NAME1=ERROR,NAME2=WARN,... Only level increment is supported, and only for selected loggers!

What would you advise in this case?

Off topic:
the node was started in September 2024 and had been working fine.
the startup command is:

docker run -d --restart unless-stopped --stop-timeout 300 -p 28972:28967/tcp -p 28972:28967/udp -p 14006:14002 -e WALLET="YYY" -e EMAIL="XXX@XXX.XXX" -e ADDRESS="ZZZ:28972" -e STORAGE="10TB" --user $(id -u):$(id -g) --mount type=bind,source="/var/storj/storagenode4/identity",destination=/app/identity --mount type=bind,source="/var/storj/storagenode4/data",destination=/app/config --mount type=bind,source="/var/storj/db/storagenode4",destination=/app/dbs --log-opt max-size=50m --log-opt max-file=2 --name storagenode4 storjlabs/storagenode:latest

The other node on the same machine (still in running state for now) gives:

sudo docker logs storagenode1 2>&1 | grep "ERROR" | tail
ERROR   Error retrieving version info.  {"Process": "storagenode-updater", "error": "version checker client: Get \"https://version.storj.io\": dial tcp: lookup version.storj.io on 8.8.8.8:53: read udp 172.17.0.2:59318->8.8.8.8:53: i/o timeout", "errorVerbose": "version checker client: Get \"https://version.storj.io\": dial tcp: lookup version.storj.io on 8.8.8.8:53: read udp 172.17.0.2:59318->8.8.8.8:53: i/o timeout\n\tstorj.io/storj/private/version/checker.(*Client).All:68\n\tmain.loopFunc:20\n\tstorj.io/common/sync2.(*Cycle).Run:163\n\tmain.cmdRun:139\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tstorj.io/common/process.ExecWithCustomConfigAndLogger:77\n\tmain.main:22\n\truntime.main:271"}

According to:

docker logs --tail 20 storagenode

it looks like something is wrong with the storagenode updater:

2024-11-14 08:49:12,228 WARN exited: storagenode (exit status 1; not expected)
2024-11-14 08:49:13,229 INFO gave up: storagenode entered FATAL state, too many start retries too quickly
2024-11-14 08:49:15,232 WARN received SIGQUIT indicating exit request
2024-11-14 08:49:15,233 INFO waiting for processes-exit-eventlistener, storagenode-updater to die
2024-11-14T08:49:15Z    INFO    Got a signal from the OS: "terminated"  {"Process": "storagenode-updater"}
2024-11-14 08:49:15,239 INFO stopped: storagenode-updater (exit status 0)
2024-11-14 08:49:16,241 WARN stopped: processes-exit-eventlistener (terminated by SIGTERM)

I stopped and removed the watchtower container and rolled back, but the node is still in the same condition.

Looks like the filesystem has somehow become read-only:

cat /var/storj/storagenode4/data/node4.log | grep "error" | grep -v "rate" | tail
2024-11-14T02:07:47Z    ERROR   piecestore      upload internal error   {"Process": "storagenode", "error": "pieces error: open config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/rb/o7pqb7lrpszxjedbcbr3mphwiqtv3etpcju5mfefjxungmmkcq.sj1: read-only file system", "errorVerbose": "pieces error: open config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/rb/o7pqb7lrpszxjedbcbr3mphwiqtv3etpcju5mfefjxungmmkcq.sj1: read-only file system\n\tstorj.io/storj/storagenode/pieces.(*Writer).Commit:175\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload.func6:518\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload:566\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:294\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:62\n\tstorj.io/common/experiment.(*Handler).HandleRPC:43\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:166\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:108\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:156\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35"}
2024-11-14T02:07:47Z    ERROR   piecestore      upload failed   {"Process": "storagenode", "Piece ID": "RBO7PQB7LRPSZXJEDBCBR3MPHWIQTV3ETPCJU5MFEFJXUNGMMKCQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Remote Address": "79.127.219.44:42302", "Size": 11008, "error": "pieces error: open config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/rb/o7pqb7lrpszxjedbcbr3mphwiqtv3etpcju5mfefjxungmmkcq.sj1: read-only file system", "errorVerbose": "pieces error: open config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/rb/o7pqb7lrpszxjedbcbr3mphwiqtv3etpcju5mfefjxungmmkcq.sj1: read-only file system\n\tstorj.io/storj/storagenode/pieces.(*Writer).Commit:175\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload.func6:518\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload:566\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:294\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:62\n\tstorj.io/common/experiment.(*Handler).HandleRPC:43\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:166\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:108\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:156\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35"}
2024-11-14T02:07:52Z    ERROR   piecestore      upload internal error   {"Process": "storagenode", "error": "pieces error: open config/storage/blobs/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa/bc/krvzzqdyyv73f2qffvo2m2xcvm4ef2ovxcd4cxoiya6q56dj5q.sj1: read-only file system", "errorVerbose": "pieces error: open config/storage/blobs/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa/bc/krvzzqdyyv73f2qffvo2m2xcvm4ef2ovxcd4cxoiya6q56dj5q.sj1: read-only file system\n\tstorj.io/storj/storagenode/pieces.(*Writer).Commit:175\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload.func6:518\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload:566\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:294\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:62\n\tstorj.io/common/experiment.(*Handler).HandleRPC:43\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:166\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:108\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:156\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35"}
2024-11-14T02:07:52Z    ERROR   piecestore      upload failed   {"Process": "storagenode", "Piece ID": "BCKRVZZQDYYV73F2QFFVO2M2XCVM4EF2OVXCD4CXOIYA6Q56DJ5Q", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "PUT", "Remote Address": "207.211.208.130:51772", "Size": 16896, "error": "pieces error: open config/storage/blobs/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa/bc/krvzzqdyyv73f2qffvo2m2xcvm4ef2ovxcd4cxoiya6q56dj5q.sj1: read-only file system", "errorVerbose": "pieces error: open config/storage/blobs/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa/bc/krvzzqdyyv73f2qffvo2m2xcvm4ef2ovxcd4cxoiya6q56dj5q.sj1: read-only file system\n\tstorj.io/storj/storagenode/pieces.(*Writer).Commit:175\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload.func6:518\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload:566\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:294\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:62\n\tstorj.io/common/experiment.(*Handler).HandleRPC:43\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:166\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:108\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:156\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35"}
2024-11-14T02:07:55Z    ERROR   piecestore      upload internal error   {"Process": "storagenode", "error": "pieces error: open config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/ck/nvmrlbaip7gumasrcwvxebzjp7ejcmx736luelycb7sis6ih7q.sj1: read-only file system", "errorVerbose": "pieces error: open config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/ck/nvmrlbaip7gumasrcwvxebzjp7ejcmx736luelycb7sis6ih7q.sj1: read-only file system\n\tstorj.io/storj/storagenode/pieces.(*Writer).Commit:175\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload.func6:518\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload:566\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:294\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:62\n\tstorj.io/common/experiment.(*Handler).HandleRPC:43\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:166\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:108\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:156\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35"}
2024-11-14T02:07:55Z    ERROR   piecestore      upload failed   {"Process": "storagenode", "Piece ID": "CKNVMRLBAIP7GUMASRCWVXEBZJP7EJCMX736LUELYCB7SIS6IH7Q", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Remote Address": "79.127.219.44:52558", "Size": 53760, "error": "pieces error: open config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/ck/nvmrlbaip7gumasrcwvxebzjp7ejcmx736luelycb7sis6ih7q.sj1: read-only file system", "errorVerbose": "pieces error: open config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/ck/nvmrlbaip7gumasrcwvxebzjp7ejcmx736luelycb7sis6ih7q.sj1: read-only file system\n\tstorj.io/storj/storagenode/pieces.(*Writer).Commit:175\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload.func6:518\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload:566\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:294\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:62\n\tstorj.io/common/experiment.(*Handler).HandleRPC:43\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:166\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:108\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:156\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35"}
2024-11-14T02:07:55Z    ERROR   services        unexpected shutdown of a runner {"Process": "storagenode", "name": "piecestore:monitor", "error": "piecestore monitor: error verifying writability of storage directory: open config/storage/write-test4027274399: read-only file system", "errorVerbose": "piecestore monitor: error verifying writability of storage directory: open config/storage/write-test4027274399: read-only file system\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:156\n\tstorj.io/common/sync2.(*Cycle).Run:163\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:137\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-11-14T02:07:55Z    ERROR   gracefulexit:chore      error retrieving satellites.    {"Process": "storagenode", "error": "satellitesdb: context canceled", "errorVerbose": "satellitesdb: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*satellitesDB).ListGracefulExits.func1:201\n\tstorj.io/storj/storagenode/storagenodedb.(*satellitesDB).ListGracefulExits:213\n\tstorj.io/storj/storagenode/gracefulexit.(*Service).ListPendingExits:59\n\tstorj.io/storj/storagenode/gracefulexit.(*Chore).AddMissing:55\n\tstorj.io/common/sync2.(*Cycle).Run:163\n\tstorj.io/storj/storagenode/gracefulexit.(*Chore).Run:48\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-11-14T02:07:55Z    ERROR   failure during run      {"Process": "storagenode", "error": "piecestore monitor: error verifying writability of storage directory: open config/storage/write-test4027274399: read-only file system", "errorVerbose": "piecestore monitor: error verifying writability of storage directory: open config/storage/write-test4027274399: read-only file system\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:156\n\tstorj.io/common/sync2.(*Cycle).Run:163\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:137\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-11-14T02:07:55Z    FATAL   Unrecoverable error     {"Process": "storagenode", "error": "piecestore monitor: error verifying writability of storage directory: open config/storage/write-test4027274399: read-only file system", "errorVerbose": "piecestore monitor: error verifying writability of storage directory: open config/storage/write-test4027274399: read-only file system\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:156\n\tstorj.io/common/sync2.(*Cycle).Run:163\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:137\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}

Dunno how it happened; I haven't changed anything myself in the last couple of weeks.
Any thoughts?

Hi @karacurt,

That sounds frustrating! Other SNOs have run into similar read-only issues, often due to permissions, filesystem errors, or hardware settings. Check out these discussions to see if any match your situation: Storj Forum - Read-only file system search.
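
As a quick first check, it can help to confirm on the host that the mount really is read-only and to look for the kernel message that triggered the remount. A minimal sketch, assuming the data lives under /var/storj/storagenode4 (adjust to your layout):

# show the mount options of the filesystem backing the data path
findmnt -T /var/storj/storagenode4 -o TARGET,SOURCE,FSTYPE,OPTIONS

# try a write test directly on the host
touch /var/storj/storagenode4/data/.rw-test && rm /var/storj/storagenode4/data/.rw-test

# the kernel log usually explains why the FS was remounted read-only
sudo dmesg | grep -iE "remount|read-only|i/o error" | tail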

If these don’t resolve it, feel free to share more details—we’re here to help!

Good luck!

2 Likes

OK, the obvious thing to try is fsck -ycjp /dev/sda and then look at what it reports.

1 Like

It seems like a network or descriptor issue. Does recreating the container help?
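
For reference, recreating the container only removes the container itself; the identity, data, and databases stay on the host bind mounts. A minimal sketch, using the container name from your run command:

sudo docker stop -t 300 storagenode4
sudo docker rm storagenode4
# then re-run the original docker run ... command from the first post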

  1. Recreated the node: didn't work.
  2. Changed the port on the router and recreated the node: didn't work.

Then I ran fsck -ycjp /dev/sda (output not included here), and after that the node woke up and is working for now.

If you get a message that the FS was modified, you need to run fsck again, repeating until it no longer modifies the FS.
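
Something like this, as a minimal sketch; it assumes /dev/sda is the affected disk and /var/storj/storagenode4 is its mount point (adjust to your layout), and that the filesystem is unmounted while fsck runs:

# stop the node and unmount the filesystem before checking it
sudo docker stop -t 300 storagenode4
sudo umount /var/storj/storagenode4

# fsck exits 0 when the FS is clean and 1/2 when it corrected errors,
# so keep re-running it until it comes back clean
until sudo fsck -y /dev/sda; do
    echo "FS was modified, running fsck again..."
done

sudo mount /var/storj/storagenode4
sudo docker start storagenode4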

ok, thank you, got it.
started fsck again

1 Like

Is that really necessary?
I've run fsck about 6 times and the filesystem is still being modified.

Unfortunately, yes. It should stop modifying the FS once all errors are fixed; it cannot fix all of them in a single pass.

1 Like

OK, thank you.
I'll keep going around in circles.

1 Like

Sorry. This only means that the FS is heavily corrupted, so it needs to be fixed. The other way is to sync the data to another disk, reformat the drive and sync it back, which would take a lot of time, as you may know.
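
Roughly like this, as a sketch; the spare-disk path, the device name, and the ext4 choice are assumptions, so double-check them against your setup before reformatting anything:

# stop the node so the data stays consistent during the copy
sudo docker stop -t 300 storagenode4

# copy everything to a spare disk
rsync -aH --info=progress2 /var/storj/storagenode4/ /mnt/spare/storagenode4/

# reformat the corrupted drive (double-check the device name!)
sudo umount /var/storj/storagenode4
sudo mkfs.ext4 /dev/sda
sudo mount /dev/sda /var/storj/storagenode4

# copy the data back and start the node again
rsync -aH --info=progress2 /mnt/spare/storagenode4/ /var/storj/storagenode4/
sudo docker start storagenode4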

Ok, got it:)

That might be the way to go in this case: it's a new node with only about 100 GB of data on it.
What is strange to me is that nothing was migrated; it's just a pre-owned SAS disk with about 500 hours of operation that was started as a new node.

FS corruption doesn't pop up out of the blue; something must have happened, like a power loss, a lost connection to the disk, a bug in a controller, etc.
It also often happens in virtualized environments when you pass through the entire disk/partition.

Oh, there are so many candidates here:

  • Bug in the controller: the Adaptec ASA-70165H HBA was bought on AliExpress; it works, but I can't find appropriate firmware for it and dmesg shows an error about it;
  • CPU errors: no comment, just a bunch of errors in dmesg that cannot be patched;
  • etc.: some other errors in dmesg that I've gathered info on and described, but still have no solution for.

The server I'm running isn't from a major vendor; it's a barebones monster I built myself, so I've struggled with parts and system incompatibilities ever since it was assembled.

So for now it still feels like it came out of the blue: too many questions without answers.
Maybe something is wrong with the HDD itself, since I'm not the first owner, and as it's a SAS disk I had no good way to fully inspect it.

Unfortunately, this was the final straw for me. I was forced to revert my setup back to Windows because of it. Under Linux, all my filesystems became corrupted very quickly, and one node was even disqualified on the Saltlake satellite.
See

Looks like it :(
I just want to get experience with Linux, network building, etc., and I never imagined how deep this rabbit hole goes.
And sometimes it's lonely and scary down here :)

But I've had enough bad trips with Windows, so I will keep going forward.

1 Like

I've read your story at that link, and it describes exactly my primal fears about this process.
I'm moving slowly, step by step, with time reserved for monitoring and measurements to compensate for my lack of skills:

  • Ran a node on a Raspberry Pi;
  • Ran a newly configured node on Linux;
  • Now I'm in the process of migrating from Windows to Linux.

I'm using the Debian distro in all cases, just to have a clear view of the process…
And because it's the first distro that deployed successfully on the server (Ubuntu didn't work).

Sometimes I freak out when something I haven't foreseen lands me in unexpected trouble that I can't get out of…