I just figured out that the storagenode process exited, but somehow it was still using storage with a usage pattern similar to the filewalker. Restarted all that, now we'll see if that was the issue. There should be no hardware issues on that node. There was no fatal error the first time, but it might've gotten buried by thousands of lines of trash output on stdout, so I might've missed it. The second time there was a fatal error due to non-writable storage, so I went to check the stats. Never seen this before, everything else is normal; might be some kind of hiccup after 1.5 months of uptime.
It's possible. If the lazy mode is enabled (it's enabled by default), almost all filewalkers are spawned as separate processes.
If you use docker, you may see all processes with this command:
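For example (assuming the container is named "storagenode"), this lists every process running inside the container, so any lingering lazy filewalker subprocesses would show up alongside the main one:

docker top storagenode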
Sadly I didn't check after that specific exit. It happened again; something is causing this issue, and it wasn't happening before. I've updated the node to 1.109.2, will see if it manifests again…
Nope, the node exited again with thousands of lines of output in stdout, and the last log excerpt has a long list of "error sending hash and order limit" and "download failed" entries with "write: broken pipe" errors. The node has stayed running even though the command has exited. I have 2 storagenode processes still running, utilizing 60% of the disk and 1.3 GB and 1.75 GB of RAM. There are no further log entries after the command exited. Restarting the node works normally. No fatal errors in the last million lines of the log that I could find.
Do you have any Unrecoverable/FATAL errors or "shutdown" log lines before that happened?
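For example, something like this against the node's log should surface them (the log path is just a placeholder, adjust it to your setup):

grep -E "Unrecoverable|FATAL|shutdown" /path/to/storagenode.log | tail -n 20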
I'm sure that the root cause is not the canceled upload; it is likely that upload cancellations and other failures are a result of the process being terminated.
However, lazy filewalkers may remain, and this is likely a bug.
I checked for "fatal" in the last million lines when it happened, but didn't check for "shutdown" or "unrecoverable". I'll check for those if it happens the same way again. But basically it looked like it entered some kind of perpetual error state that it wouldn't get out of for a long time. That node had been quiet for a long time; this only started after I updated to v1.108, unless it's a coincidence.
Got a non-writable storage error again; the storagenode exited, and a filewalker or something stayed running again after the main process exited.
Log content:
2024-08-04T10:32:30+02:00 INFO lazyfilewalker.gc-filewalker lazyfilewalker/process.go:81 subprocess exited with status {"Process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "status": -1, "error": "signal: killed"}
2024-08-04T10:32:30+02:00 ERROR pieces pieces/store.go:569 lazyfilewalker failed {"Process": "storagenode", "error": "lazyfilewalker: signal: killed", "errorVerbose": "lazyfilewalker: signal: killed\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:85\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkSatellitePiecesToTrash:160\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePiecesToTrash:565\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces:379\n\tstorj.io/storj/storagenode/retain.(*Service).Run.func2:265\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
storj.io/storj/storagenode/pieces.(*Store).WalkSatellitePiecesToTrash
/go/src/storj.io/storj/storagenode/pieces/store.go:569
storj.io/storj/storagenode/retain.(*Service).retainPieces
/go/src/storj.io/storj/storagenode/retain/retain.go:379
storj.io/storj/storagenode/retain.(*Service).Run.func2
/go/src/storj.io/storj/storagenode/retain/retain.go:265
golang.org/x/sync/errgroup.(*Group).Go.func1
/go/pkg/mod/golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78
2024-08-04T10:32:30+02:00 ERROR filewalker pieces/filewalker.go:156 failed to get progress from database {"Process": "storagenode"}
storj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePiecesToTrash
/go/src/storj.io/storj/storagenode/pieces/filewalker.go:156
storj.io/storj/storagenode/pieces.(*Store).WalkSatellitePiecesToTrash
/go/src/storj.io/storj/storagenode/pieces/store.go:572
storj.io/storj/storagenode/retain.(*Service).retainPieces
/go/src/storj.io/storj/storagenode/retain/retain.go:379
storj.io/storj/storagenode/retain.(*Service).Run.func2
/go/src/storj.io/storj/storagenode/retain/retain.go:265
golang.org/x/sync/errgroup.(*Group).Go.func1
/go/pkg/mod/golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78
2024-08-04T10:32:30+02:00 DEBUG db filestore/dir.go:877 number of subdirs {"Process": "storagenode", "count": 1024}
2024-08-04T10:32:30+02:00 ERROR retain retain/retain.go:267 retain pieces failed {"Process": "storagenode", "cachePath": "/Storj/Storj-v3-data/retain", "error": "retain: filewalker: context canceled", "errorVerbose": "retain: filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePiecesToTrash:181\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePiecesToTrash:572\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces:379\n\tstorj.io/storj/storagenode/retain.(*Service).Run.func2:265\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
storj.io/storj/storagenode/retain.(*Service).Run.func2
/go/src/storj.io/storj/storagenode/retain/retain.go:267
golang.org/x/sync/errgroup.(*Group).Go.func1
/go/pkg/mod/golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78
2024-08-04T10:32:31+02:00 ERROR storagenode/cmd_run.go:50 failure during run {"Process": "storagenode", "error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:175\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:164\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
main.cmdRun.func1
/go/src/storj.io/storj/cmd/storagenode/cmd_run.go:50
main.cmdRun
/go/src/storj.io/storj/cmd/storagenode/cmd_run.go:128
main.newRunCmd.func1
/go/src/storj.io/storj/cmd/storagenode/cmd_run.go:33
storj.io/common/process.cleanup.func1.4
/go/pkg/mod/storj.io/common@v0.0.0-20240711155019-3bfa00555ba3/process/exec_conf.go:392
storj.io/common/process.cleanup.func1
/go/pkg/mod/storj.io/common@v0.0.0-20240711155019-3bfa00555ba3/process/exec_conf.go:410
github.com/spf13/cobra.(*Command).execute
/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:983
github.com/spf13/cobra.(*Command).ExecuteC
/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115
github.com/spf13/cobra.(*Command).Execute
/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039
storj.io/common/process.ExecWithCustomOptions
/go/pkg/mod/storj.io/common@v0.0.0-20240711155019-3bfa00555ba3/process/exec_conf.go:112
main.main
/go/src/storj.io/storj/cmd/storagenode/main.go:34
runtime.main
/usr/local/go/src/runtime/proc.go:271
2024-08-04T10:32:31+02:00 FATAL process/exec_conf.go:428 Unrecoverable error {"Process": "storagenode", "error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:175\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:164\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
storj.io/common/process.cleanup.func1
/go/pkg/mod/storj.io/common@v0.0.0-20240711155019-3bfa00555ba3/process/exec_conf.go:428
github.com/spf13/cobra.(*Command).execute
/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:983
github.com/spf13/cobra.(*Command).ExecuteC
/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115
github.com/spf13/cobra.(*Command).Execute
/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039
storj.io/common/process.ExecWithCustomOptions
/go/pkg/mod/storj.io/common@v0.0.0-20240711155019-3bfa00555ba3/process/exec_conf.go:112
main.main
/go/src/storj.io/storj/cmd/storagenode/main.go:34
runtime.main
/usr/local/go/src/runtime/proc.go:271
The node keeps failing before finishing the initial filewalk, and the disk has been at 90%-100% continuously for 2-3 days; maybe some kind of hiccup of either the storagenode or the storage under long sustained load. It has overshot the assigned storage by nearly 5 TiB, so I have now reduced the assigned storage space to stop uploads. I don't know how it overshot the allotted space by so much.
What a beautiful error. Do you know why the OS killed it? Seems like OOM?
In that specific case I would suggest disabling the lazy mode and restarting the node. I would hope that would allow the filewalker to finish the scan.
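If I recall correctly, the relevant option is pieces.enable-lazy-filewalker; in config.yaml that would be the line below, followed by a restart of the node:

pieces.enable-lazy-filewalker: false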
And it would be very interesting to see what OS, how much RAM, and how many nodes are involved here.
That's an easy one… Since the filewalkers cannot finish the scan, the databases are not updated with the actual usage, so the node reports to the satellites that it still has free space (when it does not).
It will stop uploads only when there is less than 5 GB left, either in the allocation or on the disk…
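For reference, the allocation itself is the storage.allocated-disk-space setting in config.yaml (the value below is just an example); the node keeps accepting uploads until the free space within that allocation, or on the disk itself, drops below 5 GB:

storage.allocated-disk-space: 20.00 TB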
The OS is FreeBSD, 32 GB RAM, 1 node. I don't know why it got killed; is there a place to check for that? I disabled the lazy mode filewalker, will see if that makes it finish quickly. Didn't realize lazy mode was the default.
I understand the used space is wrong because the file scan didn't finish, but I don't know how it got to that state; it used to be within 1-2 TB… I think things shifted during that heavy ingress, I wasn't paying that much attention.
This usually happens if the disk subsystem is slow for some reason.
However, I haven't run FreeBSD for a long time, so I cannot remember where to search for the killed-process messages… They should be in the system logs, but I do not remember where they are on FreeBSD, sorry…
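If it was the kernel's out-of-memory handling, FreeBSD usually logs a "was killed: out of swap space" (or "failed to reclaim memory") style message; something like the commands below should find it on a default install (paths may differ on your system):

dmesg | grep -i killed
grep -i killed /var/log/messages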
I would like to summon @arrogantrabbit, who still uses FreeBSD to this day.
I don't know if this is related to your issues, but I have noticed that after updating to versions 1.108.3 and 1.109.2 my node seems to be crashing randomly. I am getting "verifying writability" errors, and I see in dmesg that the system is getting hung_task errors. I increased the verify-dir-writable-timeout in my node's config.yaml, and that seems to have stopped the crashes. The node is an 8-core, 12 GB RAM Debian VM, using an mdadm array for blob storage and an SSD for the databases. The mdadm array reports no errors. The drives are only two years old and were fully tested before being put into operation. Maybe a drive in my array is going bad and this is just a coincidence; I also updated Docker around this point in time. On a side note, it would be nice if we had an option to limit successful used-space filewalker runs to at most once a week; my node updated about 3 times in a week and ran all the filewalkers again each time, which is not good for the health of the drives.
Storj Node Errors:
root@Storj:# docker logs storagenode 2>&1 | grep -E "verifying writability"
2024-08-01T10:32:22Z ERROR services unexpected shutdown of a runner {"Process": "storagenode", "name": "piecestore:monitor", "error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:175\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:164\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-08-01T10:32:29Z ERROR failure during run {"Process": "storagenode", "error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:175\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:164\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
Error: piecestore monitor: timed out after 1m0s while verifying writability of storage directory
2024-08-01T14:38:51Z ERROR services unexpected shutdown of a runner {"Process": "storagenode", "name": "piecestore:monitor", "error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:175\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:164\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-08-01T14:39:03Z ERROR failure during run {"Process": "storagenode", "error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:175\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:164\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
Error: piecestore monitor: timed out after 1m0s while verifying writability of storage directory
Dmesg Errors:
[101378.039043] INFO: task jbd2/sdb-8:464 blocked for more than 120 seconds.
[101378.039128] Not tainted 5.10.0-30-amd64 #1 Debian 5.10.218-1
[101378.039155] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[101378.039199] task:jbd2/sdb-8 state:D stack: 0 pid: 464 ppid: 2 flags:0x00004000
[101378.039225] Call Trace:
[101378.039300] __schedule+0x282/0x870
[101378.039340] ? bit_wait+0x60/0x60
[101378.039344] schedule+0x46/0xb0
[101378.039349] io_schedule+0x42/0x70
[101378.039353] bit_wait_io+0xd/0x60
[101378.039357] __wait_on_bit+0x2a/0x90
[101378.039362] out_of_line_wait_on_bit+0x92/0xb0
[101378.039371] ? var_wake_function+0x30/0x30
[101378.039452] jbd2_journal_commit_transaction+0x16ac/0x1ad0 [jbd2]
[101378.039482] kjournald2+0xab/0x270 [jbd2]
[101378.039494] ? add_wait_queue_exclusive+0x70/0x70
[101378.039503] ? load_superblock.part.0+0xb0/0xb0 [jbd2]
[101378.039509] kthread+0x11b/0x140
[101378.039514] ? __kthread_bind_mask+0x60/0x60
[101378.039520] ret_from_fork+0x22/0x30
[101619.700143] INFO: task jbd2/sdb-8:464 blocked for more than 120 seconds.
[101619.700222] Not tainted 5.10.0-30-amd64 #1 Debian 5.10.218-1
[101619.700249] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[101619.700294] task:jbd2/sdb-8 state:D stack: 0 pid: 464 ppid: 2 flags:0x00004000
[101619.700319] Call Trace:
[101619.700382] __schedule+0x282/0x870
[101619.700660] ? bit_wait+0x60/0x60
[101619.700670] schedule+0x46/0xb0
[101619.700675] io_schedule+0x42/0x70
[101619.700679] bit_wait_io+0xd/0x60
[101619.700683] __wait_on_bit+0x2a/0x90
[101619.700688] out_of_line_wait_on_bit+0x92/0xb0
[101619.700697] ? var_wake_function+0x30/0x30
[101619.700786] jbd2_journal_commit_transaction+0x16ac/0x1ad0 [jbd2]
[101619.700817] kjournald2+0xab/0x270 [jbd2]
[101619.700837] ? add_wait_queue_exclusive+0x70/0x70
[101619.700857] ? load_superblock.part.0+0xb0/0xb0 [jbd2]
[101619.700873] kthread+0x11b/0x140
[101619.700881] ? __kthread_bind_mask+0x60/0x60
[101619.700892] ret_from_fork+0x22/0x30
[101718.040308] docker0: port 1(veth88981f3) entered disabled state
[101718.040452] veth23f48ce: renamed from eth0
[101718.099849] docker0: port 1(veth88981f3) entered disabled state
[101718.100834] device veth88981f3 left promiscuous mode
[101718.100846] docker0: port 1(veth88981f3) entered disabled state
[101718.224335] docker0: port 1(vetha26b4df) entered blocking state
[101718.224342] docker0: port 1(vetha26b4df) entered disabled state
[101718.224485] device vetha26b4df entered promiscuous mode
[101718.224701] docker0: port 1(vetha26b4df) entered blocking state
[101718.224706] docker0: port 1(vetha26b4df) entered forwarding state
[101719.004446] eth0: renamed from veth1f28bd7
[101719.042750] IPv6: ADDRCONF(NETDEV_CHANGE): vetha26b4df: link becomes ready
[104157.148546] INFO: task jbd2/sdb-8:464 blocked for more than 120 seconds.
[104157.148632] Not tainted 5.10.0-30-amd64 #1 Debian 5.10.218-1
[104157.148664] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[104157.148724] task:jbd2/sdb-8 state:D stack: 0 pid: 464 ppid: 2 flags:0x00004000
[104157.148763] Call Trace:
[104157.148880] __schedule+0x282/0x870
[104157.148995] ? bit_wait+0x60/0x60
[104157.149003] schedule+0x46/0xb0
[104157.149010] io_schedule+0x42/0x70
[104157.149015] bit_wait_io+0xd/0x60
[104157.149022] __wait_on_bit+0x2a/0x90
[104157.149028] out_of_line_wait_on_bit+0x92/0xb0
[104157.149039] ? var_wake_function+0x30/0x30
[104157.149172] jbd2_journal_commit_transaction+0x16ac/0x1ad0 [jbd2]
[104157.149222] kjournald2+0xab/0x270 [jbd2]
[104157.149245] ? add_wait_queue_exclusive+0x70/0x70
[104157.149258] ? load_superblock.part.0+0xb0/0xb0 [jbd2]
[104157.149266] kthread+0x11b/0x140
[104157.149274] ? __kthread_bind_mask+0x60/0x60
[104157.149284] ret_from_fork+0x22/0x30
Relevant Changes To config.yaml:
# Leave the intervals as-is unless you increase a timeout beyond the corresponding interval.
#storage2.monitor.verify-dir-readable-timeout: 1m30s
#storage2.monitor.verify-dir-readable-interval: 1m30s
storage2.monitor.verify-dir-writable-timeout: 1m30s
#storage2.monitor.verify-dir-writable-interval: 5m0s
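Note that changes to config.yaml only take effect after the node is restarted, e.g.:

docker restart storagenode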
If the ext4 journaling process (jbd2) can't talk to the drive (sdb) for 2 minutes at a time, either the drive has physical problems like bad blocks it's trying to reallocate, or there are cabling problems (but those typically come with "bus reset"-type errors in dmesg too). So the errors you're seeing in Storj are just a side effect.
(I guess it could be something in between, like a SATA controller or SAS HBA causing the hang, but that seems less likely. If it is, it could be those components running too hot and getting flaky; a fan pointed in the right place could help.)
At a minimum I'd unmount the partition and "fsck -y" it; I bet you'll see it repair errors. If it keeps happening (or gets worse), then use something like ddrescue to copy the entirety of the problem drive onto a different one.
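A rough sketch of that check, assuming the filesystem lives on /dev/md0 and is mounted at /mnt/storj (substitute your actual device and mount point; for an mdadm array, run fsck against the md device rather than a member disk like sdb), with the node stopped first:

docker stop storagenode
umount /mnt/storj
fsck -y /dev/md0
mount /mnt/storj
docker start storagenode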
32 GB should be sufficient to handle any load we have seen so far, provided you have made all the recommended configuration changes (sync off, atime off, use L2ARC or add a special device for metadata, etc.). One of my nodes is still running on a Supermicro X9 system with 32 GB of RAM and a single 4-disk raidz1 plus a special device mirror; the HDDs have been 60% busy at peak load. I have never seen any process getting killed.
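For reference, on ZFS those recommendations look roughly like the commands below (the pool/dataset and device names are placeholders, adjust them to your layout):

zfs set sync=disabled tank/storagenode
zfs set atime=off tank/storagenode
zpool add tank cache nvme0n1              # optional L2ARC device
zpool add tank special mirror ada4 ada5   # optional special vdev for metadata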
Maybe one possibility is that you restarted the server and the process was killed as part of the shutdown? This will depend on how you are running the node. In a jail? With daemon? Do you handle SIGTERM? Do you give it enough time to exit before sending SIGKILL?
Sync is off, atime is off; I don't have an L2ARC or a metadata vdev, but that is a luxury. I have also not seen 100% usage during peaks. The server has 55 days of uptime, with no changes in this time except updating to the 1.108 and then 1.109 releases.
I can try increasing the timeout as @TWH suggested and see if that works around the issue; I feel it might. But currently the filewalker is running, and I'm going to let it run (unless it exits again) to see if that fixes my storage stats, as they are wildly wrong at the moment. Then I'll try this workaround.
This node is slower than competitors.
The usage depends on the customers; their activity is not predictable.
Tests are currently stopped, according to the graphs for the Saltlake satellite on my nodes. They didn't have much data from it though; most of the usage is from US1 customers anyway.