I just figured out that the storagenode process exited, but somehow it was still using storage with a usage pattern similar to the filewalker. Restarted all that, now we'll see if that was the issue. There should be no hardware issues on that node. There was no fatal error the first time, but it might've gotten buried by thousands of lines of trash output on stdout, so I might've missed it. The second time there was a fatal error due to non-writable storage, so I went to check the stats. Never seen this before, everything else is normal; might be some kind of hiccup after 1.5 months of uptime.
It's possible. If the lazy mode is enabled (it's enabled by default), almost all filewalkers are spawned as separate processes.
If you use docker, you may see all processes with this command:
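For example (assuming the container is named "storagenode"), this lists every process running inside the container, so any lingering lazy filewalker subprocesses would show up alongside the main one:

docker top storagenode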
Sadly I didn't check after that specific exit. It happened again; something is causing this issue, and it wasn't happening before. I've updated the node to 1.109.2, will see if it manifests again…
Nope, the node exited again with thousands of lines of output in stdout, and the last log excerpt has a long list of "error sending hash and order limit" and "download failed" entries with "write: broken pipe" errors. The node has stayed running even though the command has exited. I have 2 storagenode processes still running, utilizing 60% of the disk and 1.3 GB and 1.75 GB of RAM. There are no further log entries after the command exited. Restarting the node works normally. No fatal errors in the last million lines of the log that I could find.
Do you have any Unrecoverable/FATAL errors or "shutdown" log lines before that happened?
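For example, something like this against the node's log should surface them (the log path is just a placeholder, adjust it to your setup):

grep -E "Unrecoverable|FATAL|shutdown" /path/to/storagenode.log | tail -n 20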
I'm sure that the root cause is not the canceled upload; it is likely that upload cancellations and other failures are a result of the process being terminated.
However, lazy filewalkers may remain, and this is likely a bug.
I checked for "fatal" in the last million lines when it happened, but didn't check for "shutdown" or "unrecoverable". I'll check for those if it happens the same way again. But basically it looked like it entered some kind of perpetual error state that it wouldn't get out of for a long time. That node had been quiet for a long time; this only started after I updated to v1.108, unless it's a coincidence.
Got a non-writable storage error again; the storagenode exited, and a filewalker or something stayed running again after the main process exited.
Log content:
2024-08-04T10:32:30+02:00 INFO lazyfilewalker.gc-filewalker lazyfilewalker/process.go:81 subprocess exited with status {"Process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "status": -1, "error": "signal: killed"}
2024-08-04T10:32:30+02:00 ERROR pieces pieces/store.go:569 lazyfilewalker failed {"Process": "storagenode", "error": "lazyfilewalker: signal: killed", "errorVerbose": "lazyfilewalker: signal: killed\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:85\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkSatellitePiecesToTrash:160\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePiecesToTrash:565\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces:379\n\tstorj.io/storj/storagenode/retain.(*Service).Run.func2:265\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
storj.io/storj/storagenode/pieces.(*Store).WalkSatellitePiecesToTrash
/go/src/storj.io/storj/storagenode/pieces/store.go:569
storj.io/storj/storagenode/retain.(*Service).retainPieces
/go/src/storj.io/storj/storagenode/retain/retain.go:379
storj.io/storj/storagenode/retain.(*Service).Run.func2
/go/src/storj.io/storj/storagenode/retain/retain.go:265
golang.org/x/sync/errgroup.(*Group).Go.func1
/go/pkg/mod/golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78
2024-08-04T10:32:30+02:00 ERROR filewalker pieces/filewalker.go:156 failed to get progress from database {"Process": "storagenode"}
storj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePiecesToTrash
/go/src/storj.io/storj/storagenode/pieces/filewalker.go:156
storj.io/storj/storagenode/pieces.(*Store).WalkSatellitePiecesToTrash
/go/src/storj.io/storj/storagenode/pieces/store.go:572
storj.io/storj/storagenode/retain.(*Service).retainPieces
/go/src/storj.io/storj/storagenode/retain/retain.go:379
storj.io/storj/storagenode/retain.(*Service).Run.func2
/go/src/storj.io/storj/storagenode/retain/retain.go:265
golang.org/x/sync/errgroup.(*Group).Go.func1
/go/pkg/mod/golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78
2024-08-04T10:32:30+02:00 DEBUG db filestore/dir.go:877 number of subdirs {"Process": "storagenode", "count": 1024}
2024-08-04T10:32:30+02:00 ERROR retain retain/retain.go:267 retain pieces failed {"Process": "storagenode", "cachePath": "/Storj/Storj-v3-data/retain", "error": "retain: filewalker: context canceled", "errorVerbose": "retain: filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePiecesToTrash:181\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePiecesToTrash:572\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces:379\n\tstorj.io/storj/storagenode/retain.(*Service).Run.func2:265\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
storj.io/storj/storagenode/retain.(*Service).Run.func2
/go/src/storj.io/storj/storagenode/retain/retain.go:267
golang.org/x/sync/errgroup.(*Group).Go.func1
/go/pkg/mod/golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78
2024-08-04T10:32:31+02:00 ERROR storagenode/cmd_run.go:50 failure during run {"Process": "storagenode", "error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:175\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:164\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
main.cmdRun.func1
/go/src/storj.io/storj/cmd/storagenode/cmd_run.go:50
main.cmdRun
/go/src/storj.io/storj/cmd/storagenode/cmd_run.go:128
main.newRunCmd.func1
/go/src/storj.io/storj/cmd/storagenode/cmd_run.go:33
storj.io/common/process.cleanup.func1.4
/go/pkg/mod/storj.io/common@v0.0.0-20240711155019-3bfa00555ba3/process/exec_conf.go:392
storj.io/common/process.cleanup.func1
/go/pkg/mod/storj.io/common@v0.0.0-20240711155019-3bfa00555ba3/process/exec_conf.go:410
github.com/spf13/cobra.(*Command).execute
/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:983
github.com/spf13/cobra.(*Command).ExecuteC
/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115
github.com/spf13/cobra.(*Command).Execute
/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039
storj.io/common/process.ExecWithCustomOptions
/go/pkg/mod/storj.io/common@v0.0.0-20240711155019-3bfa00555ba3/process/exec_conf.go:112
main.main
/go/src/storj.io/storj/cmd/storagenode/main.go:34
runtime.main
/usr/local/go/src/runtime/proc.go:271
2024-08-04T10:32:31+02:00 FATAL process/exec_conf.go:428 Unrecoverable error {"Process": "storagenode", "error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:175\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:164\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
storj.io/common/process.cleanup.func1
/go/pkg/mod/storj.io/common@v0.0.0-20240711155019-3bfa00555ba3/process/exec_conf.go:428
github.com/spf13/cobra.(*Command).execute
/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:983
github.com/spf13/cobra.(*Command).ExecuteC
/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115
github.com/spf13/cobra.(*Command).Execute
/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039
storj.io/common/process.ExecWithCustomOptions
/go/pkg/mod/storj.io/common@v0.0.0-20240711155019-3bfa00555ba3/process/exec_conf.go:112
main.main
/go/src/storj.io/storj/cmd/storagenode/main.go:34
runtime.main
/usr/local/go/src/runtime/proc.go:271
The node keeps failing before finishing the initial filewalk, and the disk has been at 90%-100% continuously for 2-3 days; maybe some kind of hiccup of either the storagenode or the storage under long sustained load. It has overshot the assigned storage by nearly 5 TiB, so I have now reduced the assigned storage space to stop uploads. I don't know how it overshot the allotted space by so much.
What a beautiful error. Do you know why the OS killed it? Seems like OOM?
In that specific case I would suggest disabling the lazy mode and restarting the node. I would hope that would allow the filewalker to finish the scan.
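If I recall correctly, the relevant option is pieces.enable-lazy-filewalker; in config.yaml that would be the line below, followed by a restart of the node:

pieces.enable-lazy-filewalker: false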
And it would be very interesting to see what OS, how much RAM, and how many nodes are involved here.
That's an easy one… Since the filewalkers cannot finish the scan, the databases are not updated with the actual usage, so the node reports to the satellites that it still has free space (when it does not).
It will stop uploads only when there is less than 5 GB left, either in the allocation or on the disk…
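For reference, the allocation itself is the storage.allocated-disk-space setting in config.yaml (the value below is just an example); the node keeps accepting uploads until the free space within that allocation, or on the disk itself, drops below 5 GB:

storage.allocated-disk-space: 20.00 TB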
The OS is FreeBSD, 32 GB RAM, 1 node. I don't know why it got killed; is there a place to check for that? I disabled the lazy mode filewalker, will see if that makes it finish quickly. Didn't realize lazy mode was the default.
I understand the used space is wrong because the file scan didn't finish, but I don't know how it got to that state; it used to be within 1-2 TB… I think things shifted during that heavy ingress, I wasn't paying that much attention.
This usually happens if the disk subsystem is slow for some reason.
However, I haven't run FreeBSD for a long time, so I cannot remember where to search for the killed-process messages… They should be in the system logs, but I do not remember where they are on FreeBSD, sorry…
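If it was the kernel's out-of-memory handling, FreeBSD usually logs a "was killed: out of swap space" (or "failed to reclaim memory") style message; something like the commands below should find it on a default install (paths may differ on your system):

dmesg | grep -i killed
grep -i killed /var/log/messages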
I would like to summon @arrogantrabbit, who still uses FreeBSD to this day.
I don't know if this is related to your issues, but I have noticed that after updating to versions 1.108.3 and 1.109.2 my node seems to be crashing randomly. I am getting "verifying writability" errors, and I see in dmesg that the system is getting hung_task errors. I increased the verify-dir-writable-timeout in my node's config.yaml, and that seems to have stopped the crashes. The node is an 8-core, 12 GB RAM Debian VM, using an mdadm array for blob storage and an SSD for the databases. The mdadm array reports no errors. The drives are only two years old and were fully tested before being put into operation. Maybe a drive in my array is going bad and this is just a coincidence; I also updated Docker around this point in time. On a side note, it would be nice if we had an option to limit successful used-space filewalker runs to at most once a week; my node updated about 3 times in a week and ran all the filewalkers again each time, which is not good for the health of the drives.
Storj Node Errors:
root@Storj:# docker logs storagenode 2>&1 | grep -E "verifying writability"
2024-08-01T10:32:22Z ERROR services unexpected shutdown of a runner {"Process": "storagenode", "name": "piecestore:monitor", "error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:175\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:164\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-08-01T10:32:29Z ERROR failure during run {"Process": "storagenode", "error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:175\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:164\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
Error: piecestore monitor: timed out after 1m0s while verifying writability of storage directory
2024-08-01T14:38:51Z ERROR services unexpected shutdown of a runner {"Process": "storagenode", "name": "piecestore:monitor", "error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:175\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:164\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-08-01T14:39:03Z ERROR failure during run {"Process": "storagenode", "error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:175\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:164\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
Error: piecestore monitor: timed out after 1m0s while verifying writability of storage directory
Dmesg Errors:
[101378.039043] INFO: task jbd2/sdb-8:464 blocked for more than 120 seconds.
[101378.039128] Not tainted 5.10.0-30-amd64 #1 Debian 5.10.218-1
[101378.039155] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[101378.039199] task:jbd2/sdb-8 state:D stack: 0 pid: 464 ppid: 2 flags:0x00004000
[101378.039225] Call Trace:
[101378.039300] __schedule+0x282/0x870
[101378.039340] ? bit_wait+0x60/0x60
[101378.039344] schedule+0x46/0xb0
[101378.039349] io_schedule+0x42/0x70
[101378.039353] bit_wait_io+0xd/0x60
[101378.039357] __wait_on_bit+0x2a/0x90
[101378.039362] out_of_line_wait_on_bit+0x92/0xb0
[101378.039371] ? var_wake_function+0x30/0x30
[101378.039452] jbd2_journal_commit_transaction+0x16ac/0x1ad0 [jbd2]
[101378.039482] kjournald2+0xab/0x270 [jbd2]
[101378.039494] ? add_wait_queue_exclusive+0x70/0x70
[101378.039503] ? load_superblock.part.0+0xb0/0xb0 [jbd2]
[101378.039509] kthread+0x11b/0x140
[101378.039514] ? __kthread_bind_mask+0x60/0x60
[101378.039520] ret_from_fork+0x22/0x30
[101619.700143] INFO: task jbd2/sdb-8:464 blocked for more than 120 seconds.
[101619.700222] Not tainted 5.10.0-30-amd64 #1 Debian 5.10.218-1
[101619.700249] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[101619.700294] task:jbd2/sdb-8 state:D stack: 0 pid: 464 ppid: 2 flags:0x00004000
[101619.700319] Call Trace:
[101619.700382] __schedule+0x282/0x870
[101619.700660] ? bit_wait+0x60/0x60
[101619.700670] schedule+0x46/0xb0
[101619.700675] io_schedule+0x42/0x70
[101619.700679] bit_wait_io+0xd/0x60
[101619.700683] __wait_on_bit+0x2a/0x90
[101619.700688] out_of_line_wait_on_bit+0x92/0xb0
[101619.700697] ? var_wake_function+0x30/0x30
[101619.700786] jbd2_journal_commit_transaction+0x16ac/0x1ad0 [jbd2]
[101619.700817] kjournald2+0xab/0x270 [jbd2]
[101619.700837] ? add_wait_queue_exclusive+0x70/0x70
[101619.700857] ? load_superblock.part.0+0xb0/0xb0 [jbd2]
[101619.700873] kthread+0x11b/0x140
[101619.700881] ? __kthread_bind_mask+0x60/0x60
[101619.700892] ret_from_fork+0x22/0x30
[101718.040308] docker0: port 1(veth88981f3) entered disabled state
[101718.040452] veth23f48ce: renamed from eth0
[101718.099849] docker0: port 1(veth88981f3) entered disabled state
[101718.100834] device veth88981f3 left promiscuous mode
[101718.100846] docker0: port 1(veth88981f3) entered disabled state
[101718.224335] docker0: port 1(vetha26b4df) entered blocking state
[101718.224342] docker0: port 1(vetha26b4df) entered disabled state
[101718.224485] device vetha26b4df entered promiscuous mode
[101718.224701] docker0: port 1(vetha26b4df) entered blocking state
[101718.224706] docker0: port 1(vetha26b4df) entered forwarding state
[101719.004446] eth0: renamed from veth1f28bd7
[101719.042750] IPv6: ADDRCONF(NETDEV_CHANGE): vetha26b4df: link becomes ready
[104157.148546] INFO: task jbd2/sdb-8:464 blocked for more than 120 seconds.
[104157.148632] Not tainted 5.10.0-30-amd64 #1 Debian 5.10.218-1
[104157.148664] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[104157.148724] task:jbd2/sdb-8 state:D stack: 0 pid: 464 ppid: 2 flags:0x00004000
[104157.148763] Call Trace:
[104157.148880] __schedule+0x282/0x870
[104157.148995] ? bit_wait+0x60/0x60
[104157.149003] schedule+0x46/0xb0
[104157.149010] io_schedule+0x42/0x70
[104157.149015] bit_wait_io+0xd/0x60
[104157.149022] __wait_on_bit+0x2a/0x90
[104157.149028] out_of_line_wait_on_bit+0x92/0xb0
[104157.149039] ? var_wake_function+0x30/0x30
[104157.149172] jbd2_journal_commit_transaction+0x16ac/0x1ad0 [jbd2]
[104157.149222] kjournald2+0xab/0x270 [jbd2]
[104157.149245] ? add_wait_queue_exclusive+0x70/0x70
[104157.149258] ? load_superblock.part.0+0xb0/0xb0 [jbd2]
[104157.149266] kthread+0x11b/0x140
[104157.149274] ? __kthread_bind_mask+0x60/0x60
[104157.149284] ret_from_fork+0x22/0x30
Relevant Changes To config.yaml:
# Leave the intervals as-is unless you increase a timeout beyond the corresponding interval.
#storage2.monitor.verify-dir-readable-timeout: 1m30s
#storage2.monitor.verify-dir-readable-interval: 1m30s
storage2.monitor.verify-dir-writable-timeout: 1m30s
#storage2.monitor.verify-dir-writable-interval: 5m0s
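Note that changes to config.yaml only take effect after the node is restarted, e.g.:

docker restart storagenode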
If the ext4 journaling process (jbd2) can't talk to the drive (sdb) for 2 minutes at a time, either the drive has physical problems like bad blocks it's trying to reallocate, or there are cabling problems (but those typically come with "bus reset"-type errors in dmesg too). So the errors you're seeing in Storj are just a side effect.
(I guess it could be something in between, like a SATA controller or SAS HBA causing the hang, but that seems less likely. If it is, it could be those components running too hot and getting flaky; a fan pointed in the right place could help.)
At a minimum I'd unmount the partition and "fsck -y" it; I bet you'll see it repair errors. If it keeps happening (or gets worse), then use something like ddrescue to copy the entirety of the problem drive onto a different one.
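A rough sketch of that check, assuming the filesystem lives on /dev/md0 and is mounted at /mnt/storj (substitute your actual device and mount point; for an mdadm array, run fsck against the md device rather than a member disk like sdb), with the node stopped first:

docker stop storagenode
umount /mnt/storj
fsck -y /dev/md0
mount /mnt/storj
docker start storagenode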
32 GB should be sufficient to handle any load we have seen so far, provided you have made all the recommended configuration changes (sync off, atime off, use L2ARC or add a special device for metadata, etc.). One of my nodes is still running on a Supermicro X9 system with 32 GB of RAM and a single 4-disk raidz1 plus a special device mirror; the HDDs have been 60% busy at peak load. I have never seen any process getting killed.
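For reference, on ZFS those recommendations look roughly like the commands below (the pool/dataset and device names are placeholders, adjust them to your layout):

zfs set sync=disabled tank/storagenode
zfs set atime=off tank/storagenode
zpool add tank cache nvme0n1              # optional L2ARC device
zpool add tank special mirror ada4 ada5   # optional special vdev for metadata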
Maybe one possibility is that you restarted the server and the process was killed as part of the shutdown? This will depend on how you are running the node. In a jail? With daemon? Do you handle SIGTERM? Do you give it enough time to exit before sending SIGKILL?
Sync is off, atime is off; I don't have an L2ARC or a metadata vdev, but that is a luxury. I have also not seen 100% usage during peaks. The server has 55 days of uptime, with no changes in this time except updating to the 1.108 and then 1.109 releases.
I can try increasing the timeout as @TWH suggested and see if that works around the issue; I feel it might. But currently the filewalker is running, and I'm going to let it run (unless it exits again) to see if that fixes my storage stats, as they are wildly wrong at the moment. Then I'll try this workaround.
This node is slower than competitors.
The usage depends on the customers; their activity is not predictable.
Tests are currently stopped, according to the graphs for the Saltlake satellite on my nodes. They didn't have much data from it though; most of the usage is from US1 customers anyway.