Fatal Error on my Node / Timeout after 1 min

But when the check is not successful, does the node software shut the node down? Or what does it do? Otherwise the check would not make sense.

Funny thing: I am getting this error on several nodes after updating to 1.74.x.

2023-04-06T15:28:58.565+0300 FATAL Unrecoverable error {"error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:150\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:146\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
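
For reference, the check behind this error amounts to roughly the following. This is only an illustrative sketch, not the actual storagenode code; the probe file name, the directory path, and the hard-coded timeout are made up for the example.

// Minimal sketch of a directory-writability check with a timeout (illustrative only).
package main

import (
	"context"
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

func verifyWritable(ctx context.Context, dir string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	done := make(chan error, 1)
	go func() {
		// Write and remove a small probe file; a healthy disk finishes this in
		// milliseconds, a stalled one may block far longer than the timeout.
		probe := filepath.Join(dir, ".writability-probe") // hypothetical file name
		err := os.WriteFile(probe, []byte("ok"), 0o644)
		if err == nil {
			err = os.Remove(probe)
		}
		done <- err
	}()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return errors.New("timed out while verifying writability of storage directory")
	}
}

func main() {
	// Hypothetical path and a 1-minute timeout, matching the default in the log above.
	if err := verifyWritable(context.Background(), "/mnt/storagenode", time.Minute); err != nil {
		fmt.Println("FATAL:", err) // the real node treats this as unrecoverable and shuts down
	}
}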

At the same time, I have reconfigured the config.yaml this way:

storage2.monitor.verify-dir-readable-interval: 5m0s

storage2.monitor.verify-dir-writable-interval: 10m0s

so it seems to me that the configuration option is ignored and the default value is used
@Alexey

Same here. I changed storage2.monitor.verify-dir-(readable and writable) and still receive errors in the logs with the default setting (1m0s).
Windows 11 node.

Same here. I changed the verify-dir settings but errors are still showing. I have a good setup: i7-9700K, a dedicated HDD with 600 GB of free space, and 32 GB of RAM. The change to the verify settings brought the HDD down from 100% usage, but some errors are still showing.

I apologise for stating the obvious, but did you restart the node after config change?


Sure! And by the way, after it stops on the fatal error, it gets restarted anyway ^)))

Since not much has changed (the service auto-restarts after a fatal error roughly every couple of hours), here is the status of my node:

Going slightly over 6 of 10 TB used, but then it somehow drops back to 6 TB used.
Trash around 183 GB.
Egress and ingress normal, ~90 GB/day.

Suspension sometimes drops to 95%, not on all satellites.

Online score 98.4% at the lowest, slowly declining by ~0.1 per 10 days.

Audit still 100%.

Hope it makes it to the end of the month, so the online score goes up a lot.

UptimeRobot noticed occasional long response times before too, but that could be anything, IMO.

Hello, my node also had the error mentioned above; the service was terminated every 6-8 hours. After I had ruled out everything else (hard drive, internet line, router, etc.),
I replaced my hardware (mainboard), which was planned anyway: from a mini-ITX SoC with a dual-core Celeron and 8 GB RAM to a mini-ITX SoC with a quad-core Pentium and 32 GB RAM. I simply plugged in the SSD with Windows 10 and the Storj HDD, and lo and behold, the node is running again without crashes. Unfortunately, I don’t know what the direct cause is or was, but maybe this will help to find a solution.

storage2.monitor.verify-dir-readable-timeout: 1m30s or storage2.monitor.verify-dir-writable-timeout: 1m30s works for the fatal timeout error, but ERROR piecestore download failed is still showing.

Maybe the fatal error comes back after your node fills up. How full was it again?

I have now disabled the Windows Search indexer completely
(not just excluded the drive from indexing file contents, as suggested in the best-practice Windows setup).
Maybe it gets the drive into trouble indexing those tens of thousands of pieces.
No difference.

It shuts down the node:

@vladro @agente @Taconode
verify interval != verify timeout
see
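To be explicit, the -interval keys only control how often the checks run; the -timeout keys control how long a single check may take before it fails. For example, in config.yaml (the values are just examples, and the node must be restarted after editing):

storage2.monitor.verify-dir-readable-interval: 5m0s
storage2.monitor.verify-dir-readable-timeout: 1m30s
storage2.monitor.verify-dir-writable-interval: 10m0s
storage2.monitor.verify-dir-writable-timeout: 1m30s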

Hello @tditte,
Welcome to the forum!

Usually it happens as a result of long-tail cancellation:

  • when the customer wants to upload a file, their client requests 110 nodes from the satellite and starts uploading in parallel; as soon as 80 uploads have finished, it cancels all remaining uploads. You will see an error like piece upload failed
  • when the customer wants to download a file, their client requests 39 nodes containing the required pieces and starts downloading in parallel; as soon as it has downloaded 29 (only 29 pieces are needed to reconstruct the file), it cancels all remaining downloads. You will see an error like piece download failed

So, in other words, your node loses the race to other nodes (see the sketch below).
However, your node cannot win every race, because it cannot be close to all customers in the world.
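
Here is a minimal sketch of that long-tail cancellation pattern in Go. It is illustrative only and not the real uplink client; the transfer function and its random duration are made up, while the 110/80 numbers follow the upload example above.

// Start transfers to many nodes in parallel and cancel the rest once enough have finished.
package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"
)

func transfer(ctx context.Context, node int) error {
	// Simulated transfer with a random duration; a real client would stream a piece here.
	select {
	case <-time.After(time.Duration(rand.Intn(500)) * time.Millisecond):
		return nil
	case <-ctx.Done():
		return ctx.Err() // the slow nodes end up here: a canceled transfer
	}
}

func main() {
	const started, needed = 110, 80

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	results := make(chan error, started)
	for i := 0; i < started; i++ {
		go func(node int) { results <- transfer(ctx, node) }(i)
	}

	succeeded := 0
	for i := 0; i < started && succeeded < needed; i++ {
		if <-results == nil {
			succeeded++
		}
	}
	cancel() // the slowest nodes lose the race
	fmt.Println("finished with", succeeded, "successful transfers")
}

The slowest goroutines only ever see a canceled context, which is what shows up on the losing nodes as a canceled or failed transfer in the logs.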


The best explanation ever! Finally I understand how it works… but… if there are 80 pieces stored, why does the client ask for only 39? I understand that there are copies and it doesn’t need them all, but I don’t understand why it asks for just those instead of all of them and takes the fastest 29. How does it know which are the closest 39 of those 80?

Because it needs only any 29 pieces out of 80 to reconstruct a segment. This is our expansion factor for erasure codes, 80/29.
It asks for 39 (not 29) to get the required 29 faster.

They are not copies; each piece is unique, but any 29 of the 80 are enough to reconstruct the segment. That is the mathematical beauty of erasure codes.

It doesn’t. They are random, to allow any 39 of the 80 to participate; the fastest nodes for that client are then cached for a while. However, when the customer requests again, they will get another 10 random nodes on top of the cached 29.
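
To put rough numbers on the 80/29 scheme (the 64 MiB segment size below is only an assumption for illustration):

piece size = 64 MiB / 29 ≈ 2.2 MiB
total stored = 80 × 2.2 MiB ≈ 176 MiB
expansion factor = 80 / 29 ≈ 2.76

Any 29 of those 80 pieces are enough to rebuild the full segment.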

Hello Alexej, my hard drive is a Seagate 12 TB CMR disk and it has been the same for a year. Up until the update to version v1.75.2, everything ran smoothly without any errors. After I had the error, I just exchanged my mainboard for another one; the SSD with Windows and the HDD with the Storj node are the same as before. I have not changed any settings of the node, and the system has been running without errors since then. So I don’t understand the suggestion about the hard drive (increasing the write timeout); if that were the cause, the error would have to appear again.

Is it notable that there are two 1m timeout errors, one after the other?

First the piecestore:monitor crashes out of nowhere, and then the fatal error closes the node?

I get exactly this sequence. It is reproducible, and Docker also crashes beyond repair so that only a reboot helps.

ERROR	services	unexpected shutdown of a runner	{"Process": "storagenode", "name": "piecestore:monitor", "error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:150\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:146\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
WARN	servers	service takes long to shutdown	{"Process": "storagenode", "name": "server"}

Followed by a really long log entry. It’s much longer than what I can post so I just copy part of it:

INFO	servers	slow shutdown	{"Process": "storagenode", "stack": "goroutine 1074 [running]:\nstorj.io/storj/private/lifecycle.(*Group).logStackTrace.func1()\n\t/go/src/storj.io/storj/private/lifecycle/group.go:107 +0x85\nsync.(*Once).doSlow(0xc0002ec0c0?, 0xc00076e680?)\n\t/usr/local/go/src/sync/once.go:74 +0xc2\nsync.(*Once).Do(...)\n\t/usr/local/go/src/sync/once.go:65\nstorj.io/storj/private/lifecycle.(*Group).logStackTrace(0xc000a2eeb8?)\n\t/go/src/storj.io/storj/private/lifecycle/group.go:104 +0x45\nstorj.io/storj/private/lifecycle.(*Group).Run.func1({0x1488540?, 0xc000c1c120?})\n\t/go/src/storj.io/storj/private/lifecycle/group.go:77 +0x2bb\nruntime/pprof.Do({0x1488968?, 0xc000626820?}, {{0xc00024b8a0?, 0xc00075db00?, 0x0?}}, 0xc000492a40)\n\t/usr/local/go/src/runtime/pprof/runtime.go:40 +0xa3\ncreated by storj.io/storj/private/lifecycle.(*Group).Run\n\t/go/src/storj.io/storj/private/lifecycle/group.go:64 +0x612\n\ngoroutine 1 [semacquire, 946 minutes]:\nsync.runtime_Semacquire(0x0?)\n\t/usr/local/go/src/runtime/sema.go:62 +0x25\nsync.(*WaitGroup).Wait(0xc00027c608?)\n\t/usr/local/go/src/sync/waitgroup.go:139 +0x52\ngolang.org/x/sync/errgroup.(*Group).Wait(0xc000492880)\n\t/go/pkg/mod/golang.org/x/sync@v0.1.0/errgroup/errgroup.go:53 +0x27\nstorj.io/storj/storagenode.(*Peer).Run(0xc000004780, {0x1488498, 0xc0001e4700})\n\t/go/src/storj.io/storj/storagenode/peer.go:895 +0x47e\nmain.cmdRun(0xfb5fa0?, {0xc0001d0870?, 0x1b32180?, 0xc0000127b0?})\n\t/go/src/storj.io/storj/cmd/storagenode/main.go:251 +0xe25\nstorj.io/private/process.cleanup.func1.4({0x1488968?, 0xc000344000})\n\t/go/pkg/mod/storj.io/private@v0.0.0-20230123202745-d3e63b336444/process/exec_conf.go:377 +0x14d\nstorj.io/private/process.cleanup.func1(0x1b32180, {0xc0001e1560, 0x0, 0x11})\n\t/go/pkg/mod/storj.io/private@v0.0.0-20230123202745-d3e63b336444/process/exec_conf.go:395 +0x1bfa\ngithub.com/spf13/cobra.(*Command).execute(0x1b32180, {0xc0001e1440, 0x11, 0x12})\n\t/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:852 +0x67c\ngithub.com/spf13/cobra.(*Command).ExecuteC(0x1b31f00)\n\t/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:960 +0x39d\ngithub.com/spf13/cobra.(*Command).Execute(...)\n\t/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:897\nstorj.io/private/process.ExecWithCustomConfigAndLogger(0x1b31f00, 0x0?, 0x1b50a90?, 0xc0000061a0?)\n\t/go/pkg/mod/storj.io/private@v0.0.0-20230123202745-d3e63b336444/process/exec_conf.go:92 +0x191\nmain.main()\n\t/go/src/storj.io/storj/cmd/storagenode/main.go:478 +0x30\n\ngoroutine 38 [select]:\nstorj.io/monkit-jaeger.(*UDPCollector).Run(0xc0001e2300, {0x1488498, 0xc0001e4680})\n\t/go/pkg/mod/storj.io/monkit-jaeger@v0.0.0-20220915074555-d100d7589f41/udp.go:160 +0x7a8\nstorj.io/private/process.cleanup.func1.2()\n\t/go/pkg/mod/storj.io/private@v0.0.0-20230123202745-d3e63b336444/process/exec_conf.go:328 +0x25\ngolang.org/x/sync/errgroup.(*Group).Go.func1()\n\t/go/pkg/mod/golang.org/x/sync@v0.1.0/errgroup/errgroup.go:75 +0x64\ncreated by golang.org/x/sync/errgroup.(*Group).Go\n\t/go/pkg/mod/golang.org/x/sync@v0.1.0/errgroup/errgroup.go:72 +0xa5\n\ngoroutine 22 [syscall, 946 minutes]:\nos/signal.signal_recv()\n\t/usr/local/go/src/runtime/sigqueue.go:152 +0x2f\nos/signal.loop()\n\t/usr/local/go/src/os/signal/signal_unix.go:23 +0x19\ncreated by os/signal.Notify.func1.1\n\t/usr/local/go/src/os/signal/signal.go:151 +0x2a\n\ngoroutine 1617211 [sync.Cond.Wait, 4 minutes]:\nsync.runtime_notifyListWait(0xc00189abf8, 0x0)\n\t/usr/local/go/src/runtime/sema.go:517 
+0x14c\nsync.(*Cond).Wait(0xc0139dfc00?)\n\t/usr/local/go/src/sync/cond.go:70 +0x8c\nstorj.io/common/sync2.(*Throttle).ConsumeOrWait(0xc00189abe0, 0xe8000)\n\t/go/pkg/mod/storj.io/common@v0.0.0-20230224050842-81c8060f089b/sync2/throttle.go:50 +0x9a\nstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func6()\n\t/go/src/storj.io/storj/storagenode/piecestore/endpoint.go:708 +0x167\ngolang.org/x/sync/errgroup.

I don’t know what this is doing but after this the node is dead. And Docker is dead beyond repair.

As I said multiple times, the problem existed before, but it was hidden because the checkers did not have timeouts, so they just hung forever instead of stopping your node. In 1.75.2 we fixed this bug, and now the hardware problem has become visible.
And if replacing the motherboard solved the issue, then it only confirms that the problem was in the hardware.
A slow disk can have different causes:

  • disk itself;
  • OS settings and/or drivers;
  • the disk connection (for example, USB is known to be slow and sometimes seems to have latency under load);
  • poor architecture (for example, I/O requires CPU cycles because of a cheap controller);
  • low RAM and high usage, so almost nothing is left for the cache;
  • bugs in the CPU or motherboard (I have one that crashes under Linux and works solid under Windows);