Fatal Error on my Node

2023-10-25T20:42:51+03:00	FATAL	Unrecoverable error	{"error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:176\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:165\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}

You need to stop the storagenode service, check your disk for errors and fix them, then run a defrag for this disk, then start the node.
If it still stops because of a timeout, you may try to increase the timeout by 30s, save the config, and restart the service.
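For example, something like this in your config.yaml (just a sketch; the option name is discussed further down in this thread, and the value is only the default 1m0s plus 30s):

  storage2.monitor.verify-dir-writable-timeout: 1m30s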

See


If my node is down for more than 2-3 days for these operations…? Can I get disqualified because of this?

No, you have 12 days. But after 4h of downtime you start to lose some data to the network.

BUT I would start the node after the disk check and set it to full (match the allocated space to the data it is already holding), then run defrag while the node is running, and then revert the setting to free the space again.
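As a rough sketch in config.yaml (assuming the usual allocation option; pick a value close to what the node already stores):

  # roughly match the data already held, so the node reports itself as full
  storage.allocated-disk-space: 2.00 TB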

My guess is it's the databases; check the logs (maybe they got bloated). Tell us about your node setup and config.

If Docker, remove the container too.
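A sketch of the usual cycle, assuming the container is named "storagenode":

  docker stop -t 300 storagenode
  docker rm storagenode
  docker run ...   # your usual full run command, with any updated parameters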


(Sorry, this is off the OP's topic…)

Unraid pool of retired drives. Running Storj in Docker. Static IP. Been up for over a year. DBs moved to SSD.

So, here's the thing… I noticed my node was rebooting quite frequently (and randomly, as I watched the logs). Poring through forums and troubleshooting, I saw some people with similar issues had a DB go bad. I figured I had a bad DB (it seems to happen during balances when I change a drive in the RAID array formatted with this btrfs FS… I'm changing the FS soon). Which normally isn't a massive issue, just annoying. (Sucks for record keeping, but this is a hobby, so meh.) I've done this before: back up the DB files, remove the offending DB, delete it, let it rebuild, put the remaining backed-up DBs back! EZ! But I was too comfortable doing this and, like a dummy, I didn't confirm the backup move in Krusader and just deleted all the DBs… so they all rebuilt.

So now it's still restarting all the time, but every time it restarts, it resets my "used space" amount…

When it starts, it will run normally for a while (sort of) with plenty of "download/upload started", but also quite a few "U/D canceled", and the occasional error like this one mixed in as well:
ERROR piecestore upload failed {"process": "storagenode", "Piece ID": "XL36YQWSUJL4J6OERSKMBGQMVOTHAQ77NKU5ZAPH3GZ65L2K6B7Q", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "error": "manager closed: unexpected EOF", "errorVerbose": "manager closed: unexpected EOF\n\tgithub.com/jtolio/noiseconn.(*Conn).readMsg:225\n\tgithub.com/jtolio/noiseconn.(*Conn).Read:171\n\tstorj.io/drpc/drpcwire.(*Reader).ReadPacketUsing:96\n\tstorj.io/drpc/drpcmanager.(*Manager).manageReader:226", "Size": 1179648, "Remote Address": "23.237.232.130:59988"}
I have noticed a flood of "INFO piecestore upload canceled (race lost or node shutdown)" logs only, then the log seems to lock up and storm (the log fills up with a mix of random gibberish and log-like phrases, all on one line, no spaces, for as far as the log will scroll!). Like this small excerpt:
e0, 0xc00092f070?)\n\t/go/pkg/mod/storj.io/drpc@v0.0.33/drpcctx/tracker.go:35 +0x2e\ncreated by storj.io/drpc/drpcctx.(*Tracker).Run in goroutine 1135\n\t/go/pkg/mod/storj.io/drpc@v0.0.33/drpcctx/tracker.go:30 +0x79\n\ngoroutine 6719 [IO wait, 1 minutes]:\ninternal/poll.runtime_pollWait(0x1478807c96b0, 0x72)\n\t/usr/local/go/src/runtime/netpoll.go:343 +0x85\ninternal/poll.(*pollDesc).wait(0xc000a1a380?, 0xc000a12b48?, 0x0)\n\t/usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x27\ninternal/poll.(*pollDesc).waitRead(...)\n\t/usr/local/go/src/internal/poll/fd_poll_runtime.go:89\ninternal/poll.(*FD).Read(0xc000a1a380, {0xc000a12b48, 0x4, 0x4})\n
And then this nugget: "Error: piecestore monitor: timed out after 1m0s while verifying writability of storage directory".
Then it sits there, frozen, for about 10 mins, then reboots the container, and the cycle starts over!

I have made sure all ownership (root) and full read/write privileges are proper in all the related folders/shares! I am lost at this point. I think I need to change that FS and get it stable first before I troubleshoot the rest. I just got a bunch of drives that I'm testing at the moment, and hopefully I'll be able to use them to do the FS change.

It's the timeout error, check Fatal Error on my Node - #72 by Alexey

Increase the timeouts in the docker run command, defragment the drives (moving all files will count too), and set the node to full while doing that.
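For example, extra parameters go after the image name in the docker run command (a sketch, assuming the standard storjlabs/storagenode image; values are only an illustration):

  docker run ... storjlabs/storagenode:latest \
    --storage2.monitor.verify-dir-writable-timeout=1m30s \
    --storage2.monitor.verify-dir-readable-timeout=1m30s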

Please search for fatal errors, not upload/download errors.

This is the culprit; you need to increase this timeout.

See


Your post is not relevant to the topic of this thread.


Alexey is moving them already


I am using a normal SAS 3TB disk. What is considered a normal time for such a drive? Isn't 1.5 minutes a bit too much to wait to write something? I don't have bad sectors on the disk, but it might be slow in some areas. Still, 1.5 minutes slow… very strange.

OK, one side question about that option. I want to understand what exactly it does. What will happen if I set storage2.monitor.verify-dir-writable-timeout to 5 minutes?

I think the program UltraDefrag showed 8% fragmentation after I ran Analyze on the disk. Isn't that a very low percentage? I mean, what do you consider normal fragmentation? My assumption was that everything below 10% is OK. Anyway, after the full CHKDSK I will run Analyze on the disk again to see the exact percentage.

Yes, but you probably have a RAID or BTRFS/ZFS or both.

It will not crash the node, even if your disk is unable to write a small file after 5 minutes. And thus it would not be able to detect problems with the disk (like a bad sector or a wedging disk), and your node would start to lose races for pieces much more often.
It's much worse if it also has a readability timeout: if you were forced to increase the readability timeout to more than 5 minutes, that is an indicator of something really bad with your drive, because if your node is unable to provide a piece for audit after 5 minutes, and does so 2 more times, this audit will be considered as failed.
BTW, if you are forced to increase the readability timeout, you also need to increase the readability interval accordingly, because it has a 1m0s check interval by default.
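As a sketch, if you did have to touch the readability side, it would look something like this in config.yaml (assuming the matching option names; values are only an illustration, with the interval kept at least as large as the timeout):

  storage2.monitor.verify-dir-readable-timeout: 1m30s
  storage2.monitor.verify-dir-readable-interval: 1m30s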

It's better to check how many files are fragmented. If it's more than a dozen thousand, it's worth performing it. And do not disable automatic defragmentation for this disk.
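On Windows you can also run an analysis without a third-party tool, for example (assuming the storage drive is X:; the verbose analysis report includes fragmentation statistics):

  defrag X: /A /V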

But here we are talking about writability timeouts, right?

storage2.monitor.verify-dir-writable-timeout

It's fucking high, it's 8% of millions of files.

Yes, in your case you have found only the writeability timeout in your logs so far. But I gave that info in case it also has readability timeouts (very often they come together, unfortunately, if the drive is slow).

No, they were definitely less than 1 million. About 250,000 I think. But let's wait for the chkdsk to complete and I will post some screenshots here.

Thank you for that! :+1: :slight_smile:

It's more about how fragmented they are than how many there are…

Thanks for the troubleshooting help. That doesn't really explain the "used space" calculation being reset every time, does it? I know the upload/download errors are a symptom of other things. It's really hard to watch the logs when it's crashing, because they auto-close (by Unraid design now… makes sense why, just annoying). I have to figure out how to stop that nonsense.

I have added the extra 30s to the write and read timeouts (I added those lines myself; they weren't there at all, just the verify read/write interval settings, which show 3m and 5m respectively, but those have the pound sign in front of them as well. I believe that means ignore?? I left them as they were)… now we wait… and I'll post the results below:
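For reference, this is roughly what I added (assuming I got the option names right; no pound sign in front, since a leading # means the line is ignored):

  storage2.monitor.verify-dir-writable-timeout: 1m30s
  storage2.monitor.verify-dir-readable-timeout: 1m30s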

…a little while later…

So after adding those 2 lines to the config.yaml (and rebooting the server just for kicks)… the node is still "storming", as I'm calling it, and restarting constantly. I've seen some lazy filewalker errors too, but I think that's just because partial files are being written as it crashes.

I also just installed some drives to temporarily hold the data while I rebuild this stupid pool… never using BTRFS again (I lie… it has some good points, it's just too cumbersome to navigate when there are better options. I'm also just learning non-micro$ platforms and FSs, after avoiding them since high school).