2023-10-25T20:42:51+03:00 FATAL Unrecoverable error {"error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:176\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:165\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
You need to stop the storagenode service, check your disk for errors and fix them, then run a defrag on this disk, then start the node.
If it still stops because of a timeout, you may try to increase the timeout by 30s, save the config, and restart the service.
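For reference, a minimal sketch of what that change could look like in config.yaml (the option name comes up later in this thread; the default is the 1m0s from the error above, so adding 30s gives 1m30s, but the exact value is up to you):

# config.yaml for the storage node
# how long the monitor waits for the writability check of the storage directory
# default is 1m0s; raised by 30s as suggested above
storage2.monitor.verify-dir-writable-timeout: 1m30s

Save the file and restart the storagenode service so the new value is picked up.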
No, you have 12 days. But after 4h offline you start to lose some data to the network.
BUT I would start the node after the disk check and set it to "full"
(match the allocated space to the data it already holds, then run the defrag while the node is running, and then reverse the setting so there is free space again.)
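As a sketch, "set it to full" could look like this in config.yaml (the value here is only an example; the idea is to set the allocation to roughly the data the node already holds, then raise it back once the defrag is done):

# config.yaml — example value only
# temporarily match the allocation to the data already stored, so the node
# stops accepting new uploads while the defrag runs
storage.allocated-disk-space: 2.50 TB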
Unraid pool of retired drives. Running Storj in Docker. Static IP. Been up for over a year. DB moved to SSD.
So, here's the thing… I noticed my node was rebooting quite frequently (and randomly, as I watched the logs). Poring through forums and troubleshooting, I saw some people with similar issues had a DB go bad. Figured I had a bad DB (it seems to happen during balances, when I change a drive in the raid array formatted with this btrfs FS… I'm changing the FS soon). Which normally isn't a massive issue, just annoying. (Sucks for record keeping, but this is a hobby, so meh.) I've done this before: back up the DB files, remove the offending DB, delete it, let it rebuild, put the remaining backed-up DBs back! EZ! But I was too comfortable doing this and, like a dummy, I didn't confirm the backup move part in Krusader, and just deleted all the DBs… so they all rebuilt.
So now, it's still restarting all the time, but every time it restarts, it resets my "used space" amount…
When it starts, it will run normally for a while (sort of), with plenty of "download/upload started" entries, but also quite a few "U/D canceled" and the occasional error like this,
ERROR piecestore upload failed {"process": "storagenode", "Piece ID": "XL36YQWSUJL4J6OERSKMBGQMVOTHAQ77NKU5ZAPH3GZ65L2K6B7Q", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "error": "manager closed: unexpected EOF", "errorVerbose": "manager closed: unexpected EOF\n\tgithub.com/jtolio/noiseconn.(*Conn).readMsg:225\n\tgithub.com/jtolio/noiseconn.(*Conn).Read:171\n\tstorj.io/drpc/drpcwire.(*Reader).ReadPacketUsing:96\n\tstorj.io/drpc/drpcmanager.(*Manager).manageReader:226", "Size": 1179648, "Remote Address": "23.237.232.130:59988"}
with errors like that mixed in as well.
I have noticed a flood of "INFO piecestore upload canceled (race lost or node shutdown)" log entries, then the log seems to lock up and "storm" (the log fills up with a mix of random gibberish and log-like phrases, but all on one line, no spaces, for as far as the log will scroll!). Like this small excerpt:
e0, 0xc00092f070?)\n\t/go/pkg/mod/storj.io/drpc@v0.0.33/drpcctx/tracker.go:35 +0x2e\ncreated by storj.io/drpc/drpcctx.(*Tracker).Run in goroutine 1135\n\t/go/pkg/mod/storj.io/drpc@v0.0.33/drpcctx/tracker.go:30 +0x79\n\ngoroutine 6719 [IO wait, 1 minutes]:\ninternal/poll.runtime_pollWait(0x1478807c96b0, 0x72)\n\t/usr/local/go/src/runtime/netpoll.go:343 +0x85\ninternal/poll.(*pollDesc).wait(0xc000a1a380?, 0xc000a12b48?, 0x0)\n\t/usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x27\ninternal/poll.(*pollDesc).waitRead(...)\n\t/usr/local/go/src/internal/poll/fd_poll_runtime.go:89\ninternal/poll.(*FD).Read(0xc000a1a380, {0xc000a12b48, 0x4, 0x4})
This nugget: "Error: piecestore monitor: timed out after 1m0s while verifying writability of storage directory".
Then it sits there, frozen, for about 10 mins then reboots the container, and the cycle starts over!
I have made sure all ownership (root) and full read/write privileges are proper in all the related folders/shares! I am lost at this point. I think I need to change that FS and get that stable first before I troubleshoot the rest. I just got a bunch of drives that I'm testing atm, and hopefully I'll be able to use them to do the FS change.
I am using a normal 3TB SAS disk. What is considered a normal time for such a drive? Isn't 1.5 minutes a bit too long to wait to write something? I don't have bad sectors on the disk, and it might be slow in some areas, but 1.5 minutes slow… very strange.
OK, one side question about that option. I want to understand what exactly it does. What will happen if I set storage2.monitor.verify-dir-writable-timeout to 5 minutes?
I think the program UltraDefrag showed 8% fragmentation after I ran Analyze on the disk. Isn't that a very low percentage? I mean, what do you consider normal fragmentation? My assumption was that everything below 10% is OK. Anyway, after the full CHKDSK I will run Analyze on the disk again to see the exact percentage.
Yes, but you probably have a RAID or BTRFS/ZFS or both.
It will not crash the node, even if your disk is unable to write a small file within 5 minutes. It thus would not be able to detect problems with the disk (like a bad sector or a wedged disk), and your node would start to lose races for pieces much more often.
Much worse if you also have readability timeouts: if you were forced to increase the readability timeout to more than 5 minutes, that is an indicator of something really bad with your drive, because if your node is unable to provide a piece for an audit within 5 minutes, and that happens 2 more times, the audit will be considered failed.
BTW, if you are forced to increase the readability timeout, you also need to increase the readability check interval to match, because it is 1m0s by default.
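As a sketch, assuming the matching readable options in config.yaml (the values below are only examples, not recommendations):

# readability check: how long a read of the test file may take before it fails
storage2.monitor.verify-dir-readable-timeout: 1m30s
# readability check: how often the check runs; it defaults to 1m0s,
# so raise it together with the timeout as described above
storage2.monitor.verify-dir-readable-interval: 1m30s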
It's better to check how many files are fragmented. If it's more than a dozen thousand, it is worth performing the defrag. And do not disable automatic defragmentation for this disk.
Yes, in your case you have found only the writability timeout in your logs so far. But I gave that info in case you also get readability timeouts (unfortunately they very often come together if the drive is slow).
Thanks for the troubleshooting help. That doesn't really explain the "used space" calculation being reset every time, does it? I know the upload/download errors are a symptom of other things. It's really hard to watch the logs when it's crashing, cuz they autoclose (by Unraid design now… makes sense why, just annoying)… I have to figure out how to stop that nonsense.
I have added the extra 30s to the write and read timeouts (I added those lines; they weren't there at all, just the verify read/write interval settings, which show 3m and 5m respectively, but those have the pound sign in front of them, which I believe means they're ignored?? I left them as they were)… now we wait… and I'll post the results below:
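For anyone following along: lines starting with a pound sign (#) in config.yaml are comments and are ignored, so only uncommented lines take effect. Roughly what the file looks like now (the values are illustrative, not exact):

# still commented out, so ignored by the node:
# storage2.monitor.verify-dir-readable-interval: 3m0s
# storage2.monitor.verify-dir-writable-interval: 5m0s
# the two lines I added, active because there is no leading '#':
storage2.monitor.verify-dir-readable-timeout: 1m30s
storage2.monitor.verify-dir-writable-timeout: 1m30s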
…a little while later…
So after adding those 2 lines to the config.yaml (and rebooting the server just for kicks)… the node is still "storming", as I'm calling it, and restarting constantly. I've seen some lazy filewalker errors too, but I think that's just cuz partial files are being written as it crashes.
I also just installed some drives to temporarily hold the data while I rebuild this stupid pool… never using BTRFS again (I lie… it has some good points, just too cumbersome to navigate when there are better options. I'm also just learning non-micro$ platforms and FSs, after avoiding them since high school.)