Storj node stops after 3-4 hours

blanaru · June 10, 2024, 9:40pm

Hello!
I noticed that I cannot get rid of a weird error: one of my nodes simply stops the service after 3-4 hours. I reinstalled the app many times, and even reinstalled Windows.

I am adding the first log lines, but there are many more that would not fit here…

What can I do? I might lose the node, as its suspension score is already turning red…

Mitsos · June 10, 2024, 10:17pm

None of the lines you are showing are “error” lines.

Paste the log file within a blockquote (5th icon from the left)

like this

Showing the actual error lines.

Alexey · June 11, 2024, 2:49am

Reinstallation never helps with nodes.
Please search for FATAL errors in your logs:

Get-Content "$env:ProgramFiles/Storj/Storage Node/storagenode.log" | sls fatal | select -last 5

blanaru · June 11, 2024, 4:52am

I hope I pasted it correctly this time:

2024-06-10T05:28:27-07:00 FATAL Unrecoverable error {"error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:178\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:167\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-06-10T07:39:45-07:00 INFO piecestore upload started {"Piece ID": "FOOT4WZM46MWLVAH3KU7QL6XFATALHEVMMMNNJ4F2KS2QOIPB66Q", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "PUT", "Remote Address": "109.61.92.70:60750", "Available Space": 2922242642988}
2024-06-10T07:39:45-07:00 INFO piecestore uploaded {"Piece ID": "FOOT4WZM46MWLVAH3KU7QL6XFATALHEVMMMNNJ4F2KS2QOIPB66Q", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "PUT", "Remote Address": "109.61.92.70:60750", "Size": 249856}
2024-06-10T09:45:52-07:00 FATAL Unrecoverable error {"error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:178\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:167\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-06-10T14:47:09-07:00 FATAL Unrecoverable error {"error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:178\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:167\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
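
Editor's note: two of the five matched lines are INFO entries. They matched the case-insensitive search for "fatal" only because the piece ID happens to contain the letters FATAL. A minimal Python sketch of filtering on the level field instead is shown here; it assumes each log entry is a single line of the form "timestamp LEVEL message", and the function names log_level and fatal_entries are hypothetical helpers, not part of any Storj tooling.

```python
from typing import Iterable, List, Optional


def log_level(line: str) -> Optional[str]:
    """Return the second whitespace-separated field (the log level), if any."""
    parts = line.split(maxsplit=2)
    if len(parts) < 2:
        return None
    return parts[1]


def fatal_entries(lines: Iterable[str], last: int = 5) -> List[str]:
    """Keep only entries whose level field is exactly FATAL.

    Unlike a substring search, this ignores lines where the letters
    'FATAL' merely appear inside another field such as a piece ID.
    """
    matches = [ln for ln in lines if log_level(ln) == "FATAL"]
    # Return the most recent `last` matches; a falsy `last` returns all.
    return matches[-last:] if last else matches
```

To use it, read the log file line by line (for example with open(path, encoding="utf-8")) and pass the lines to fatal_entries.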

JWvdV · June 11, 2024, 5:05am

It seems you have either a very slow drive or filesystem issues that cause the drive to be remounted read-only. In either case, the writability check does not complete within 1 minute.
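
Editor's note: to get a feel for how slow writes to the drive are, one can time a small write by hand. The following is a minimal Python sketch, not the storagenode's own check: the function name check_writable is hypothetical, and the 60-second default only mirrors the "1m0s" in the log above. Unlike the real check, it does not abort a write that hangs; it only reports how long the write took once it finishes.

```python
import os
import tempfile
import time
from typing import Tuple


def check_writable(directory: str, timeout_s: float = 60.0) -> Tuple[float, bool]:
    """Write, flush, and delete a tiny file in `directory`.

    Returns (elapsed_seconds, finished_within_timeout).
    """
    start = time.monotonic()
    # mkstemp raises OSError if the directory is read-only or missing.
    fd, path = tempfile.mkstemp(dir=directory, prefix="write-test-")
    try:
        os.write(fd, b"ok")
        os.fsync(fd)  # force the data to the device, not just the OS cache
    finally:
        os.close(fd)
        os.remove(path)
    elapsed = time.monotonic() - start
    return elapsed, elapsed <= timeout_s
```

Calling check_writable on the storage directory a few times while the node is busy gives a rough idea of whether writes take seconds or close to a minute.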

So, first run chkdsk /F {partition}.

What drive are we talking about?

blanaru · June 11, 2024, 6:51am

I finished the scan with no errors, or at least that is the conclusion of the chkdsk output below. I am using a Seagate IronWolf 12 TB drive and it worked well until now.
One thing I need to check is that I use it as an external USB drive. If there are any other things to check, I will do so in the meantime and let you know.

PS C:\Users\Administrator> chkdsk /f d:
The type of the file system is NTFS.

Chkdsk cannot run because the volume is in use by another
process. Chkdsk may run if this volume is dismounted first.
ALL OPENED HANDLES TO THIS VOLUME WOULD THEN BE INVALID.
Would you like to force a dismount on this volume? (Y/N) y
Volume dismounted. All opened handles to this volume are now invalid.
Volume label is Pi1.

Stage 1: Examining basic file system structure …
41373184 file records processed.
File verification completed.
Phase duration (File record verification): 7.89 minutes.
47049 large file records processed.
Phase duration (Orphan file record recovery): 0.00 milliseconds.
0 bad file records processed.
Phase duration (Bad file record checking): 1.11 milliseconds.

Stage 2: Examining file name linkage …
10 reparse records processed.
41395702 index entries processed.
Index verification completed.
Phase duration (Index verification): 34.91 minutes.
0 unindexed files scanned.
Phase duration (Orphan reconnection): 19.60 seconds.
0 unindexed files recovered to lost and found.
Phase duration (Orphan recovery to lost and found): 1.58 milliseconds.
10 reparse records processed.
Phase duration (Reparse point and Object ID verification): 304.61 milliseconds.

Stage 3: Examining security descriptors …
Security descriptor verification completed.
Phase duration (Security descriptor verification): 79.90 milliseconds.
11259 data files processed.
Phase duration (Data attribute verification): 1.49 milliseconds.

Windows has scanned the file system and found no problems.
No further action is required.

11444205 MB total disk space.
7992448 MB in 41302620 files.
16354436 KB in 11261 indexes.
0 KB in bad sectors.
41802163 KB in use by the system.
65536 KB occupied by the log file.
3394964 MB available on disk.

  4096 bytes in each allocation unit.

2929716735 total allocation units on disk.
869110872 allocation units available on disk.
Total duration: 43.13 minutes (2588268 ms).

Alexey · June 11, 2024, 7:38am

You need to try to fix it, or increase the timeout for this check, see

blanaru · June 11, 2024, 8:09am

OK guys, I have come to the conclusion that the drive itself has a problem.
I checked with another drive, and with that one attached the Windows installer gets past its first screen. With the faulty drive attached it cannot, so I have to switch the drive off every time to continue the Windows installation (weird).

So right now I am starting the node again on my faulty drive; I will copy all the data to another drive and see if it works after that. This might take 2-3 days, as the drive is also running the node (I don’t want it disqualified for being offline too many days).

If you guys have any ideas that will improve this procedure, please tell me.

Thanks!

Alexey · June 12, 2024, 5:05am

There are several methods. The fastest one is to use a disk cloning utility while the node is stopped, then expand the partition and the filesystem on the new disk after the move is done. The next one is to use robocopy:

If you go with the second method, I suggest reducing the allocated space below the current usage (to stop further ingress), disabling the startup scan, and restarting the node, then following the guide above.
You may also need to adjust the readability check timeout and interval if you hit the FATAL error related to it.

blanaru · June 12, 2024, 6:50am

Hi Alexey
Thanks! I will try robocopy, but how does it behave when it hits bad logical sectors? Does it freeze or carry on? I am pretty sure that I have a faulty drive…
The node stops after 4 hours and I have to start it manually.

blanaru · June 12, 2024, 7:09am

Robocopy doesn’t work. It freezes after a few minutes.
Is there any disk cloning utility you recommend to get this copy done as fast as possible?

Alexey · June 12, 2024, 7:39am

It should retry; however, that’s not its primary function. For that you need to use chkdsk /f /R D: (where D: is the drive letter).

dd from any Linux distribution. I’m sure there are some paid Windows equivalents, but I haven’t used any. If you specifically want a GUI tool, you may try GParted from any Linux distribution as well, including a bootable USB stick of the popular Ubuntu distro.
But since you have a big disk, I’m not sure it’s possible to clone it, unless you have another 40TB disk… And I have a feeling that your RAID5 is a software one… So it works only under Windows, and you are almost out of options.

blanaru · June 12, 2024, 7:50am

Fortunately I don’t have RAID set up on the node. It is just a 12 TB drive attached externally via USB 2.0.
Just copying 41 million files would take more than 1000 hours, so this is definitely not going to work :))

So any disk cloning tool that supports 12 TB drives would be better, I guess.
What are your thoughts on this?
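
Editor's note: a rough back-of-the-envelope comparison of the two approaches, as a Python sketch. The file count and sizes come from the chkdsk output earlier in the thread (41302620 files, 7992448 MB in use, 11444205 MB total). The 30 MB/s sequential rate is only an assumed figure for a healthy disk on USB 2.0, not a measurement, and a failing disk will be slower. The function names are hypothetical.

```python
SECONDS_PER_HOUR = 3600


def files_per_second(files: int, hours: float) -> float:
    """Average file rate implied by copying `files` files in `hours` hours."""
    return files / (hours * SECONDS_PER_HOUR)


def mb_per_second(megabytes: float, hours: float) -> float:
    """Average throughput implied by moving `megabytes` in `hours` hours."""
    return megabytes / (hours * SECONDS_PER_HOUR)


def clone_hours(megabytes: float, assumed_mb_per_s: float) -> float:
    """Hours for a sequential whole-disk clone at an assumed constant rate."""
    return megabytes / assumed_mb_per_s / SECONDS_PER_HOUR


# Figures from the chkdsk output in this thread.
FILES = 41_302_620
USED_MB = 7_992_448
TOTAL_MB = 11_444_205

# The 1000-hour estimate for a file-by-file copy implies these averages.
implied_files_rate = files_per_second(FILES, 1000)
implied_throughput = mb_per_second(USED_MB, 1000)

# Assumption: 30 MB/s sustained sequential reads over USB 2.0.
clone_estimate = clone_hours(TOTAL_MB, 30.0)
```

Under these assumptions the file-by-file estimate corresponds to only a few MB/s, while a sequential clone of the whole disk would take on the order of days rather than weeks, which is why cloning was suggested as the fastest method.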

pangolin · June 12, 2024, 11:54am

Moving the databases to an SSD can change things a lot for systems with slow USB drives. I suggest trying this.

blanaru · June 12, 2024, 9:03pm

I am back to basics… unfortunately.
I tried a few disk cloning tools and got the same error: the disk is faulty and becomes unreadable at some point.
So now I have started again to simply copy the huge set of 41 million files that Storj has gathered so far… It will take ages, I guess… not to mention that every day brings the node closer to disqualification because of so many offline hours.

Any advice now would be very helpful, guys.

Mitsos · June 12, 2024, 9:18pm

Boot a live Linux distro that has ddrescue in it (or lets you install it).

If ddrescue can’t read that drive, then it needs to be sent to drive recovery (you don’t want to spend that much money, trust me). The more you use the damaged drive, the higher the chance it will give up. Always use a mapfile with ddrescue (it saves progress in case you restart).

blanaru · June 17, 2024, 8:16am

Guys, your advice is really helpful!
Luckily I found a method that works: the node is running with its allocation set below the real usage so that nothing new is written to the drive (thanks, Alexey, for this tip!), and now I just use Total Commander to copy ~44 million files… It will take maybe 2-3 weeks, but as long as the node is online and working, this is the safest (and maybe longest) way of getting there.

After I swap the drives, I will try a recovery, a low-level format, or anything else that can tell me the root cause of this drive’s behavior.

Any suggestions for a miracle tool that can find and maybe fix what is wrong with this faulty drive? I know that if it turns out to be a hardware issue, I will give up and throw it away like a frisbee.

Back in the day I used the MS-DOS tool MHDD, which was very good, but unfortunately it does not support large drives…

Alexey · June 17, 2024, 8:31am

I would still recommend using robocopy instead, because it copies only what has not been copied yet, and does so in several threads in parallel (the default for the /MT: option is 8).
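
Editor's note: the "copy only what is not copied yet" idea can be sketched in a few lines of Python. This is an illustration of the technique only, not how robocopy is implemented: it is single-threaded, it decides by file size and modification time, and the names needs_copy and incremental_copy are hypothetical.

```python
import os
import shutil


def needs_copy(src: str, dst: str) -> bool:
    """True if dst is missing, differs in size, or is older than src."""
    if not os.path.exists(dst):
        return True
    s, d = os.stat(src), os.stat(dst)
    return s.st_size != d.st_size or int(s.st_mtime) > int(d.st_mtime)


def incremental_copy(src_root: str, dst_root: str) -> int:
    """Copy the tree under src_root to dst_root, skipping up-to-date files.

    Returns the number of files copied, so a re-run after an interruption
    only pays for the files that are still missing or changed.
    """
    copied = 0
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            if needs_copy(src, dst):
                shutil.copy2(src, dst)  # copy2 also preserves timestamps
                copied += 1
    return copied
```

Because timestamps are preserved, running the same copy a second time touches only files that were added or changed in between, which is what makes repeated passes over a live node practical.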

You need to use Linux tools; they are usually free. And ddrescue is the best one. Yes, it’s not a GUI tool, but it will get the job done.

blanaru · June 17, 2024, 10:26am

Perfect! Thank you Alexey for everything!

Roxor · June 17, 2024, 1:06pm

Many HDD failures start slowly: you can fsck the drive a few times after errors, and you only really have to address things when the problems start to affect you too often. ddrescue is excellent at quickly extracting all the useful data it can from a dying drive. If the drive is not totally dead… you can probably get a disk image good enough to migrate to a new HDD. Great tool!