I’ve been running a node for about 3 years without any issues, and suddenly, it’s just stopped working. It goes down, I restart it, and almost immediately it goes down again.
Some info:
Using Windows 10, static IP, wired ethernet connection, version v1.105.4. Using a Western Digital 16TB HDD (bought brand new when I started the node). Port is still open.
Logs uploaded to my Google Drive here showing it failing several times.
I’m concerned that all the hard work of having it running consistently is about to be lost and I don’t know what to do!
I only make about $15 per month from this so it doesn’t make economic sense to spend hours debugging this!
Can anyone please help??
Here are some lines from the end of the log:
2024-06-23T21:32:00+01:00 INFO lazyfilewalker.used-space-filewalker starting subprocess {"satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2024-06-23T21:32:00+01:00 ERROR lazyfilewalker.used-space-filewalker failed to start subprocess {"satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "error": "context canceled"}
2024-06-23T21:32:00+01:00 ERROR pieces failed to lazywalk space used by satellite {"error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:707\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2024-06-23T21:32:00+01:00 ERROR piecestore:cache error getting current used space: {"error": "filewalker: context canceled; filewalker: context canceled; filewalker: context canceled; filewalker: context canceled; filewalker: context canceled; filewalker: context canceled", "errorVerbose": "group:\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-06-23T21:32:00+01:00 ERROR failure during run {"error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:175\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:164\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-06-23T21:32:00+01:00 FATAL Unrecoverable error {"error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:175\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:164\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
Essentially, your drive can’t keep up with the speed your storage node needs. I think this is a temporary thing, because they’re changing the way uploads are divided over nodes. For now, I would consider defragmenting the drive and changing the timeouts in the config for a while.
The fact that it’s a 16TB drive at least rules out very subpar hardware such as SMR. But this is part of the evolution Storj is going through at the moment.
Don’t be too afraid of being offline for some days. You need to be 12 days offline to get suspended and about a month to get disqualified.
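To be specific, the timeouts I mean are the ones behind the writability/readability checks in your FATAL error. In my config.yaml they appear roughly like this, commented out at their defaults (the option names are taken from my own node, so double-check them against your version before saving):
# how long to wait before a writability check of the storage directory is considered failed
# storage2.monitor.verify-dir-writable-timeout: 1m0s
# how long to wait before a readability check of the storage directory is considered failed
# storage2.monitor.verify-dir-readable-timeout: 1m0s
Uncomment them (remove the leading #), raise the values a little and restart the node for the change to take effect.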
Put another way, why don’t they make this the default?
Also, I’ve just checked and I don’t have these lines specifically. I’ve added them, but how can I be sure these are correct for my setup/are compatible?
I have some other lines in there which mention timeout, should I do anything with them instead?
Just want to make sure I don’t mess it up!
Here are the lines I have:
# timeout for dialing satellite during sending orders
# storage2.orders.sender-dial-timeout: 1m0s
# duration between sending
# storage2.orders.sender-interval: 1h0m0s
# timeout for sending
# storage2.orders.sender-timeout: 1h0m0s
# allows for small differences in the satellite and storagenode clocks
# storage2.retain-time-buffer: 48h0m0s
# how long to spend waiting for a stream operation before canceling
# storage2.stream-operation-timeout: 30m0s
There is more usage of the Storj network now, so there is more concurrency your drive has to handle.
Any additional concurrency on top of that only increases your problems. So indeed: disable virus scanners and the like for this drive.
The Windows Search indexing service should also be disabled for this drive.
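If it helps, a sketch of how I would do both of those, run in an elevated PowerShell window (the storage path is only a placeholder for your own location, and the first line assumes the built-in Microsoft Defender):
# exclude the node's storage folder from real-time virus scanning
Add-MpPreference -ExclusionPath "D:\storagenode\storage"
# stop and disable the Windows Search indexing service
Stop-Service WSearch
Set-Service WSearch -StartupType Disabled
Alternatively, you can leave the Search service running and just exclude that drive under Indexing Options.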
Thank you all so much!! Currently defragging and removing the index!
Regarding the changes in the config file - increasing the timeout values as a band-aid - does this mean I need to keep an eye out for a release that removes the need for this, so that I can take them out again? Or is that not necessary?
Please note - these checks are tests, and increasing their timeouts just makes the checks less effective at detecting a real problem: a slow or stalling (dying) disk.
Increasing these timeouts raises the risk that a hang or hardware failure goes undetected, and the node could then be disqualified for failing audits.
For example, say you were forced to increase the readable check timeout to 5 minutes to stop the crashes. That also means your node may be unable to provide a piece to a customer or to the auditor for those same 5 minutes. And if the node is unable to provide a piece for audit 3 times, with a 5 minute timeout each, that audit is considered failed.
If you are forced to set a higher timeout for the writeability check, it means the node cannot accept pieces from customers fast enough either, so the success rate will be low and it will see lower usage and a lower payout.
So I wouldn’t recommend changing these timeouts too much: increase them in 30s steps until the node stops crashing. However, if you reach 5 minutes for any of them, your disk likely has bigger issues than just the node’s crashes.
It’s simply not expected that a disk cannot write a small file even after a whole minute.
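To make the stepwise approach concrete, a rough example with the writability timeout mentioned above (the value is only an illustration):
# first step: default 1m0s plus 30s
storage2.monitor.verify-dir-writable-timeout: 1m30s
If the node still crashes, try 2m0s, then 2m30s, and so on; once you would need 5m0s, start looking at the disk itself instead.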
It’s better not to keep them too high, as explained above. So when you finish the defragmentation, you may try commenting them out again, saving the config and restarting the node, then monitoring it.
You may also tune the filesystem a little bit more:
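For example, the usual NTFS tweaks, run from an elevated PowerShell window (only a sketch, not a complete list):
# stop NTFS updating the last-accessed timestamp on every read
fsutil behavior set disablelastaccess 1
# stop NTFS creating short 8.3 names for newly created files (existing ones are kept)
fsutil 8dot3name set 1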
Thanks so much for the comprehensive answer, I really appreciate it!! All makes a lot more sense now, thank you!
I can also confirm that the 8dot3name has been disabled all along, and I’ve also disabled the ‘last accessed’ as per your other helpful suggestion too!
One thing I also saw on one of those posts:
And if you haven’t done so yet – move databases to your system drive.
Would you be able to kindly advise how I might do this? Thanks!
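A sketch of the usual approach, based on the storage2.database-dir option (the folder path below is only a placeholder): stop the storagenode service, create a folder on the system drive and copy every .db file from the storage location into it, point the node at the new folder in config.yaml, then start the service again and check the log that the databases opened from the new location.
# keep the node's SQLite databases on the system drive
storage2.database-dir: C:\storagenode-dbs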