Node suddenly failing

Hey,

I’ve been running a node for about 3 years without any issues, and suddenly, it’s just stopped working. It goes down, I restart it, and almost immediately it goes down again.

Some info:

Using Windows 10, static IP, wired Ethernet connection, version v1.105.4. Using a Western Digital 16TB HDD (bought brand new when I started the node). Port is still open.

Logs uploaded to my Google Drive here showing it failing several times.

I’m concerned that all the hard work of having it running consistently is about to be lost and I don’t know what to do!

I only make about $15 per month from this so it doesn’t make economic sense to spend hours debugging this!

Can anyone please help??

Here are some lines from the end of the log:

2024-06-23T21:32:00+01:00	INFO	lazyfilewalker.used-space-filewalker	starting subprocess	{"satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2024-06-23T21:32:00+01:00	ERROR	lazyfilewalker.used-space-filewalker	failed to start subprocess	{"satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "error": "context canceled"}
2024-06-23T21:32:00+01:00	ERROR	pieces	failed to lazywalk space used by satellite	{"error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:707\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2024-06-23T21:32:00+01:00	ERROR	piecestore:cache	error getting current used space: 	{"error": "filewalker: context canceled; filewalker: context canceled; filewalker: context canceled; filewalker: context canceled; filewalker: context canceled; filewalker: context canceled", "errorVerbose": "group:\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-06-23T21:32:00+01:00	ERROR	failure during run	{"error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:175\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:164\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-06-23T21:32:00+01:00	FATAL	Unrecoverable error	{"error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:175\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:164\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}

Essentially, your drive can’t keep up with the speed your storage node needs. I think this is a temporary thing, because they’re changing the way uploads are divided over nodes. For now, I would consider defragmenting the drive and changing the timeouts in the config for a while.
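For reference, defragmentation can also be started from an elevated command prompt; this is only a sketch, and H: stands in for whatever drive letter your node data lives on:

# Analyze first to see how fragmented the volume actually is
defrag H: /A /V

# Then run a full defragmentation with progress output
defrag H: /U /V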

The fact that it’s a 16TB drive at least rules out really subpar hardware such as an SMR disk. But this is a side effect of the evolution Storj is going through at the moment.

Don’t be too afraid of being offline for some days. You need to be 12 days offline to get suspended and about a month to get disqualified.

2 Likes

Thank you!! Is there any chance you can tell me specifically what changes to make in my config?

Sorry - I’m not super proficient with this stuff, unfortunately! So, if you’re able to be super specific, I’d really appreciate it!!

# storage2.monitor.verify-dir-readable-timeout: 1m0s
# storage2.monitor.verify-dir-writable-timeout: 1m0s

Remove the comment sign (#) and change the value to 5m0s or so.
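For example, the relevant lines in config.yaml would end up looking something like this (5m0s is just a starting value, not an official recommendation):

storage2.monitor.verify-dir-readable-timeout: 5m0s
storage2.monitor.verify-dir-writable-timeout: 5m0s

Save the file and restart the node so it picks up the new values.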

The problem is concurrency the node can’t keep up with, because it generates a lot of random I/O.

2 Likes

If your node can no longer cope, just limit its internet speed.

Please check in Windows that this drive is excluded from antivirus scanning and not used for page files.
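If the machine uses the built-in Microsoft Defender, a rough way to check and add an exclusion from an elevated PowerShell prompt is below (H: stands in for the node’s data drive; third-party antivirus products have their own exclusion settings):

# List the paths currently excluded from scanning (empty output means none)
Get-MpPreference | Select-Object -ExpandProperty ExclusionPath

# Exclude the whole node drive from scanning
Add-MpPreference -ExclusionPath "H:\"

The page file assignment can be reviewed under System Properties → Advanced → Performance Settings → Advanced → Virtual memory.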

3 Likes

Thank you! Are you able to kindly share how I check this?

But first, if it’s been working successfully for 3 years, isn’t it unlikely that this has suddenly become an issue?

Thank you!! Super helpful!

Is there any downside to setting it to this?

Put another way, why don’t they make this the default?

Also, I’ve just checked and I don’t have these lines specifically. I’ve added them, but how can I be sure these are correct for my setup/are compatible?

I have some other lines in there which mention timeout, should I do anything with them instead?

Just want to make sure I don’t mess it up!

Here are the lines I have:

# timeout for dialing satellite during sending orders
# storage2.orders.sender-dial-timeout: 1m0s

# duration between sending
# storage2.orders.sender-interval: 1h0m0s

# timeout for sending
# storage2.orders.sender-timeout: 1h0m0s

# allows for small differences in the satellite and storagenode clocks
# storage2.retain-time-buffer: 48h0m0s

# how long to spend waiting for a stream operation before canceling
# storage2.stream-operation-timeout: 30m0s

Probably because the load has increased. Haven’t you noticed the difference in the traffic you’re getting now?

More traffic leads to higher memory use, which means more page-file use as the node handles more data, so it all snowballs.
How To Manage Virtual Memory (Pagefile) In Windows 10 | Tom’s Hardware (tomshardware.com)

2 Likes

No, because:

  1. Your node has grown in the meantime.
  2. There is more usage of the Storj network, so there is more concurrency your drive has to handle.

Any other concurrency added on top of that increases your problems. So indeed: disable virus scanners and so on for this drive.
The search indexing service should also be disabled for it.

2 Likes

Because it’s a band-aid, not a real solution.

So also implement the other things suggested.

And then wait for the Storj satellite software to be adapted.

2 Likes

Perfect, thank you for this!

I’ve disabled the anti-virus on this drive, and I am currently defragmenting it :slight_smile:

When you have a moment, would you mind letting me know about the config file in my other message please?

Once that’s set, I think/hope I’m all done!

Thanks so much for jumping on this so quickly!

That’s probably because of the age of your node; the config file isn’t rewritten with new options on updates.

And no, you shouldn’t do anything with those lines. Keep them as they are.

2 Likes

This is for disabling the search indexing on a drive:
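For reference, the usual route is the drive’s Properties dialog (General tab), unticking “Allow files on this drive to have contents indexed in addition to file properties”. A rough PowerShell equivalent, assuming the node drive is H:, would be:

# Fetch the volume object for the node drive and turn content indexing off
$vol = Get-CimInstance -ClassName Win32_Volume -Filter "DriveLetter = 'H:'"
$vol | Set-CimInstance -Property @{ IndexingEnabled = $false }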

1 Like

Thanks @JWvdV and @Vadim! I’ve now disabled indexing :slight_smile:

One last question - what should I set my page file size to please? Please see screenshot.

Having no paging file on H: is OK.

3 Likes

Thank you all so much!! Currently defragging and removing the index!

Regarding the changes in the config file - increasing the timeouts as a band-aid - does this mean I need to keep an eye out for a release that removes the need for this, so that I can revert it? Or is that not necessary?

Hello @lookitsbenfb,
Welcome back!

Please note: these checks are tests, and increasing their timeouts just means the checks will be less effective at detecting a real problem, i.e. a slow or stalling (dying) disk.

Increasing these timeouts raises the risk of undetected hangs or hardware failures, and the node could be disqualified for failing audits.
For example, say you are forced to increase the readability check timeout to 5 minutes to stop the crashes. That also means your node may be unable to provide a piece to a customer or to the auditor for those same 5 minutes. And if the node is unable to provide a piece for audit 3 times, with a 5-minute timeout each time, that audit is considered failed.

If you are forced to set a higher timeout for the writability check, it means the node cannot accept pieces from customers fast enough either, so the success rate will be low and the node will see lower usage and a lower payout.

So I wouldn’t recommend changing these timeouts too much; increase them in 30s steps until the node no longer stops. However, if you reach 5 minutes for either of them, your disk likely has bigger issues than just the node’s crashes.

It’s simply not expected that a disk cannot write a small file even after a whole minute.

It’s better not to keep them too high, as explained above. So, when you finish the defragmentation, you may try to comment them out again, save the config, restart the node, and then monitor it.
You may also tune the filesystem a little bit more, by disabling 8dot3 short-name generation and last-access timestamp updates on the data drive.
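For reference, both tweaks can be applied from an elevated command prompt (a sketch; check the current values first, and note that a reboot may be needed before the behavior changes fully take effect):

# Check the current 8.3 short-name setting (1 means disabled on all volumes)
fsutil 8dot3name query

# Disable 8.3 short-name generation on all volumes
fsutil behavior set disable8dot3 1

# Stop NTFS from updating the last-access timestamp on every read
fsutil behavior set disablelastaccess 1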

1 Like

Thanks so much for the comprehensive answer, I really appreciate it!! All makes a lot more sense now, thank you!

I can also confirm that the 8dot3name has been disabled all along, and I’ve also disabled the ‘last accessed’ as per your other helpful suggestion too!

One thing I also saw on one of those posts:

And if you haven’t done so yet – move databases to your system drive.

Would you be able to kindly advise how I might do this? Thanks!
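For reference, the usual approach is roughly: stop the storagenode service, create a folder on the system drive, set storage2.database-dir in config.yaml to point at it, copy the existing *.db files from the data location into that folder, and start the service again. A sketch of the config entry (the folder path is only an example):

# store the node's SQLite databases on the system drive instead of the data drive (example path)
storage2.database-dir: "C:/storagenode-databases"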