Lazy Filewalker Errors and Other Errors since Trash Clearance (v1.112.2)

Since the bloom filters and subsequent trash clearance in early September, there have been no lazy filewalker entries in the log other than those triggered by a restart, and I have seen the dashboard get out of step once more - so I restarted the node today.

Sadly, I am now seeing lots of errors, even though it had been running fine all day.

2024-09-20T13:50:14Z    ERROR   pieces  used-space-filewalker failed    {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Lazy File Walker": false, "error": "filewalker: context canceled", "errorVerbose": "filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:731\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:81\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-09-20T13:50:14Z    INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2024-09-20T13:50:14Z    INFO    lazyfilewalker.used-space-filewalker    starting subprocess     {"Process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2024-09-20T13:50:14Z    ERROR   lazyfilewalker.used-space-filewalker    failed to start subprocess      {"Process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "error": "context canceled"}
2024-09-20T13:50:14Z    ERROR   pieces  used-space-filewalker failed    {"Process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Lazy File Walker": true, "error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:133\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:722\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:81\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-09-20T13:50:14Z    ERROR   pieces  used-space-filewalker failed    {"Process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Lazy File Walker": false, "error": "filewalker: context canceled", "errorVerbose": "filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:731\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:81\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}

A grep for “error” returns loads of log entries like this. Is it unable to delete a piece?

2024-09-20T15:53:19Z    ERROR   piecestore      upload internal error   {"Process": "storagenode", "error": "manager closed: unexpected EOF", "errorVerbose": "manager closed: unexpected EOF\n\tgithub.com/jtolio/noiseconn.(*Conn).readMsg:225\n\tgithub.com/jtolio/noiseconn.(*Conn).Read:171\n\tstorj.io/drpc/drpcwire.(*Reader).read:68\n\tstorj.io/drpc/drpcwire.(*Reader).ReadPacketUsing:113\n\tstorj.io/drpc/drpcmanager.(*Manager).manageReader:230"}
2024-09-20T15:53:19Z    ERROR   piecestore      upload failed   {"Process": "storagenode", "Piece ID": "XMYVWMSWKXGNLOTJIXHDF3IOKEBNVH2D3M4FV7L5WV6GKAGVDMKQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT", "Remote Address": "79.127.219.35:45062", "Size": 2162688, "error": "manager closed: unexpected EOF", "errorVerbose": "manager closed: unexpected EOF\n\tgithub.com/jtolio/noiseconn.(*Conn).readMsg:225\n\tgithub.com/jtolio/noiseconn.(*Conn).Read:171\n\tstorj.io/drpc/drpcwire.(*Reader).read:68\n\tstorj.io/drpc/drpcwire.(*Reader).ReadPacketUsing:113\n\tstorj.io/drpc/drpcmanager.(*Manager).manageReader:230"}

Logs were clean prior to the bloom filters and trash clearance at the end of August.

Running inside TrueNAS Scale on a multi-drive ZFS RAIDZ2 pool with no failures and clean drive scrubs. There are no errors in the TrueNAS logs, and the other VMs report no issues.

Any help much appreciated

Thanks
CC

The default answer for the lazy filewalkers failing with a “context canceled” message is that it’s due to slow disk performance. Try disabling the lazy walker and restarting so the conventional non-lazy walker runs instead. It usually works better.

I see the error a lot and I’m convinced the node code is being too picky about cancelling the lazy filewalkers, but I’m not smart enough to know what the threshold is.

The “manager closed: unexpected EOF” errors are routine. That’s the upload or download being canceled by the client or by the satellite (because it already finished with other nodes). If they are only sporadic it’s not a problem. If you get it with EVERY attempt then there is a problem and it would spam your log… probably shortly before the node crashes :)
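If you want to gauge how frequent they are, here is a rough sketch (assuming a Docker-style deployment with a container named storagenode, and the usual “uploaded” / “upload failed” log wording, both of which may differ on your install):

# compare failed vs. successful uploads to see whether cancellations are sporadic or constant
docker logs storagenode 2>&1 | grep -c "upload failed"
docker logs storagenode 2>&1 | grep -c "uploaded"

A handful of failures against thousands of successes is nothing to worry about.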

Also, you say you are running RAIDZ2 with your node. If the disks are ONLY for Storj it’s generally more optimal to run one node per disk. You don’t need redundancy, so it’s just wasted capacity, and having independent nodes and disks actually lets you serve more IOPS.

Thanks @EasyRhino for the quick reply.

It’s been running on the same setup for a while, and I’ve never seen these errors before with the default filewalker. Disk access is very fast, with loads of memory for the ARC and a second-level SSD cache as well. Read/write performance is very fast, so it’s odd that it’s complaining about that.

I will find out how to change the filewalker and restart as you say. I just looked, and all the filewalkers have completed.

Hang on: the node must have restarted, as it says uptime is only a few hours. The errors must be from when the filewalkers were killed? Only the end of the log looks like a restart, with filewalkers starting and ending normally a few hours later and no errors in between. Will trawl the logs.

This hardware is not dedicated to Storj; it was just easy to allocate storage on the existing NAS setup, and the node is there to offset cloud backup costs.

Thanks
CC

So the issue is an unscheduled node restart, hours into the used-space filewalker run. Here are the entries just before the restart, including “error getting current used space for trash:”.

No idea what this means; the system has been very stable.

2024-09-20T13:50:14Z    INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-09-20T13:50:14Z    INFO    lazyfilewalker.used-space-filewalker    starting subprocess     {"Process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-09-20T13:50:14Z    ERROR   lazyfilewalker.used-space-filewalker    failed to start subprocess      {"Process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "error": "context canceled"}
2024-09-20T13:50:14Z    ERROR   pieces  used-space-filewalker failed    {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Lazy File Walker": true, "error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/sto>
2024-09-20T13:50:14Z    ERROR   pieces  used-space-filewalker failed    {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Lazy File Walker": false, "error": "filewalker: context canceled", "errorVerbose": "filewalker: context canceled\n\tstorj.io/storj/storagenod>
2024-09-20T13:50:14Z    ERROR   piecestore:cache        encountered error while computing space used by satellite       {"Process": "storagenode", "error": "filewalker: context canceled", "errorVerbose": "filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\t>
2024-09-20T13:50:14Z    ERROR   piecestore:cache        error getting current used space for trash:     {"Process": "storagenode", "error": "filestore error: failed to walk trash namespace 7b2de9d72c2e935f1918c058caaf8ed00f0581639008707317ff1bd000000000: context canceled", "errorVerbose": "filestore error: fai>
2024-09-20T13:50:30Z    INFO    Configuration loaded    {"Process": "storagenode", "Location": "/app/config/config.yaml"}
2024-09-20T13:50:30Z    INFO    Anonymized tracing enabled      {"Process": "storagenode"}

Thanks
CC

Is your version newer? You may have kicked off an auto-update, and that threw the errors while shutting down.

You could always restart your node and then, many hours later, see if the used-space filewalkers all finish okay.
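One quick way to check is to look for a “started” line and a later completion line per satellite; a sketch, again assuming a Docker-style deployment named storagenode (the exact completion wording can vary between versions):

# each satellite should show a started line followed later by a completed/finished line
docker logs storagenode 2>&1 | grep "used-space-filewalker"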

If they DON’T finish, then change to non-lazy (modify the config.yaml file and change the ‘lazy’ option to false).
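For reference, a minimal sketch of the relevant line (the option corresponds to the --pieces.enable-lazy-filewalker flag):

# config.yaml (hypothetical excerpt): run the filewalkers in-process instead of as lazy subprocesses
pieces.enable-lazy-filewalker: false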

As said earlier, the “context canceled” for the lazy filewalker means that the disk responsiveness is far from ideal. However, it may also be a consequence of the node’s process being killed, so please search for errors like Unrecoverable and/or FATAL somewhere earlier in the log.
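For example (a sketch assuming a Docker-style deployment with a container named storagenode; adjust to wherever your logs actually live):

docker logs storagenode 2>&1 | grep -E "Unrecoverable|FATAL"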

For the TrueNAS node it could be easier to set an environment variable instead, i.e.

STORJ_PIECES_ENABLE_LAZY_FILEWALKER=false

Then I would also suggest enabling the badger cache too:

STORJ_PIECES_FILE_STAT_CACHE=badger

FYI

      --pieces.enable-lazy-filewalker                            run garbage collection and used-space calculation filewalkers as a separate subprocess with lower IO priority (default true)
      --pieces.file-stat-cache string                            optional type of file stat cache. Might be useful for slow disk and limited memory. Available options: badger (EXPERIMENTAL)
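For a node started with docker run, a minimal sketch of passing those variables (the container name and omitted mounts/ports here are placeholders; on TrueNAS Scale the equivalent is the environment-variables section of the app configuration):

docker run -d --name storagenode \
  -e STORJ_PIECES_ENABLE_LAZY_FILEWALKER=false \
  -e STORJ_PIECES_FILE_STAT_CACHE=badger \
  ... \
  storjlabs/storagenode:latest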

Just wanted to highlight that the likely cause of these errors was indeed a version upgrade forcing a restart of the node shortly after the previous restart, while the used-space filewalkers were still running, which killed off those processes. The node has been up and stable for 90 hours.

Thanks for the help

CC