Trash displays too low a value

Hello everyone,

I am currently migrating my two nodes from Windows (NTFS) to Linux with Docker (ext4).

Since counting such large directories in Windows is slow, I only counted the trash folder and noticed a difference of almost 240 GB compared to what the dashboard shows.
(screenshot: dashboard trash value)
Is there a way to manually trigger services such as the garbage collector, the filewalker, etc. to get the node “clean” once during the move?

I suspect the discrepancy arose because I had problems with the VM and the hard disk for a while.

Thanks for your help and tips on how to get my node back on track!

Sadly, I can confirm this on my fast and cached node (NTFS on Win11).

I can’t confirm this; I had no problems at all.

It actually works, but unfortunately I made some misconfigurations in the VM, and an HDD in that setup generally causes problems.
So it’s a problem of my own making.

However, I have learned that Linux is generally much more efficient for such services, hence the switch. In addition, the node now runs on dedicated, purpose-bought hardware.

PowerShell (sums all files in the trash folder older than 7 days):

Get-ChildItem z:\Node-3\Storage\trash\ -File -Recurse | Where-Object {$_.LastWriteTime -lt (Get-Date).AddDays(-7)} | Measure-Object -Property Length -Sum | %{"Total {0:f2} MB`ncount {1}" -f ($_.Sum/1e6), $_.Count}

You may also check that nothing is left over beyond the 7-day trash retention (the command below uses 8 days to allow a little slack):

Get-ChildItem z:\Node-3\Storage\trash\ -File -Recurse | Where-Object {$_.LastWriteTime -lt (Get-Date).AddDays(-8)} | Measure-Object -Property Length -Sum | %{"Total {0:f2} MB`ncount {1}" -f ($_.Sum/1e6), $_.Count}

The dashboard takes its results from the DB. The DB is updated either by the used-space filewalker (after start), by the trash chore (every 24 h), or by the garbage collector (1-2 times per week), for each satellite independently.
Of course, if these filewalkers fail, they will not update the database.
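
If you want to see which values are currently cached there, you can also query the database directly. This is only a sketch: it assumes sqlite3 is available, that the databases live in Z:\Node-3\Storage\ next to the blobs, and that the default database and table names (piece_spaced_used.db / piece_space_used) apply; run it against a copy, or while the node is stopped, so the live DB is not locked:

# Show the cached per-satellite usage totals that the dashboard reads
# (path and table name are assumptions - adjust to your setup)
sqlite3 "Z:\Node-3\Storage\piece_spaced_used.db" "SELECT * FROM piece_space_used;"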

OK, I have executed both commands and neither produced any results. I ran it again with 0 days and got 32527 files; in Explorer I now get the same number.
It seems it has already been cleaned up since my post, which is great.
But how do I get the DB files updated? Or is it like you said, that it will happen within a week at the latest anyway? Could I force it?

You could. Just restart the service, and if you didn’t disable the filewalker on start, it should process the data and update the databases.
You may check how long it takes for each satellite:

sls "used-space-filewalker" "C:\Program Files\Storj\Storage Node\storagenode.log" | sls "start|finish" | select -last 10

OK, I have restarted the service; these are the log entries so far:

Z:\Node-3\Log\storagenode.log:225524:2024-02-13T08:16:08+01:00  INFO    lazyfilewalker.used-space-filewalker    starting subprocess     {"satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
Z:\Node-3\Log\storagenode.log:225525:2024-02-13T08:16:08+01:00  ERROR   lazyfilewalker.used-space-filewalker    failed to start subprocess      {"satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "error": "context canceled"}
Z:\Node-3\Log\storagenode.log:225547:2024-02-13T08:16:48+01:00  INFO    lazyfilewalker.used-space-filewalker    starting subprocess     {"satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
Z:\Node-3\Log\storagenode.log:225550:2024-02-13T08:16:48+01:00  INFO    lazyfilewalker.used-space-filewalker    subprocess started      {"satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
Z:\Node-3\Log\storagenode.log:225567:2024-02-13T08:16:48+01:00  INFO    lazyfilewalker.used-space-filewalker.subprocess Database started        {"satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "process": "storagenode"}
Z:\Node-3\Log\storagenode.log:225568:2024-02-13T08:16:49+01:00  INFO    lazyfilewalker.used-space-filewalker.subprocess used-space-filewalker started   {"satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "process": "storagenode"}

But I am surprised that one satellite has an error.

I’ll get back to you if the display is still not correct once all the satellites are through.
Thank you very much!

This is a timeout from the disk, so it’s either overloaded or too slow to respond. I guess this satellite will accordingly not be accounted for in the used space and the trash usage.
If it’s a single local disk, the fix for “too slow” could be defragmentation (if NTFS) or using a native filesystem (NTFS for Windows, ext4 for Linux).
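
If it is NTFS fragmentation, a defragmentation pass can be started from an elevated PowerShell, for example (assuming the node data lives on drive Z: as in the paths above):

# Analyze first to see how fragmented the volume actually is
Optimize-Volume -DriveLetter Z -Analyze -Verbose
# Then run a full defragmentation pass (can take many hours on a nearly full drive)
Optimize-Volume -DriveLetter Z -Defrag -Verbose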

And how would you explain that on my node? It’s also off by 100%: 15 vs. 30 GB.
Oh, and it’s still the same as in the screenshot, though it’s not much.

Do you have errors during used-space-filewalker or collector too?

I started logging for this, but after 8 h the filewalk is still running on 3.8 TB. Here are the logs:

2024-02-14T08:03:48+01:00	ERROR	pieces:trash	emptying trash failed	{"error": "pieces error: filestore error: context canceled", "errorVerbose": "pieces error: filestore error: context canceled\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).EmptyTrash:176\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).EmptyTrash:316\n\tstorj.io/storj/storagenode/pieces.(*Store).EmptyTrash:416\n\tstorj.io/storj/storagenode/pieces.(*TrashChore).Run.func1.1:83\n\tstorj.io/common/sync2.(*Workplace).Start.func1:89"}
2024-02-14T08:03:48+01:00	ERROR	piecestore	download failed	{"Piece ID": "K2UV2HVSHHCSOMBNVBO3U6W2DPWD6FLQAX3NGENRR5FMN75M2H6Q", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Offset": 0, "Size": 65536, "Remote Address": "79.127.220.97:59618", "error": "write tcp 192.168.2.124:28967->79.127.220.97:59618: use of closed network connection", "errorVerbose": "write tcp 192.168.2.124:28967->79.127.220.97:59618: use of closed network connection\n\tstorj.io/drpc/drpcstream.(*Stream).rawFlushLocked:401\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:462\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:349\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).sendData.func1:861\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}
2024-02-14T08:03:48+01:00	ERROR	pieces	failed to lazywalk space used by satellite	{"error": "lazyfilewalker: exit status 1", "errorVerbose": "lazyfilewalker: exit status 1\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:83\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:105\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:717\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-02-14T08:03:48+01:00	ERROR	lazyfilewalker.used-space-filewalker	failed to start subprocess	{"satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "error": "context canceled"}
2024-02-14T08:03:48+01:00	ERROR	pieces	failed to lazywalk space used by satellite	{"error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:71\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:105\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:717\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2024-02-14T08:03:48+01:00	ERROR	lazyfilewalker.used-space-filewalker	failed to start subprocess	{"satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "error": "context canceled"}
2024-02-14T08:03:48+01:00	ERROR	pieces	failed to lazywalk space used by satellite	{"error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:71\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:105\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:717\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-02-14T08:03:48+01:00	ERROR	lazyfilewalker.used-space-filewalker	failed to start subprocess	{"satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "error": "context canceled"}
2024-02-14T08:03:48+01:00	ERROR	pieces	failed to lazywalk space used by satellite	{"error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:71\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:105\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:717\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2024-02-14T08:03:48+01:00	ERROR	piecestore:cache	error getting current used space: 	{"error": "filewalker: context canceled; filewalker: context canceled; filewalker: context canceled; filewalker: context canceled", "errorVerbose": "group:\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:69\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:74\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:726\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:69\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:74\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:726\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:69\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:74\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:726\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:69\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:74\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:726\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
2024-02-14T08:04:37+01:00	ERROR	piecestore	download failed	{"Piece ID": "BR5NTWMVSFWYYNZIF62WDXN4O36PTYP4K3EOFYQVOIE4YZADF32Q", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Remote Address": "79.127.201.209:46236", "error": "untrusted: unable to get signee: trust: rpc: tcp connector failed: rpc: dial tcp: lookup us1.storj.io: operation was canceled", "errorVerbose": "untrusted: unable to get signee: trust: rpc: tcp connector failed: rpc: dial tcp: lookup us1.storj.io: operation was canceled\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).VerifyOrderLimitSignature:140\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).verifyOrderLimit:62\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:616\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:251\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:124\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:114\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35"}
2024-02-14T08:04:37+01:00	ERROR	piecestore	download failed	{"Piece ID": "OSUL7V5SKNLUPDDCRWSB3FVM6M7YPJGE5TORJK4CI6T7TABYXKOA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Remote Address": "79.127.219.33:60376", "error": "trust: rpc: tcp connector failed: rpc: context canceled", "errorVerbose": "trust: rpc: tcp connector failed: rpc: context canceled\n\tstorj.io/common/rpc.HybridConnector.DialContext.func1:190"}
2024-02-14T08:04:40+01:00	ERROR	piecestore	download failed	{"Piece ID": "E5E5E5BP6BXEU5JBDULB4CHTQP655QP6TJSOSXEKD4VP3IVCXKPQ", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET", "Remote Address": "79.127.220.99:36258", "error": "trust: rpc: tcp connector failed: rpc: context canceled", "errorVerbose": "trust: rpc: tcp connector failed: rpc: context canceled\n\tstorj.io/common/rpc.HybridConnector.DialContext.func1:190"}

Now trash is at 32 GB on the dashboard, after less than 12 h.

My guess here is: the lazy filewalker has a bug.
It’s on all satellites. The disk was recently checked for errors and fragmentation is below 0.2%. SSD read cache and RAM write cache. Win11 Pro with latest updates.
Walks usually put a 94% load on the disk, with good seek times of 5-10 ms.
I will log errors for a while, just in case (usually I’m on fatal).
Let me know if I can help somehow.

“context canceled” means a timeout. It seems you need to switch off the lazy mode; it cannot finish successfully otherwise.
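
As far as I know this is a config.yaml option (double-check the name against your config file) and needs a node restart to take effect:

# config.yaml - run the used-space filewalker with normal priority instead of lazy mode
pieces.enable-lazy-filewalker: false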

Done, lazy is off and the node restarted.
I will wait for it to finish and check the logs again.

After 1:30 h the filewalk was done, with no errors.
Neither trash nor used space shows unusual values compared to after the first restart.

I think the normal filewalker stepped in 10-12 h after the first restart.

On the next node restart: log level back to fatal.
Lazy stays off.
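
For reference, assuming the standard option name, the log change is one line in config.yaml plus a restart:

# config.yaml - log only fatal messages again
log.level: fatal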

The used space filewalker didn’t appear in my logs right after the restart:

5-9 hours in my case