I’ve been facing the following issue: a while ago, I recreated all the databases, including the one that holds the used-space data.
Since then, I haven’t been able to see my total used space; it resets every time the node restarts (it handles about 800 GB and then restarts). I always thought this was due to IOPS/latency issues with my previous storage solution, but I’ve now switched to ZFS with a special vdev (enterprise U.2 NVMe), a 128 GB ARC cache, and fast disks.
And the filewalker never completes. I turned on info logging and saw the following:
C:\Program Files\Storj\Storage Node\storagenode.log:313923:2024-02-23T12:36:30+01:00 INFO lazyfilewalker.used-space-filewalker subprocess started {"satelliteID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB"}
C:\Program Files\Storj\Storage Node\storagenode.log:314005:2024-02-23T12:36:30+01:00 INFO lazyfilewalker.used-space-filewalker.subprocess Database started {"satelliteID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "process": "storagenode"}
C:\Program Files\Storj\Storage Node\storagenode.log:314006:2024-02-23T12:36:30+01:00 INFO lazyfilewalker.used-space-filewalker.subprocess used-space-filewalker started {"satelliteID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "process": "storagenode"}
C:\Program Files\Storj\Storage Node\storagenode.log:317651:2024-02-23T12:45:27+01:00 INFO lazyfilewalker.used-space-filewalker subprocess finished successfully {"satelliteID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB"}
C:\Program Files\Storj\Storage Node\storagenode.log:317653:2024-02-23T12:45:27+01:00 INFO lazyfilewalker.used-space-filewalker subprocess started {"satelliteID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo"}
C:\Program Files\Storj\Storage Node\storagenode.log:317654:2024-02-23T12:45:27+01:00 INFO lazyfilewalker.used-space-filewalker.subprocess Database started {"satelliteID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "process": "storagenode"}
C:\Program Files\Storj\Storage Node\storagenode.log:317655:2024-02-23T12:45:27+01:00 INFO lazyfilewalker.used-space-filewalker.subprocess used-space-filewalker started {"satelliteID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "process": "storagenode"}
C:\Program Files\Storj\Storage Node\storagenode.log:317753:2024-02-23T12:45:45+01:00 INFO lazyfilewalker.used-space-filewalker subprocess finished successfully {"satelliteID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo"}
C:\Program Files\Storj\Storage Node\storagenode.log:317755:2024-02-23T12:45:45+01:00 INFO lazyfilewalker.used-space-filewalker subprocess started {"satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
C:\Program Files\Storj\Storage Node\storagenode.log:317756:2024-02-23T12:45:45+01:00 INFO lazyfilewalker.used-space-filewalker.subprocess Database started {"satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "process": "storagenode"}
C:\Program Files\Storj\Storage Node\storagenode.log:317757:2024-02-23T12:45:45+01:00 INFO lazyfilewalker.used-space-filewalker.subprocess used-space-filewalker started {"satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "process": "storagenode"}
C:\Program Files\Storj\Storage Node\storagenode.log:320733:2024-02-23T12:54:17+01:00 INFO lazyfilewalker.used-space-filewalker subprocess finished successfully {"satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
C:\Program Files\Storj\Storage Node\storagenode.log:320735:2024-02-23T12:54:17+01:00 INFO lazyfilewalker.used-space-filewalker subprocess started {"satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
C:\Program Files\Storj\Storage Node\storagenode.log:320739:2024-02-23T12:54:17+01:00 INFO lazyfilewalker.used-space-filewalker.subprocess Database started {"satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "process": "storagenode"}
C:\Program Files\Storj\Storage Node\storagenode.log:320740:2024-02-23T12:54:17+01:00 INFO lazyfilewalker.used-space-filewalker.subprocess used-space-filewalker started {"satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "process": "storagenode"}
It seems that the filewalker does finish, but the used space is still not updated anyway. What can I do to resolve this?
On my other node, which runs the normal filewalker rather than the lazy one, I get this error message:
It would be great if, whenever the filewalker saves its progress, it also output how much it has walked since the start. That would let us Storj operators see how fast it is going and fine-tune our systems.
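In the meantime, the start/finish INFO lines in the excerpt above already allow a rough per-satellite timing. A minimal sketch (the regexes are assumptions based on the log format pasted above, not an official parser):

```python
import re
from datetime import datetime

# ISO-8601 timestamp with offset, as seen at the start of each log line.
TS = r'(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[+-]\d{2}:\d{2})'
START_RE = re.compile(
    TS + r'\s+INFO\s+lazyfilewalker\.used-space-filewalker\s+subprocess started'
         r'.*?"satelliteID":\s*"([^"]+)"')
FINISH_RE = re.compile(
    TS + r'\s+INFO\s+lazyfilewalker\.used-space-filewalker\s+subprocess finished successfully'
         r'.*?"satelliteID":\s*"([^"]+)"')

def walk_durations(lines):
    """Yield (satellite_id, seconds) for each completed used-space walk."""
    started = {}
    for line in lines:
        if (m := START_RE.search(line)):
            ts, sat = m.groups()
            started[sat] = datetime.fromisoformat(ts)
        elif (m := FINISH_RE.search(line)):
            ts, sat = m.groups()
            if sat in started:
                yield sat, (datetime.fromisoformat(ts) - started.pop(sat)).total_seconds()

# Usage (the log path is your node's; shown here as an example):
# with open(r"C:\Program Files\Storj\Storage Node\storagenode.log",
#           encoding="utf-8", errors="replace") as f:
#     for sat, secs in walk_durations(f):
#         print(f"{sat[:12]}...  {secs:.0f}s")
```

Against the excerpt above, this reports roughly 9 minutes for the first satellite and 18 seconds for the second, which already shows how unevenly the walk time is distributed.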
What’s the next error after that? I guess “context canceled” = timeout = the disk is too slow to respond.
How is this disk connected? Is it a VM (I guess yes), or, even worse, a Windows VM on a Linux host with QEMU emulation, like Proxmox?
If the answer is “yes”, then either go bare metal, or avoid Windows and use LXC containers directly on Proxmox, or at least a Linux VM (but DO NOT use NTFS!) instead.
2024-02-24T08:43:43+01:00 INFO piecestore uploaded {"Piece ID": "YJLJC4HO34CV5F67EEHOXYT2DDV26JVE5S3UKADAUINT2UZXWASQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Size": 2174464, "Remote Address": "79.127.205.226:36002"}
2024-02-24T08:43:43+01:00 INFO lazyfilewalker.used-space-filewalker subprocess exited with status {"satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "status": 2, "error": "exit status 2"}
2024-02-24T08:43:43+01:00 ERROR pieces failed to lazywalk space used by satellite {"error": "lazyfilewalker: exit status 2", "errorVerbose": "lazyfilewalker: exit status 2\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:83\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:105\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:717\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2024-02-24T08:43:43+01:00 INFO piecestore download started {"Piece ID": "O2QWIZ32LEVPBYRCC6J5WYPPHJTTEISWIAWMTXNJITUWHAQO23OQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET", "Offset": 0, "Size": 256, "Remote Address": "128.140.89.26:35830"}
2024-02-24T08:43:44+01:00 INFO piecestore download started {"Piece ID": "7DRBBSC35YKSDZHKZSJCAOCCPZEEIQTLXSLSMKGFET4AV7J2WEOQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Offset": 0, "Size": 181248, "Remote Address": "79.127.220.99:34048"}
There are no more errors after that; that is the last error.
This is a VM on VMware, with storage over iSCSI on 40 Gbit fiber through Juniper QFX switches to TrueNAS.
The OS is on NVMe with 800k IOPS.
The databases are on NVMe with 800k IOPS.
The Storj data is on 2x striped 7200 rpm WD datacenter 18 TB drives with a 128 GB RAM cache and an L2ARC, also 800k IOPS.
I don’t have any latency issues on the Storj data (10 ms avg).
I don’t have any disk congestion.
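Raw IOPS figures like these don’t necessarily predict filewalker speed: the walk is millions of small metadata operations (directory listing plus stat per piece), not sequential block I/O. A rough way to benchmark that specific workload is to time stat calls over the blobs tree; this is only a sketch, and the blobs path in the usage comment is hypothetical:

```python
import os
import time

def scan_rate(path, limit=100_000):
    """Walk up to `limit` files under `path`, stat each one, and return
    (files_seen, stats_per_second) -- a rough proxy for filewalker speed."""
    seen = 0
    start = time.monotonic()
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                os.stat(os.path.join(root, name))
            except OSError:
                continue  # file deleted mid-scan; skip it
            seen += 1
            if seen >= limit:
                break
        if seen >= limit:
            break
    elapsed = time.monotonic() - start
    return seen, (seen / elapsed) if elapsed > 0 else 0.0

# Example (hypothetical path -- point it at your node's blobs directory):
# files, rate = scan_rate(r"D:\storagenode\blobs")
# print(f"{files} files at {rate:.0f} stats/s")
```

If this rate is in the low thousands per second on a node holding tens of millions of pieces, a multi-hour walk is expected regardless of the headline IOPS of the pool.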
I believe that is exactly the response after which it will not continue…
I would try to collect a list…
Each of these increases the filewalk time.
I guess that you also use ZFS, or, even worse, BTRFS under the hood.
And, of course, without a special vdev; and to make it completely bad, you likely used a single-drive volume. Am I right?
Disclaimer: this is not an attempt to offend anyone. It’s just a collection of what could be done worse than a simple ext4 drive…
The performance is confirmed by the filewalker; no offense, it’s a fact.
If you have problems with the filewalker, your setup is not optimal (but we do have a bug here, so we share).
Even worse, but we are the architects of our own happiness.