The node does not recheck the occupied space and therefore the disk is full

My node's dashboard shows that it has 300 GB stored, but on disk it takes up about 1 TB more!
I tried rebooting, and I tried enabling and disabling these configuration options:

storage2.piece-scan-on-startup: true
pieces.enable-lazy-filewalker: true

Then I decided to delete the database files so that the node would re-check and recalculate the space.
It didn't help; this is what I get in the log file:

2023-11-28T07:08:48+02:00 ERROR pieces failed to lazywalk space used by satellite {error: lazyfilewalker: exit status 1, errorVerbose: lazyfilewalker: exit status 1\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:83\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:105\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:707\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:44\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75, Satellite ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S}
2023-11-28T07:08:48+02:00 ERROR lazyfilewalker.used-space-filewalker failed to start subprocess {satelliteID: 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs, error: context canceled}
2023-11-28T07:08:48+02:00 ERROR pieces failed to lazywalk space used by satellite {error: lazyfilewalker: context canceled, errorVerbose: lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:71\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:105\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:707\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:44\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75, Satellite ID: 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs}
2023-11-28T07:08:48+02:00 ERROR piecestore:cache error getting current used space: {error: filewalker: context canceled; filewalker: context canceled, errorVerbose: group:\n— filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:69\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:74\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:44\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75\n— filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:69\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:74\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:44\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75}
2023-11-28T07:19:37+02:00 ERROR piecestore download failed {Piece ID: JUUGO3773PYJP5KHAJSYPXU75V22GKMIR6M4752KI5W6EENBCIYA, Satellite ID: 121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6, Action: GET, Remote Address: 79.127.203.193:37308, error: trust: rpc: tcp connector failed: rpc: dial tcp 34.80.215.116:7777: operation was canceled, errorVerbose: trust: rpc: tcp connector failed: rpc: dial tcp 34.80.215.116:7777: operation was canceled\n\tstorj.io/common/rpc.HybridConnector.DialContext.func1:190}
2023-11-28T07:23:47+02:00 ERROR pieces failed to lazywalk space used by satellite {error: lazyfilewalker: exit status 1, errorVerbose: lazyfilewalker: exit status 1\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:83\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:105\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:717\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75, Satellite ID: 1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE}
2023-11-28T07:23:47+02:00 ERROR lazyfilewalker.used-space-filewalker failed to start subprocess {satelliteID: 121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6, error: context canceled}
2023-11-28T07:23:47+02:00 ERROR pieces failed to lazywalk space used by satellite {error: lazyfilewalker: context canceled, errorVerbose: lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:71\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:105\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:717\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75, Satellite ID: 121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6}
2023-11-28T07:23:47+02:00 ERROR lazyfilewalker.used-space-filewalker failed to start subprocess {satelliteID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S, error: context canceled}
2023-11-28T07:23:47+02:00 ERROR pieces failed to lazywalk space used by satellite {error: lazyfilewalker: context canceled, errorVerbose: lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:71\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:105\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:717\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75, Satellite ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S}
2023-11-28T07:23:47+02:00 ERROR lazyfilewalker.used-space-filewalker failed to start subprocess {satelliteID: 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs, error: context canceled}
2023-11-28T07:23:47+02:00 ERROR pieces failed to lazywalk space used by satellite {error: lazyfilewalker: context canceled, errorVerbose: lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:71\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:105\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:717\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75, Satellite ID: 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs}
2023-11-28T07:23:47+02:00 ERROR piecestore:cache error getting current used space: {error: filewalker: context canceled; filewalker: context canceled; filewalker: context canceled; filewalker: context canceled, errorVerbose: group:\n— filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:69\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:74\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:726\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75\n— filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:69\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:74\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:726\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75\n— filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:69\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:74\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:726\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75\n— filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:69\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:74\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:726\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75}

In the configuration I have now set the allocated size to only 250 GB so that, no matter what, it won't try to load new data onto the node, because only about 300 MB is left free on the disk. When the problems first appeared I tried reducing the node's allocated size to the minimum allowed so that it would start and run the space check, but that did not happen.
The problem is on a Windows server. I have checked the disks: they are defragmented and everything looks fine, but after a reboot the node does not start loading the disk.
I discovered this problem 1-2 weeks ago. The node version is updated automatically all the time.
What should I do?

What's the filesystem on this disk? If it's exFAT, it will waste space because of the huge cluster size, and you need to migrate it to NTFS ASAP.

No, of course not; I only use NTFS.

Is it SMR? How is it connected?
Do you use a VM to run Windows Server?

Yes, a Windows Server virtual machine is used, but the disk is dedicated and has been tested.
There is plenty of RAM, and the disk is a server-grade drive, NOT SMR.

It's not a matter of hardware characteristics or the filesystem; please look at the log.

It does matter how the disk is connected, since there are timeouts (context canceled) during the filewalker run.
Please describe your setup in more detail. What are the host and the hypervisor, and how did you pass the disk to the VM?

I have many nodes and they are all set up the same way.
Hypervisor: Hyper-V
Virtual hard disk: VHDX
The average response time is 3-4 ms; even during defragmentation it is no more than 150-200 ms at its peak.
I have rebooted many times; the disk is good and this setup has been tested for many years and works.

I can enable debug logging and send the file to technical support if that helps…

Yes, but with what cluster size?

Guys, I wrote that the node displays only 300 GB, but 1,300 GB is occupied on disk. Even with a standard 4K or 64K cluster size, it would still work perfectly on such a small node.
It is not possible for NTFS metadata to occupy 1 terabyte. Besides,
my actual size and size on disk are almost the same (here is an example of a small file):
Size 1.02 MB (1,071,360 bytes)
On Disk 1.03 MB (1,081,344 bytes)

To test your assumption, let's say there are 2,000,000 files on the disk and each has an overrun of 0.01 MB (I rounded up with a margin): that is 20,000 MB, i.e. only 20 GB of theoretical overhead on ~1 MB files.
The node is relatively young and does not have those old kilobyte-sized pieces; it was started this year.
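To put rough numbers on that (a quick sketch; the 2,000,000-file figure above is illustrative, not a measured value), the allocation slack per file is bounded by one cluster, so even the worst case stays far below a missing terabyte:

# Rough upper bound on NTFS allocation slack ("size on disk" minus "size"),
# using the illustrative figure of 2,000,000 files from the post above.
FILES = 2_000_000

for cluster in (4 * 1024, 64 * 1024):        # 4 KiB and 64 KiB cluster sizes
    worst = FILES * (cluster - 1)            # at most one cluster of slack per file
    avg = FILES * (cluster // 2)             # roughly half a cluster of slack on average
    print(f"{cluster // 1024} KiB clusters: worst ~{worst / 1e9:.1f} GB, average ~{avg / 1e9:.1f} GB")

# 4 KiB clusters:  worst ~8.2 GB,   average ~4.1 GB
# 64 KiB clusters: worst ~131.1 GB, average ~65.5 GB
# Either way, nowhere near a 1 TB discrepancy.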

What I'm saying is that at startup the node reads at a speed of several megabytes per second for a couple of minutes, and then the speed is 0.5 MB/s at most - it does not load the disk… At the same time, the response time is only 3-5 ms.

Yes, saying that you use NTFS is enough. exFAT uses a 128 KiB cluster size, and you are unlikely to have changed the default cluster size for NTFS, so it's not relevant.
In your case the only problem is that the filewalker, and seemingly also retain (search the log for "ERROR"/"failed" and "retain"), cannot finish their work, so deleted data is not cleaned up from your node.
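If it helps, a minimal sketch of that log search (the log path is only an example; point it at wherever your node actually writes its log):

# Minimal sketch: print log lines that mention filewalker/retain problems.
# The path below is an example; replace it with your node's actual log file.
LOG_PATH = r"C:\Program Files\Storj\Storage Node\storagenode.log"

with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        if ("retain" in line or "filewalker" in line) and ("ERROR" in line or "failed" in line):
            print(line.rstrip())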

Did you remove the decommissioned satellites?

Since this problem often happens with virtual setups, network storage, or SMR drives, I am convinced that the problem is the huge response time of the storage when the filewalker tries to traverse all pieces on the disk.
Usually adding memory for the filesystem cache can help, or perhaps you need to use PrimoCache or enable write caching, but you must have a UPS in that case. Another way may be to use tiered storage with an SSD.

You may also try disabling the lazy filewalker; the scan could finish faster with a higher priority.

We managed to recalculate the space.
Version - v1.93.0-rc
The config had:

storage2.piece-scan-on-startup: true
pieces.enable-lazy-filewalker: false   # lazy filewalker disabled

The entire process ran at a disk speed of about 512 KB/s, with minimal latency.

It would be nice if the configuration had an option to run the check at full speed. That is not necessary for everyday operation, but for repair work it would be good to be able to run it at maximum priority.

After the successful recalculation, everything in the configuration was turned back on as before. I see the errors again.

2023-11-29T07:06:20+02:00 ERROR piecestore download failed {Piece ID: 6B6CLZZKLWQEHU7PHHP7FL322ZIC4K4GNESMRGR6EZWUVRWCIYBA, Satellite ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S, Action: GET, Remote Address: 79.127.203.193:59832, error: trust: rpc: tcp connector failed: rpc: context canceled, errorVerbose: trust: rpc: tcp connector failed: rpc: context canceled\n\tstorj.io/common/rpc.HybridConnector.DialContext.func1:190}
2023-11-29T07:06:41+02:00 ERROR pieces failed to lazywalk space used by satellite {error: lazyfilewalker: exit status 1, errorVerbose: lazyfilewalker: exit status 1\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:83\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:105\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:717\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75, Satellite ID: 121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6}
2023-11-29T07:06:41+02:00 ERROR lazyfilewalker.used-space-filewalker failed to start subprocess {satelliteID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S, error: context canceled}
2023-11-29T07:06:41+02:00 ERROR pieces failed to lazywalk space used by satellite {error: lazyfilewalker: context canceled, errorVerbose: lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:71\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:105\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:717\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75, Satellite ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S}
2023-11-29T07:06:41+02:00 ERROR lazyfilewalker.used-space-filewalker failed to start subprocess {satelliteID: 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs, error: context canceled}
2023-11-29T07:06:41+02:00 ERROR pieces failed to lazywalk space used by satellite {error: lazyfilewalker: context canceled, errorVerbose: lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:71\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:105\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:717\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75, Satellite ID: 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs}
2023-11-29T07:06:41+02:00 ERROR piecestore:cache error getting current used space: {error: filewalker: context canceled; filewalker: context canceled; filewalker: context canceled, errorVerbose: group:\n— filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:69\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:74\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:726\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75\n— filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:69\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:74\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:726\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75\n— filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:69\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:74\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:726\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75}

Do you see these errors with the new version too?

Yes.
I deleted the log and rebooted the node; no errors yet.
But yesterday's errors were already on this version of the node.

In addition, I recalculated the space, but the panel on the left, where storage is accrued in hours of storage, still shows very little volume.

Could you please measure the blobs folder? How long would it take?
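In case it helps, a minimal sketch of such a measurement (the blobs path is illustrative; replace it with your node's actual storage location). It walks the folder and reports the file count, total size, and elapsed time:

# Minimal sketch: time a full walk of the blobs folder.
# The path is illustrative; replace it with your node's actual blobs directory.
import os
import time

BLOBS = r"D:\storagenode\blobs"

start = time.monotonic()
files = 0
total_bytes = 0
for root, _dirs, names in os.walk(BLOBS):
    for name in names:
        try:
            total_bytes += os.path.getsize(os.path.join(root, name))
            files += 1
        except OSError:
            pass  # piece deleted mid-walk or unreadable entry

elapsed = time.monotonic() - start
print(f"{files:,} files, {total_bytes / 1e9:.1f} GB, {elapsed:.0f} s "
      f"({files / max(elapsed, 1):.0f} files/sec)")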

I checked a couple of days ago, when there was still a problem with the space recalculation.
It took about half a day; I can't say for sure. But I know for certain that if you start somewhere in the middle of the day, by the next morning everything has already been counted, and apparently finished a long time before.

I see, thank you.
For visibility: mine takes less than 2 hours. Usual HDDs, but no VM. Windows, NTFS, 10 TB in total.

1 TB = 10,000,000 files
(10,000,000 files / 120 min) / 60 = 1,388 files/sec
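A quick check of that arithmetic (both the piece count and the duration are rough figures from this thread):

# Rough sanity check: ~10,000,000 pieces walked in about 2 hours.
pieces = 10_000_000
minutes = 120
print(f"{pieces / (minutes * 60):.0f} files/sec")   # roughly 1,389 files/sec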

I don't quite believe NTFS is capable of this without a dedicated cache, but I'll check on my other nodes that have an SSD cache or direct disk access.
And when I have time, I'll run the check on this node as well.

I do not have a dedicated cache… but I have 32 GiB of RAM, and at least 20 GiB is always available (free).

For the file count: it matches.
With a cache it's in the tens of thousands of files per second.