The node does not recheck the occupied space and therefore the disk is full

It does matter how the disk is connected, since the node hits timeouts (context canceled) while the filewalker runs.
Please describe your setup in more detail. What is the host, which hypervisor do you use, and how did you pass the disk through to the VM?

I have many nodes and they are all set up the same way.
Hypervisor: Hyper-V
Virtual hard disk: VHDX
The average response time is 3-4 ms; even during defragmentation it peaks at no more than 150-200 ms.
I have rebooted it many times; the disk is healthy, and this deployment scheme has been proven working over many years.

I can enable debug logging and send the log file to technical support if that helps…


Yes, but at what cluster size?

Guys, I wrote that the node displays only 300 GB, while 1,300 GB is actually occupied. Even with a standard 4 KB or 64 KB cluster size, a node this small would still work perfectly.
It is not possible for NTFS metadata to occupy a whole terabyte. Besides, my actual size and occupied space are almost the same (here is an example of a small file):
Size 1.02 MB (1,071,360 bytes)
On Disk 1.03 MB (1,081,344 bytes)

To test your assumption, suppose there are 2,000,000 files on the disk and each wastes 0.01 MB (I rounded this up with a margin): 2,000,000 × 0.01 MB = 20,000 MB, i.e. only about 20 GB of theoretical overhead for ~1 MB files.
The node is relatively young and does not hold those old kilobyte-sized files; it was created this year.
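As a rough sanity check, the arithmetic above can be spelled out (the figures are the same deliberately pessimistic assumptions: 2,000,000 files, 0.01 MB wasted per file):

```python
# Upper bound on cluster-size overhead (hypothetical, pessimistic figures
# from the estimate above).
files = 2_000_000
waste_per_file_mb = 0.01  # rounded up with a margin

total_waste_gb = files * waste_per_file_mb / 1000
print(f"theoretical overhead: {total_waste_gb:.0f} GB")  # prints "theoretical overhead: 20 GB"
```

Even under these generous assumptions the slack is ~20 GB, nowhere near the 1,000 GB discrepancy in question.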

I am saying that at startup the node reads at several megabytes per second for a couple of minutes, and after that the speed drops to at most 0.5 MB/s; it barely loads the disk. Meanwhile the response time stays within 3-5 ms.

Yes, saying that you use NTFS is enough. exFAT uses a 128 KiB cluster size, but you are unlikely to have changed the default cluster size for NTFS, so cluster size is not the issue.
In your case the only problem is that the filewalker and, it seems, retain (search the log for “ERROR”/“failed” and “retain”) cannot finish their work, so deleted data is not cleaned up from your node.
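A minimal sketch of that log search (the filename `storagenode.log` is a placeholder; point it at your node's actual log):

```python
# Sketch: find log lines where retain work failed, i.e. lines mentioning
# "retain" together with "ERROR" or "failed".
def find_retain_errors(lines):
    """Return lines that mention retain alongside an error marker."""
    return [
        line.rstrip()
        for line in lines
        if "retain" in line and ("ERROR" in line or "failed" in line)
    ]

# Example usage (path is an assumption, adjust for your setup):
# with open("storagenode.log", encoding="utf-8") as f:
#     for hit in find_retain_errors(f):
#         print(hit)
```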

Did you remove the decommissioned satellites?

Since this problem often happens with virtual setups, network storage, or SMR drives, I am convinced that the problem is the huge response time of the storage when the filewalker tries to traverse all pieces on the disk.
Usually adding memory for the file cache helps. Alternatively you could use PrimoCache or enable write caching, but then you must have a UPS. Another option is tiered storage with an SSD.

You may also try disabling the lazy filewalker; the scan could finish faster at the higher priority.

We managed to recalculate the space.
Version: v1.93.0-rc
The config had:

storage2.piece-scan-on-startup: true
pieces.enable-lazy-filewalker: false   # lazy filewalker disabled

The entire process ran at a disk speed of about 512 KB/s, with minimal latency.

It would be nice if the configuration offered a full-speed scan option. It is not needed for everyday operation, but for repair work it would be good to run the scan at maximum priority.

After the successful recalculation I restored the configuration to its previous state, and I see the errors again.

2023-11-29T07:06:20+02:00 ERROR piecestore download failed {Piece ID: 6B6CLZZKLWQEHU7PHHP7FL322ZIC4K4GNESMRGR6EZWUVRWCIYBA, Satellite ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S, Action: GET, Remote Address: 79.127.203.193:59832, error: trust: rpc: tcp connector failed: rpc: context canceled, errorVerbose: trust: rpc: tcp connector failed: rpc: context canceled\n\tstorj.io/common/rpc.HybridConnector.DialContext.func1:190}
2023-11-29T07:06:41+02:00 ERROR pieces failed to lazywalk space used by satellite {error: lazyfilewalker: exit status 1, errorVerbose: lazyfilewalker: exit status 1\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:83\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:105\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:717\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75, Satellite ID: 121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6}
2023-11-29T07:06:41+02:00 ERROR lazyfilewalker.used-space-filewalker failed to start subprocess {satelliteID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S, error: context canceled}
2023-11-29T07:06:41+02:00 ERROR pieces failed to lazywalk space used by satellite {error: lazyfilewalker: context canceled, errorVerbose: lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:71\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:105\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:717\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75, Satellite ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S}
2023-11-29T07:06:41+02:00 ERROR lazyfilewalker.used-space-filewalker failed to start subprocess {satelliteID: 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs, error: context canceled}
2023-11-29T07:06:41+02:00 ERROR pieces failed to lazywalk space used by satellite {error: lazyfilewalker: context canceled, errorVerbose: lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:71\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:105\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:717\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75, Satellite ID: 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs}
2023-11-29T07:06:41+02:00 ERROR piecestore:cache error getting current used space: {error: filewalker: context canceled; filewalker: context canceled; filewalker: context canceled, errorVerbose: group:\n— filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:69\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:74\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:726\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75\n— filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:69\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:74\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:726\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75\n— filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:69\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:74\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:726\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75}

Do you see these errors with the new version too?

Yes.
I deleted the log and rebooted the node; no errors yet.
But they were present yesterday on this version of the node.

In addition, I recalculated the size, but the left side of the dashboard, where storage is accrued in hours of storage, still shows too little volume.


Could you please time a scan of the blobs folder? How long does it take?

A couple of days ago I checked, while the volume-recalculation problem was still present.
It took about half a day; I can’t say exactly. But I know for sure that if you start it somewhere in the middle of the day, by the next morning everything has been counted, and apparently long since finished.

I see, thank you.
For comparison: mine takes less than 2 hours. Ordinary HDDs, but no VM. Windows, NTFS, 10 TB in total.

1 TB ≈ 10,000,000 files
(10,000,000 files / 120 min) / 60 ≈ 1,389 files/sec
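The throughput estimate above, written out (assumed figures: ~10,000,000 pieces walked in under 120 minutes):

```python
# Estimate traversal throughput from the reported scan time.
# Figures are the assumptions above: ~1 TB of pieces, ~10M files, 120 minutes.
files = 10_000_000
minutes = 120

files_per_sec = files / (minutes * 60)
print(f"{files_per_sec:.0f} files/sec")  # prints "1389 files/sec"
```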

I don’t quite believe NTFS is capable of this without a dedicated cache, but I’ll check on other nodes that have an SSD cache or direct disk access.
And when I have time, I’ll run the check on this node as well.

I do not have a dedicated cache…, but I have 32 GiB RAM, and at least 20 GiB is always available (free).


For the file count: it matches.
With a warm cache it reaches tens of thousands of files per second.

Found in my logs: the lazy filewalker finished in around 6 hours in total.

2023-11-25T22:08:12Z    INFO    lazyfilewalker.gc-filewalker.subprocess gc-filewalker started   {"process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "process": "storagenode", "createdBefore": "2023-11-21T17:59:59Z", "bloomFilterSize": 1854803}
2023-11-25T22:18:42Z    INFO    lazyfilewalker.gc-filewalker    subprocess finished successfully        {"process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2023-11-29T18:11:10Z    INFO    lazyfilewalker.gc-filewalker.subprocess gc-filewalker started   {"process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "process": "storagenode", "createdBefore": "2023-11-22T17:59:59Z", "bloomFilterSize": 2097155}
2023-11-29T23:48:59Z    INFO    lazyfilewalker.gc-filewalker    subprocess finished successfully        {"process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2023-11-30T16:10:43Z    INFO    lazyfilewalker.gc-filewalker.subprocess gc-filewalker started   {"process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "process": "storagenode", "createdBefore": "2023-09-19T17:00:07Z", "bloomFilterSize": 364476}
2023-11-30T16:18:31Z    INFO    lazyfilewalker.gc-filewalker    subprocess finished successfully        {"process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2023-12-01T09:36:57Z    INFO    lazyfilewalker.gc-filewalker.subprocess gc-filewalker started {"process": "storagenode", "satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "bloomFilterSize": 566263, "process": "storagenode", "createdBefore": "2023-11-27T17:59:58Z"}
2023-12-01T09:55:05Z    INFO    lazyfilewalker.gc-filewalker    subprocess finished successfully    {"process": "storagenode", "satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2023-12-02T21:42:46Z    INFO    lazyfilewalker.gc-filewalker.subprocess gc-filewalker started {"process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "process": "storagenode", "createdBefore": "2023-11-28T17:59:59Z", "bloomFilterSize": 1849134}
2023-12-02T21:53:43Z    INFO    lazyfilewalker.gc-filewalker    subprocess finished successfully    {"process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
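Summing the started/finished pairs from the log excerpt above confirms the rough figure (timestamps copied from the five gc-filewalker runs shown):

```python
from datetime import datetime

# Started/finished timestamps of the five gc-filewalker runs logged above.
runs = [
    ("2023-11-25T22:08:12Z", "2023-11-25T22:18:42Z"),
    ("2023-11-29T18:11:10Z", "2023-11-29T23:48:59Z"),
    ("2023-11-30T16:10:43Z", "2023-11-30T16:18:31Z"),
    ("2023-12-01T09:36:57Z", "2023-12-01T09:55:05Z"),
    ("2023-12-02T21:42:46Z", "2023-12-02T21:53:43Z"),
]

fmt = "%Y-%m-%dT%H:%M:%SZ"
total = sum(
    (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()
    for start, end in runs
)
print(f"total: {total / 3600:.1f} hours")  # prints "total: 6.4 hours"
```

Most of that time is the single long run against the 12EayRS… satellite (about 5 hours 38 minutes).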


“Average Disk Space Used This Month” does not change.
What should I do about it?

Until the filewalker finishes its work, nothing will change.

It finished and re-read the size.
The right side shows correctly, but the left side does not change.

It also should move part (up to 90%) of the deleted data to the trash. The left side is what your node reported as actually used; everything above that is deleted data, which should be moved to the trash by the garbage collector (gc-filewalker) and the expiration removal (retain).