I don’t know because it’s a colocation setup, I just rent a 42U rack.
About one drive every two months. Most of them died after ~2.5 years of uptime.
We can take advantage of the two-level directory structure here, and sort the two-letter directories on our own.
Remember the last two-letter directory scanned. When running walkNamespaceInPath, first call Readdirnames() to learn all two-letter directories, then sort the list alphabetically, and only then run the inner loop, starting from the directory after the remembered one. Update the database counter after every few two-letter directories, maybe not more often than every 5 minutes.
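A minimal sketch of that resumable walk in Go, assuming a hypothetical blobs layout of `<root>/<two-letter prefix>/<piece files>` and made-up helper names; the real walkNamespaceInPath works differently, this only illustrates the cursor idea:

```go
package walker

import (
	"os"
	"path/filepath"
	"sort"
	"time"
)

// walkSortedPrefixes walks the two-letter prefix directories in alphabetical
// order, skipping everything up to and including lastDone (the cursor read
// from the database), and persists progress via saveCursor at most once per
// saveEvery interval.
func walkSortedPrefixes(root, lastDone string, saveEvery time.Duration,
	walkPrefix func(dir string) error, saveCursor func(prefix string) error) error {

	entries, err := os.ReadDir(root)
	if err != nil {
		return err
	}
	var prefixes []string
	for _, e := range entries {
		if e.IsDir() && len(e.Name()) == 2 {
			prefixes = append(prefixes, e.Name())
		}
	}
	sort.Strings(prefixes) // a fixed order is what makes the cursor meaningful

	lastSave := time.Now()
	for _, p := range prefixes {
		if lastDone != "" && p <= lastDone {
			continue // already counted before the restart
		}
		if err := walkPrefix(filepath.Join(root, p)); err != nil {
			return err
		}
		if time.Since(lastSave) >= saveEvery {
			if err := saveCursor(p); err != nil {
				return err
			}
			lastSave = time.Now()
		}
	}
	return saveCursor("") // finished: clear the cursor
}
```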
We could also try reducing the mismatch caused by uploads/deletes that happen while the file walker is running, though the separate file walker process makes this cumbersome. It would go like this: for each upload/delete, check whether its two-letter directory has already been scanned; if it has, update the total. Setting up additional communication with the lazy file walker process might be quite cumbersome, though…
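And a rough sketch of that adjustment, again with hypothetical names; how the cursor would be shared across the process boundary is exactly the cumbersome part and is not shown here:

```go
package walker

// adjustForLiveChange applies an upload (+size) or delete (-size) to the
// walker's running total only if the piece's two-letter prefix has already
// been scanned, i.e. it sorts at or before the current cursor. Hypothetical
// helper; it assumes the cursor is somehow visible to the main process.
func adjustForLiveChange(cursor, piecePrefix string, sizeDelta int64, runningTotal *int64) {
	if cursor != "" && piecePrefix <= cursor {
		*runningTotal += sizeDelta
	}
}
```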
You don’t have a separate item for power on your colocation bill?
it will report much less used space than it does now. It should not update the databases with temporary results.
I agree with @Toyoo on taking advantage of the two-level directory structure. In fact, I’ve been thinking along these lines as well.
We can use the database as a means of communication since the lazyfilewalker subprocess also has access to the DB. Or we can pipe the data continuously to stdout just like we do for the logs until the parser detects the final data.
We can set it up so the result is reported as incomplete data (but serves as a cursor for the lazyfilewalker) and is not used until the lazyfilewalker is actually done with the remaining files.
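Purely as an illustration of the "incomplete data plus cursor" idea, the subprocess could write something like this into a shared table. The schema, names, and SQLite-style SQL are assumptions for this sketch, not the real storagenode database; the main process would ignore rows where completed is false when reporting used space:

```go
package progress

import "database/sql"

// saveProgress upserts the walker's intermediate state. completed stays false
// until the walk finishes, so partial_total is never mistaken for the real
// used-space figure. Table and column names are made up for this sketch.
func saveProgress(db *sql.DB, satelliteID, lastPrefix string, partialTotal int64, completed bool) error {
	_, err := db.Exec(`
		INSERT INTO used_space_progress (satellite_id, last_prefix, partial_total, completed)
		VALUES (?, ?, ?, ?)
		ON CONFLICT(satellite_id) DO UPDATE SET
			last_prefix   = excluded.last_prefix,
			partial_total = excluded.partial_total,
			completed     = excluded.completed`,
		satelliteID, lastPrefix, partialTotal, completed)
	return err
}
```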
Maybe the optimisation process in UltraDefrag on NTFS can sort them on disk in order; I see such settings under the Optimisation tab in the preferences.
That's not something we can depend on. The first file uploaded after the defrag completes may already place a new directory entry out of order.
Thanks! That pushed me in the right direction; I found multiple "context canceled" errors under the used-space-filewalker.
So I just moved the node to a more powerful host; the container was peaking at around 5 GB RAM during the used-space-filewalker process and went down to 400 MB once the process finished.
Other threads suggest the "context canceled" errors happen because the disk (Seagate IronWolf 6 TB) can't keep up, but with no SSD cache there was nothing left to do but let it consume the RAM just to complete the process.
2024-01-19T10:40:42Z INFO lazyfilewalker.used-space-filewalker.subprocess Database started {"process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "process": "storagenode"}
2024-01-19T10:40:42Z INFO lazyfilewalker.used-space-filewalker.subprocess used-space-filewalker started {"process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "process": "storagenode"}
2024-01-19T23:08:58Z INFO lazyfilewalker.used-space-filewalker.subprocess used-space-filewalker completed {"process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "process": "storagenode", "piecesTotal": 2583159110780, "piecesContentSize": 2573010075772}
2024-01-19T23:08:58Z INFO lazyfilewalker.used-space-filewalker subprocess finished successfully {"process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
Thank you elek!
was peaking at around 5 GB RAM during the used-space-filewalker process
This is usually an indication that the disk subsystem is too slow.
Is it a single disk? How is it connected?
This is usually an indication that the disk subsystem is too slow.
The mantra is: “Use what you have” and “Don’t invest”.
That means the storagenode software must adapt to all kinds of different setups and function properly.
I agree, so we are trying to collect as much information as possible, and we may be able to help solve the issue.
As a result, it might improve the node or our documentation, as we did there:
Please check:
Yeah, I assumed so as well: single disk, single node over USB 3.0, ext4.
I have multiple nodes with a similar setup and the same type of HDD, and only this node consumed that much during the used-space-filewalker process, so I really can't explain it other than classifying it as an anomaly; no faults detected via SMART or otherwise, and defrag has already been run.
This is from another node running the same type of HDD over USB 2.0, ext4, on a Raspberry Pi 3 B+; the container is limited to 400 MB RAM. The disk is already full, so it does not really have to deal with the filewalker running and accepting loads of data at the same time.
2024-01-19T20:11:31Z INFO lazyfilewalker.used-space-filewalker.subprocess Database started {"process": "storagenode", "satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "process": "storagenode"}
2024-01-19T20:11:31Z INFO lazyfilewalker.used-space-filewalker.subprocess used-space-filewalker started {"process": "storagenode", "satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "process": "storagenode"}
2024-01-19T22:53:08Z INFO lazyfilewalker.used-space-filewalker.subprocess used-space-filewalker completed {"process": "storagenode", "satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "process": "storagenode", "piecesTotal": 241228363008, "piecesContentSize": 241057498368}
so I really can't explain it
It's because of the USB connection. However, if you have this amount of RAM available, it should not be an issue.
The mantra is: “Use what you have” and “Don’t invest”.
That means the storagenode software must adapt to all kinds of different setups and function properly.
No, this means we should have better guidelines on the lowest reasonable specs a node requires, so that prospective node operators with low-spec setups don't go in expecting the node to work.
So will there be a solution to the problem?
TL;DR: with a high number of segments on one node, the bloom filter is less effective than it should be. A fix is on the way.
You can track the GitHub issue linked in the post above.
I have a node with about 54M. Maybe you need data from this node?
I have a node with about 54M. Maybe you need data from this node?
No, but thanks for the offer. We have enough data; the problem can be reproduced with the data already collected. We are waiting for the next release deployment; after that we will increase the size of the bloom filter.
It can be bumped up to 5 MB, which will slowly (!) clean up all the nodes. For faster cleanup we need more code changes (5 MB is the limit of a single request/response DRPC message, but we can switch to a streaming-based approach; it just needs more code changes).
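For context on why the filter size matters, here is the standard Bloom-filter false-positive estimate. The formula is generic textbook math and the numbers in the comment are only illustrative; they are not Storj's actual garbage-collection parameters:

```go
package bloom

import "math"

// falsePositiveRate approximates the false-positive probability of a Bloom
// filter with mBits bits, k hash functions and n inserted elements:
// p ≈ (1 - e^(-k·n/m))^k.
func falsePositiveRate(mBits, k, n float64) float64 {
	return math.Pow(1-math.Exp(-k*n/mBits), k)
}

// With a fixed filter size, the rate climbs quickly as the piece count grows,
// e.g. (illustrative numbers only, assuming a 2 MB ≈ 16 million bit filter
// and k = 7):
//
//	falsePositiveRate(16e6, 7, 1e6)  // ≈ 0.0007: fine for ~1M pieces
//	falsePositiveRate(16e6, 7, 54e6) // ≈ 1.0: nearly useless at ~54M pieces
//
// which is why a node with tens of millions of pieces keeps far more garbage
// than it should until the filter is made larger.
```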
So all we need to do is to enable FileWalker and keep the node up to date?
As always, I guess.
As always
Yes, exactly…