Ok. I was able to reproduce in storj-up. Just need to keep it running for more than 24h.
@mahir Do you use the dedicated disk feature?
I observed this behavior too, with the filewalker finishing in milliseconds on one of my nodes, but I didn’t pay too much attention to it.
Ubuntu Server, 32 GB RAM, no dedicated disk, no lazy filewalker, no badger cache, 2 nodes on 2 drives, Docker, both with 3 TB+ of data.
I started the scan, and shortly after it finished, an update restarted the node and the scan started again, finishing very quickly. At the time I thought it was normal because the metadata was already cached in RAM. Now it seems there is a bug. I can’t remember if this happened before moving the databases to the storage drive, or after; they used to be on the OS M.2 SSD. Normally, the scan takes 7+ hours.
The quick scan is not a bug; it’s a feature to speed up the used-space-filewalker even more.
Do you have a discrepancy after it’s finished?
Didn’t check at that moment. It seemed right. I believe it loses a few GB, which are hard to spot on a 3 TB node; they are visible on new, small nodes.
Yes, this only confirms what I have tested.
The quick used-space-filewalker is not a bug.
However, there are some circumstances where a difference may occur.
What do you mean by the dedicated disk feature?
It’s running inside a VM with no extra disk mounted.
Tried the workaround on my node with ~410 GB of data, got 357 GB in the dashboard (didn’t help at all).
Seems like it’s showing the usage from after the first restart.
Is what I’m seeing the same problem?
And I’m using the badger cache on this node, btw.
“seems like it’s showing usage after first restart”: my nodes have the same behaviour.
If I check the databases for corruption they say OK, but something must have happened, because recreating them fixes the issue.
If you deleted a database and restarted the node, then you need to wait until the used-space-filewalker finishes the scan for all trusted satellites (search the logs for used-space and completed).
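Under Docker, a check along these lines should do it (the container name storagenode is just the common example; adjust it, or grep your log file instead if you redirect logs):

docker logs storagenode 2>&1 | grep used-space | grep completed
# one "completed" line per trusted satellite means the scan has finished everywhere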
No, what’s being discussed here is the difference between the usage on the pie chart and the usage reported by the OS, in your case 1.8 TB vs 1.4 TB.
The difference between the satellites’ reports and the actual usage is likely related either to missing reports from the satellites or to uncollected garbage (for example, your node didn’t receive, or hasn’t yet processed, the Bloom Filters from the satellites).
The quick used-space-filewalker is a feature, not a bug. But it may have a bug in the processing.
Does deleting the database eliminate the difference between 1.8TB (OS) and 1.4TB (dashboard)?
This is a new feature which allows the node to skip calculating the used space and instead use the usage reported by the OS directly. However, it is experimental, and the usage on the dashboard will be wrong.
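To compare what the OS reports with what the dashboard shows, something like this is enough (the mount point and storage path are assumptions; adjust them to your setup):

df -h /mnt/storagenode                 # filesystem usage as the OS sees it (roughly what the dedicated disk mode relies on)
du -sh /mnt/storagenode/storage/blobs  # space actually occupied by the stored pieces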
Upd: it all settled down after I opened the dashboard several hours later.
Now it shows the right disk usage.
No, it’s been more than a day since the restart, the filewalkers are done, and it’s still showing the 1.8 / 1.4 / 0.9 discrepancies. I’m using the badger cache, which I didn’t touch, btw.
I cannot reproduce this case yet.
Could you please try to recreate piece_spaced_used.db too?
Seems the same for me as well: every time the node is restarted, the used disk value reverts to the value it showed after the first restart. I’m using TrueNAS Scale.
To clarify, you want me to delete piece_spaced_used.db, restart, and see what happens?
Using this procedure:
because the node wouldn’t be able to recreate only one missing database on its own (there are migrations).
Okay, I did the sqlite3 database check (every one was OK), deleted BOTH piece_spaced_used.db and used_space_per_prefix.db, and the dashboard is now showing nearly zero disk space usage, but it’s going up as the filewalker runs.
So… we’ll see how that goes!
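For anyone who wants to repeat this, the steps amount to roughly the following sketch (container name and paths are examples only; adjust them to your setup and stop the node before touching the databases):

docker stop -t 300 storagenode
sqlite3 /mnt/storagenode/storage/piece_spaced_used.db "PRAGMA integrity_check;"      # should print "ok"
sqlite3 /mnt/storagenode/storage/used_space_per_prefix.db "PRAGMA integrity_check;"  # should print "ok"
rm /mnt/storagenode/storage/piece_spaced_used.db /mnt/storagenode/storage/used_space_per_prefix.db
docker start storagenode
docker logs --tail 20 storagenode    # watch for the used-space-filewalker starting and completing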