Disk usage discrepancy?

Hi!

On my system (Raspberry Pi 3 with Ubuntu server), most of my Storj disks are full, as we can see with df -h:

Filesystem      Size  Used Avail Use% Mounted on
/dev/root       7.0G  3.5G  3.2G  53% /
devtmpfs        1.8G     0  1.8G   0% /dev
tmpfs           1.9G     0  1.9G   0% /dev/shm
tmpfs           768M  9.4M  759M   2% /run
tmpfs           5.0M  4.0K  5.0M   1% /run/lock
/dev/sdf1       253M   49M  204M  20% /boot
/dev/sdb1       2.7T  2.6T  441M 100% /media/storj3
/dev/sde1       7.3T  6.9T  474M 100% /media/storj5
/dev/sdd1       2.7T  2.6T   21G 100% /media/storj1
/dev/sda1       3.6T  3.4T  550M 100% /media/storj4
/dev/sdc1       3.6T  2.1T  1.4T  61% /media/storj2
tmpfs           384M     0  384M   0% /run/user/1000

But the Storj dashboards display that there is a lot of free space (screenshot below for sdd1):
[dashboard screenshot for /dev/sdd1]

What should I do to fix this discrepancy?
Can I flush some old data on the storage disk used by the Storj container?

Thanks for your help!

Enable the filewalker if it’s disabled, switch off the lazy one, fix FATAL errors if any exist, and restart the node. Let it run like this for at least a month to remove all the garbage; once the discrepancy is eliminated, you may try to revert the changes and monitor.

The main problem is an interrupted filewalker. Any restart interrupts it (it will then start from the beginning), so it needs to finish its work to update the databases with correct numbers and to remove the garbage.
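
To look for FATAL errors before restarting, you can search the logs. A minimal sketch, assuming a Docker setup with the container named storagenode and logs not redirected to a file:

# show the most recent FATAL entries from the node log
docker logs storagenode 2>&1 | grep FATAL | tail -n 20

If nothing shows up, you can go ahead and restart the node.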

Sorry, how can I enable or disable the filewalker?
I saw some threads about it but I’m confused and a little bit lost.

How is this possible when it auto-updates every couple of weeks on average?

https://forum.storj.io/t/tuning-the-filewalker/19203?u=snorkel

https://forum.storj.io/t/release-preparation-v1-78/22472/6?u=snorkel

Change these 2 settings in config.yaml, save it, then stop, remove, and start the node:

storage2.piece-scan-on-startup: true|false
pieces.enable-lazy-filewalker: true|false

The first setting enables or disables the filewalker; the second one enables or disables the lazy mode, aka the low-priority mode. Lazy mode runs the FW with low priority, which keeps the system responsive but takes more time to complete. Turning it off runs the FW with normal priority, completing the run faster but making the system slower to respond.
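
For a Docker node, applying the change could look roughly like this (a sketch, assuming the container is named storagenode; the last line stands for your usual full run command with all its mounts and parameters):

docker stop -t 300 storagenode
docker rm storagenode
docker run ...   # your usual run command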


If you do not know how to disable the filewalker, you likely didn’t disable it; it’s enabled by default.
But you may try to disable the lazy one to give it more priority and a better chance to finish earlier and successfully.

# Use the lazy filewalker, it's true by default
pieces.enable-lazy-filewalker: false

The filewalker should finish its work within 2 weeks; otherwise your node likely has a bigger issue - the disk subsystem is incredibly slow and needs investigation. It could be a dying disk, a disk overloaded with something else, or an inefficient filesystem.
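
If you suspect the disk itself, a quick health check could look like this (a sketch, assuming smartmontools is installed and /dev/sdd is the disk behind the slow node; replace with your device):

sudo smartctl -H /dev/sdd                                 # overall SMART health verdict
sudo smartctl -A /dev/sdd | grep -i -E "realloc|pending"  # growing reallocated/pending sector counts hint at a dying disk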

Thanks!

Is it normal that I don’t see these 2 settings in my current config.yaml file?

The lazy filewalker and the skip-startup-scan functionality were added just recently.
If you ran the setup command (which creates the node configuration file) before this was implemented, then these options are missing in your configuration file.
These are of course supported in recent storagenode software versions, so you can add them manually.
They would also be present if you ran the setup command on the current storagenode software version, but do not do that on an existing node.


thanks!
So, in order to fix the discrepancy, I should add these 2 lines to my config.yaml file?

storage2.piece-scan-on-startup: true
pieces.enable-lazy-filewalker: false

And then wait for 2 weeks?

How can I make sure it is really enabled and the “cleaning” is working well?

You will also need to restart the node after changing the configuration file.
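
On a Docker setup (assuming the container is named storagenode), a graceful restart could be:

docker restart -t 300 storagenode   # give the node up to 300 s to shut down cleanly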

Basically this will cause the node to go over all the files stored on the node and count their total size. Once done, it should update the storagenode databases to reflect what is actually stored on the node.
Disabling the lazy filewalker means it will run with standard I/O priority and not as a separate process with lower I/O priority.
You should be able to tell this is working by looking at the drive utilization - it should be pretty much at 100% until the node has gone over all the files stored.
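
One way to watch that utilization, assuming the sysstat package is installed:

iostat -dx 5   # extended device stats every 5 s; check the %util column for the node's disk (e.g. sdd)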

I also believe it should be mentioned in the log, some log message with “walk” in it.
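
A minimal way to check (assuming Docker and the default logging, i.e. logs not redirected to a file):

docker logs storagenode 2>&1 | grep -i walk | tail -n 10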

And you probably should take a look at how to remove the decommissioned satellites data if you haven’t already, ideally before running the filewalker:


You can see all commands with this:

docker exec -it storagenode /app/storagenode help

Just run it in a terminal while the storagenode is running; no need to stop it. Also, it may need sudo, I can’t remember.

To get the updated config.yaml (you don’t need it), install a new node with a different name, different paths, and a different identity. Just running the setup step gives you the new config, I believe. But I’m not a Linux expert, so I don’t know how to get rid of that installation afterwards.
You can also wait a little, because I will install a few nodes today and I will post the new config.
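
For reference, the Docker setup step looks roughly like this (a sketch based on the standard install instructions; the paths and the container name are placeholders and must point to a new, empty identity and storage location, never to an existing node’s data):

docker run --rm -e SETUP="true" \
    --mount type=bind,source=/path/to/new/identity,destination=/app/identity \
    --mount type=bind,source=/path/to/new/storage,destination=/app/config \
    --name storagenode-setup storjlabs/storagenode:latest

The generated config.yaml should end up in the storage directory mounted to /app/config.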

My first node is full, but not according to the sats’ data. Very big difference. Who pays for my unaccounted used space? And when will it stop receiving data?
The filewalker has been on for 2 weeks now, and the node had 3 restarts for some maintenance. The FW finished each time. All my other 8 nodes display big differences and are almost full.


Here is my cleaned fast node, for comparison.

+~50-60 GB per day is the growth rate (my older node is nearly full; here comes the pic)

Could it just be the 1024 vs 1000 factor?

On Windows, only look at the actual bytes. In the first picture Windows is reporting 3.094 TB and in the second one 9.841 TB, which looks close to the dashboard.
Microsoft really didn’t do anyone a favour by writing MB, GB, TB while calculating in MiB, GiB, TiB.
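
To put rough numbers on that (treating the Windows figures as binary TiB and the dashboard figures as decimal TB):

3.094 TiB = 3.094 × 1024^4 bytes ≈ 3.40 × 10^12 bytes ≈ 3.40 TB
9.841 TiB = 9.841 × 1024^4 bytes ≈ 10.82 × 10^12 bytes ≈ 10.82 TB

So a ~10% gap between the two displays is just the unit convention, not missing data.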

I think this small discrepancy between 12.69 TB and 14.4 TB (= 1.71 TB)
doesn’t indicate anything wrong at this point, because:
last month the ingress was around 2-2.5 TB (judging by all my nodes).
And how much of this data was deleted? And how soon after landing on your HDD?
Not sure if we can know this.
My understanding is that, because there can be huge deletes every month,
there will always be some discrepancy in the Used parameter, because
the FileWalker on data as large as 14.4 TB can take several days to complete
and won’t always be able to catch that right away.

The good news is, I think, that the fuller your node gets, the smaller that discrepancy should be,
I guess.

Although I think we should get an official statement about this, @bre,
since there are several topics about that discrepancy and people seem to be confused.

Additionally, I think we should get new indicators in the node dashboard:

Showing us when the filewalker was last completed on each satellite,
just like the online time, for example:

“FileWalker status: us1 completed: 72h ago”,
or:
“FileWalker: us1 counting… (don’t turn off the node now)”

And the FW should set priorities and start from the satellites that are least up to date!


“FileWalker: us1 working, 10% complete… (don’t turn off the node now)”

I’m still confused about the relation between the FW (filewalker), the sats, and the space used…
The FW reads the used space (files, metadata, parts of files, whatever) and finishes in, let’s say, 12 hours, at 20:00 o’clock.
Then it tells satellite US1 that at 20:00 the space used is 10 TB.
The satellite updates its database with the new value and starts counting new files stored from 20:00.
But during the time the FW is working, files are deleted to trash and new files are added to the node. (They are not files but pieces of files; it doesn’t matter.)
So, I wonder whether the FW takes notice, in those 12 hours of running, of the files deleted and added?
From what I see, I believe it doesn’t, and this causes the discrepancy, which increases with each FW run.
In this case, the FW causes more problems than it solves.

Solution 1: get rid of the FW and trust that the node doesn’t lose files.

Solution 2: when the FW runs, the node pauses adding and deleting files.

Solution 3: when the FW runs, new files are added to a temp folder and deletes are paused; after the FW finishes with the main folders, it quickly counts the temp folder, completes the pending deletions, does the math for the total space used, like “main folders + temp folder - deletes”, and sends the results to the satellites. Then normal activity resumes, the files from the temp folder are moved to their final folders, etc.

Solution 4: do mini FW runs on individual folders and send the result to the sats after each folder, but I’m not very clear about this method, and I think it would complicate things for the satellites by increasing the database exponentially.

And the fact that the FW creates more discrepancy with each run is confirmed by my nodes; I have 2 nodes with 1 GB RAM that have had the FW off for about a year, I believe, and the discrepancy between what the satellites report as used space and what the dashboard/OS reports has stayed the same all this time.
On nodes with plenty of RAM where I enabled the FW, the discrepancy increases with each node restart/FW run.
Storj should address this by modifying the code in some way, or by adding some recommendation to the official documentation like: don’t start nodes bigger than XX TB if you can’t provide a form of cache to accelerate the FW runs, or add this type of cache for this xxx (databases, logs, etc.) in this size xxx for nodes bigger than XX TB.

Maybe it’s this way?