Filewalker not running when disabling lazyfilewalker

Make it configurable: default 5 GB, but be able to set whatever value you want.

I can tell the following:

One of my nodes on v1.104.5 shows 3.5 TB of trash according to the node's own data, but only 1.2 TB according to du.

The filewalker on startup is off, as I need to wait for the resume feature.
The lazy filewalker is on.

Now I don’t know how the database update is supposed to work, and whether the trash filewalker scans the size of the trash folder independently of the used-space filewalker. If that’s the case, then the size should match the du size, and that is obviously not the case here.

The trash cleanup doesn’t scan for disk usage. It just deletes the per-day directories and updates the database based on what it deleted.

Unless you run the used-space filewalker at least once, the actual and database usage will never match.
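If you want to double-check what is actually sitting in the trash on disk, something like this works (assuming the default layout with per-satellite and per-day directories under storage/trash; adjust the path to your node):

# total size of each per-day trash directory, grouped by satellite (path is an example)
du --si -d 2 /mnt/storj/storagenode/storage/trash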

I see, then it is natural that it does not match the du output.
Then let’s wait and see if the node’s trash space value changes.

Without running the used-space filewalker? The trash space will get updated by garbage collection and the trash cleanup, but neither of them scans the size of the trash folder. They only increment or decrement the value by the amount they moved into or out of the trash folder. Only the used-space filewalker will actually sync it with the file system.

Yes. I understand your answer to mean that the given space should be incremented or decremented when the updates are performed, so it would show whether the updates are working or not.

I do: ERROR filewalker failed to get progress from database - #24 by thelastspark

And because the normal filewalker doesn’t log anything, I have no idea whether it works or not either.

Take a look at your logs - I wonder if you are having the same errors as me (my post above)
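For example, something like this should surface them (a docker setup with a container named storagenode is an assumption; adapt it for a binary or service install):

# keep only error lines that mention the filewalkers or the databases (container name is an assumption)
docker logs storagenode 2>&1 | grep -i error | grep -Ei "filewalker|database"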

@littleskunk

If I just want to reset the dashboard, don’t care about the stats, and disable the filewalker, which DB should I delete? Is deleting the space DB enough? Will that remove the trash stats on the node?

With df --si, the OS says there is 16 TB of data on the drive, so 6 TB more than what the node dashboard says I’m storing. Storj is the only thing using this drive, and I don’t know how to figure out what is causing the discrepancy.

Please also note that the dashboard shows the allocated size and the free space in the allocation, not on the disk.
So actually you need to calculate the size of the data location to compare:

du --si -d 1 /mnt/storj/storagenode/storage

This also suggests that you likely have issues with the databases (they were not updated with the increment).
Please search for errors related to the databases in your logs. Perhaps it’s also better to check the databases themselves.
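For example, sqlite3 has a built-in integrity check you can run against every database file (the path below is the default data location from above; stop the node first or work on copies):

# run sqlite's built-in integrity check on every database file
for db in /mnt/storj/storagenode/storage/*.db; do
  echo "$db: $(sqlite3 "$db" 'PRAGMA integrity_check;')"
done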

And if you didn’t disable the scan on startup (it’s enabled by default) and do not have errors related to the filewalkers and databases, the used-space-filewalker should update your databases with the actual values. It runs only after a restart.
The used-space-filewalker should finish successfully for each trusted satellite, and you should not have data from untrusted satellites.
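A quick way to check is to list the blobs folder: each subdirectory there corresponds to one satellite ID in base32, so any folder that does not belong to a trusted satellite is leftover data (path assumed as above):

# one subdirectory per satellite ID (base32); extra folders are data from untrusted satellites
ls -lh /mnt/storj/storagenode/storage/blobs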

If you reset the DB, your dashboard will also show 0 used space. I don’t think that is what you want.

You can update the numbers that are in the DB to match whatever you want to see on the dashboard. That is possible.
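If you go that route, inspect the database with sqlite3 first; the file and table names below (piece_spaced_used.db, piece_space_used) are what I’d expect rather than something confirmed for your version, so verify them with .schema on your own node and only write to the database while the node is stopped:

# inspect the stored space-usage values first (file/table names are an assumption, verify with .schema)
sqlite3 /mnt/storj/storagenode/storage/piece_spaced_used.db '.schema'
sqlite3 /mnt/storj/storagenode/storage/piece_spaced_used.db 'SELECT * FROM piece_space_used;'
# once the schema is confirmed, you could UPDATE those rows, but only with the node stopped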

I will try to check the database health later this weekend.

I checked the logs, and I don’t see any “database is locked” or filewalker errors. The only error message I see is “piecestore download failed,” either due to a connection timeout or use of a closed network connection.

I see messages that the lazyfilewalker started and that at least some of them finished successfully.

This is the usual error when your node cannot keep up. Your node cannot be close to every customer in the world, so it’s pretty normal.
But if you have more than a 59% failed upload rate, it’s perhaps time to figure out why.
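To put a number on it, you can count the piecestore upload messages in your log; this is only a rough sketch, and both the log path and the exact message wording are assumptions that may differ for your setup:

# rough upload success rate from the log (log path and message wording are assumptions)
ok=$(grep -c "uploaded" /mnt/storj/storagenode/node.log)
failed=$(grep -c "upload failed" /mnt/storj/storagenode/node.log)
echo "uploaded: $ok  failed: $failed"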

Yes, but ‘close’ in terms of internet routing, not geography. Actually my only US node (IP location) is the worst performer for SLC test data.

I count 18 instances of the “piecestore download failed” message over 2 days, so if I wasn’t specifically filtering for them, they would be lost in all of the regular upload and download info messages. If that message is expected to occur occasionally on a properly operating node, I don’t see an issue with the frequency.

Yes. These messages are more like INFO.

Do you have any optimizations/conditions for this database?
It stayed at a little over 200 MB in size for over a year, until, on one of my nodes, I moved the database folder from the HDD (the same one where the pieces are stored) to an SSD in order to take some load off the overloaded HDD. And right from the moment of the restart with the changed config, this database began to grow like crazy: from ~250 MB to over 7,000 MB at the moment, in just 3-4 weeks.

I thought it was some kind of glitch; I even checked the integrity of the database (using the sqlite3 utility), but it passed the test. And when I opened it (exported it as text), I found almost 20 million valid entries there. At the same time, on another node which received a similar volume of traffic all this time, but where the database was left on the HDD, this database has stayed at about 300 MB and practically does not grow.

So I’m curious: is it a bug that a node with the database stored along the default path (the same one as /blobs/ and /trash/) ignores and does NOT write the TTL data? Or is it a deliberate optimization, where you intentionally skip recording the TTL information if the DB is stored on an HDD?
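For reference, this is roughly how I compare the two nodes: counting the rows in piece_expiration.db on each of them (the table name piece_expirations is my guess, so verify it with .tables first, and adjust the path on the node where the DB was moved to the SSD):

# count the stored TTL entries (table name is an assumption, verify with .tables)
sqlite3 /mnt/storj/storagenode/storage/piece_expiration.db 'SELECT COUNT(*) FROM piece_expirations;'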

Moving the DBs to an SSD is usually good advice. Besides that, I am not aware of any other tricks.

That is by design. If a node can’t write the DB file for whatever reason, it will still continue storing pieces. It will backfire: all the pieces that are missing from the TTL database will have to make their way through garbage collection.

It looks like you didn’t quite get me right.
The node CAN write to the DB, but it DIDN’T WANT to do so for some (as yet unknown) reason.

On the first node, the database was stored in the default path (no separate database-path parameter was set in the config, so it was stored in the same folder as /blobs/ and /trash/ on the HDD, taken from the general storage.path: parameter).
More precisely, the node was actually writing to it (I can see the timestamps and the size change a little every day), but in very small volumes.

But as soon as I moved the database from the HDD to the SSD, changed config.yaml accordingly (I explicitly specified the new DB path via the separate storage2.database-dir: parameter) and restarted the node to apply the changes, the volume of writes to piece_expiration.db almost immediately increased something like ~100x, and it has continued at a similar pace for the last few weeks. As far as I understand, this is because the current large test traffic from the Saltlake satellite all comes with a TTL set, so it is probably normal/right.

Whereas on the 2nd node, writing to this database still remains at a very low level.
Both nodes have a similar setup and are even located behind the same external IP, so they share/split the incoming traffic approximately equally. The only significant difference is that the database on the second node has remained on the HDD, in the default path together with the piece storage.

That’s why I asked this question: is this some kind of optimization trick of yours (maybe some kind of database access speed test?), where the TTL recording only happens if access to the database is fast enough and the information is otherwise discarded?

Or is there something strange/wrong about storing the database on the HDD (in the default path) that is worth looking into for potential bugs?
I have already checked the logs and have not found any errors regarding database access. It just feels like the node “DOESN’T WANT” to write the massive TTL data while the database is on the HDD.