Debugging space usage discrepancies

Wow: that’s like 600TB raw space? 30 x 20TB HDDs or so? Impressive! :ok_hand:

ZFS on top of a small bunch of ST5000LM000’s (=


What’s the reason for the restart? It should pick up right where it was before it was shut down/killed.

@d4rk4
Hmm… 120 x 5TB Barracuda 2.5" drives: SMR, 5400 RPM, 2.4W, $34/TB on Amazon.
The only nice thing about them is the power consumption.
How much power for the entire system?
Did any of them fail?

I second that. It should never start from the beginning after an interruption. It should periodically save its state and resume from the last saved state after an interruption.

Save it where? If to the databases, the data will be even more incorrect than it is now.
I suppose you’re suggesting storing it separately, like some kind of lock file? That may work, but I think it should also have an expiration time of no more than a few hours, maybe?


Yes.

Why only a few hours? As I see it, the used-space file walker is of low importance. You can even turn it off, so it does not have to be, and never will be, 100% accurate. And it is not needed for payout accounting. So I think I can easily live with a 48-hour expiration.

No, it will report wrong free space to the satellites until it completes. So your node may be selected for uploads and then reject them because there is actually no free space left; the customers would be significantly affected.

That’s true, but that’s no different from how it is today. If the file walker never completes, how is this better? If I turn it off completely, how is this different?

Perhaps you are right, but we want to make it better, not worse/the same?

I think the most important thing is that the FW completes at all. There is absolutely no sense in running the FW over and over and over when it never finishes. With an expiry set too low, it might never break out of that loop. But it doesn’t make sense at this stage to argue about hours; we should discuss the goals that we need to achieve.
And there I am saying: a FW that does not complete makes no sense in any way. Constant restarts from the beginning are one reason for that, so the engineers should find a way to solve that problem.
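For illustration only, here is a minimal sketch in Go (the storagenode’s language) of the “save progress periodically, expire stale state” idea being discussed. The checkpoint file, its fields, and the expiry rule are all hypothetical and not anything the current node implements:

package walkcheckpoint

import (
	"encoding/json"
	"errors"
	"os"
	"time"
)

// checkpoint is a hypothetical record of filewalker progress; the current
// storagenode does not write anything like this.
type checkpoint struct {
	SavedAt   time.Time `json:"saved_at"`
	LastPiece string    `json:"last_piece"` // last piece key fully accounted for
	UsedBytes int64     `json:"used_bytes"` // running total up to LastPiece
}

// loadCheckpoint returns saved progress if it exists and is younger than
// maxAge (the expiration discussed above); otherwise the walk starts over.
func loadCheckpoint(path string, maxAge time.Duration) (*checkpoint, error) {
	data, err := os.ReadFile(path)
	if errors.Is(err, os.ErrNotExist) {
		return nil, nil // no checkpoint: start from the beginning
	}
	if err != nil {
		return nil, err
	}
	var cp checkpoint
	if err := json.Unmarshal(data, &cp); err != nil {
		return nil, nil // corrupt checkpoint: ignore it and start over
	}
	if time.Since(cp.SavedAt) > maxAge {
		return nil, nil // expired: too stale to trust
	}
	return &cp, nil
}

// saveCheckpoint would be called periodically during the walk. The write is
// atomic, so an interruption never leaves a half-written file behind.
func saveCheckpoint(path string, cp checkpoint) error {
	cp.SavedAt = time.Now()
	data, err := json.Marshal(cp)
	if err != nil {
		return err
	}
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, data, 0o644); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}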


I hope that our tests regarding

would be successful. In that case, the scan time will become negligible.

This would be nice, but for that the file system would have to guarantee the order of the listed entries, and that is not guaranteed, IMHO…

Another shortcut that Apache Hadoop uses (for the same reason): if you dedicate a partition to every single storagenode (an LVM volume, for example), the partition-level free space query (like df -h) is usually very fast and can be used as an estimate…
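As a rough sketch of that shortcut (assuming Linux and a dedicated partition), the same numbers df -h shows can be read via statfs(2). The freeBytes helper and the /mnt/storj path below are just examples:

package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// freeBytes reports the free space (in bytes) available on the filesystem
// containing path, using statfs(2), the same data df -h shows.
func freeBytes(path string) (uint64, error) {
	var st unix.Statfs_t
	if err := unix.Statfs(path, &st); err != nil {
		return 0, err
	}
	return st.Bavail * uint64(st.Bsize), nil
}

func main() {
	// Example mount point; substitute the partition dedicated to the node.
	free, err := freeBytes("/mnt/storj")
	if err != nil {
		panic(err)
	}
	fmt.Printf("free: %.2f GiB\n", float64(free)/(1<<30))
}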


First-time poster, longtime lurker. So far all of the posts and information here have helped me solve a lot of things, but this is the first time I need to post something to get help, as I am totally out of ideas even though I’ve browsed so many topics.

I don’t know if it belongs in this topic or if it’s 100% related to this.

Right now I have assigned 5.3TB to the node, but the GUI is reporting 0.88TB free, which does not add up with the actual storage used.

/dev/sda1 ext4 6.0T 5.4T 302G 95% /mnt/storj

2024-01-19T07:54:23Z    INFO    lazyfilewalker.gc-filewalker.subprocess gc-filewalker completed {"process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "process": "storagenode", "piecesCount": 19444598, "piecesSkippedCount": 0}
2024-01-19T07:55:08Z    INFO    lazyfilewalker.gc-filewalker    subprocess finished successfully        {"process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}

And the strange thing is, whenever I restart the node, it just resets the storage used and jumps back up to 1.08TB free, every single time.

Slowly but surely I am getting worried that if this keeps up, the disk is going to fill up completely and I am going to get screwed.

I followed the other thread to see if that might be my issue, but it’s not exactly the same:
https://forum.storj.io/t/the-node-does-not-recheck-the-occupied-space-and-therefore-the-disk-is-full

Node is running v1.94.2.
storage2.piece-scan-on-startup: true is also set.


We have two file walkers, and this one is the GC filewalker. There is another one which calculates the used space; try to find that one in the log.

The results are supposed to be saved to the database. If they are not saved, the old value is used at the next restart, and that looks like what is happening here. The file walker might have crashed, or it couldn’t save the results to the database…

This certainly should be a configuration option. It would solve some of the issues we are currently experiencing, at least for nodes that are using the whole partition as storage.

So, it would look like this?

Storage space used = Disk capacity - free space - all other folders on disk excluding blobs

So the FW would only go through trash, orders, the databases, the free space, etc. That would take something like under a minute.
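Purely as a sketch of that arithmetic, assuming Linux and that the node has the partition to itself: total space minus free space minus the small non-blob folders. The function and folder names here are placeholders, not the node’s real layout:

package main

import (
	"fmt"
	"io/fs"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// estimateBlobUsage guesses the space taken by pieces without walking the
// blobs tree: total partition size minus free space minus everything else
// kept on the same partition (trash, orders, databases, ...), which is
// cheap to walk because those trees are comparatively small.
func estimateBlobUsage(storageDir string, nonBlobDirs []string) (uint64, error) {
	var st unix.Statfs_t
	if err := unix.Statfs(storageDir, &st); err != nil {
		return 0, err
	}
	total := st.Blocks * uint64(st.Bsize)
	free := st.Bavail * uint64(st.Bsize)

	var other uint64
	for _, dir := range nonBlobDirs {
		size, err := dirSize(filepath.Join(storageDir, dir))
		if err != nil {
			return 0, err
		}
		other += size
	}
	return total - free - other, nil
}

// dirSize sums the sizes of regular files under root.
func dirSize(root string) (uint64, error) {
	var sum uint64
	err := filepath.WalkDir(root, func(_ string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.Type().IsRegular() {
			info, err := d.Info()
			if err != nil {
				return err
			}
			sum += uint64(info.Size())
		}
		return nil
	})
	return sum, err
}

func main() {
	// Example paths and folder names; a real node's storage layout may differ.
	used, err := estimateBlobUsage("/mnt/storj", []string{"trash", "orders", "temp"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("estimated blob usage: %.2f TB\n", float64(used)/1e12)
}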


That’s a bit short-sighted thinking, and it exposes some manipulation issues… IMHO.

As I understand it, the FW is just for the operator to see the real numbers that the satellites see; it doesn’t influence the records on the satellites.
Did I understand that wrong?