Space used - significant discrepancy

I don’t think this is the usual “binary (TiB) vs decimal (TB)” units issue, because the discrepancy is too big.

One of my nodes is set to a 2TB allocation, and it has filled up most of that. First question, why does the graph on the left disagree with the pie chart to such a large degree? Even subtracting out trash, that doesn’t add up.

Secondly, the larger issue: df -h reports the usage as 2.9TB. Where is the extra usage coming from? I tried to run du, but there are so many files that du takes an astronomical amount of time to go through either the trash or blobs folder. 5 of my 6 nodes are like this. The other four affected nodes have 4-5TB allocations and 1.2-1.4TB used according to the dashboard (slightly lower on the graph), but df reports 1.8-2.1TB used for each one. This is a major problem for me, because the payouts page seems to imply that I’m getting paid for the lower value. It’s also a problem because I risk running out of space on the underlying disk, even if I under-allocate by a fair bit.

This particular one is on a single-disk zpool, and df reports it as:

$ df -h /mnt/storj/a
Filesystem      Size  Used Avail Use% Mounted on
storage/stj-a   3.6T  2.9T  691G  81% /mnt/storj/a

Edit: It’s not trash. The trash folder size is 30GB - less than reported on the dashboard.

A few things to consider:

  1. Check the databases for consistency; they may be corrupted.
  2. Check the filesystem for consistency.
  3. Get the filewalker to run, so it accounts for everything correctly.
  4. df is not a good measure here, because used space is quantized by the sector size and the vast majority of the files are very small, so on average each file wastes about half a sector (unless you use a filesystem like ZFS with compression enabled, which reclaims that slack). A node stores millions of files, most of them tiny, so the waste can be noticeable (gigabytes). I’ve just checked: the compression ratio on my 1TB node is 1.57. Given that Storj data is encrypted and incompressible, had the filesystem not supported compression, the space usage would have been that much higher. I strongly recommend using an appropriate filesystem for the job (see the example after this list).
  5. What the storagenode displays on the dashboard has no bearing on your payout (it would be ripe for abuse if it did). The satellites do all the accounting.
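For point 4, here is how you could check this on your own ZFS dataset (the dataset name pool1/storagenode is only a placeholder for whatever your node uses):

Check whether compression is enabled and what it is currently saving:

zfs get compression,compressratio pool1/storagenode

Enable lz4 if it is off (this only affects data written from now on; existing records are not rewritten retroactively):

zfs set compression=lz4 pool1/storagenode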

My personal approach is to only monitor the node’s availability at its port and not pay attention to anything on the dashboard. The dashboard is a counterproductive distraction. In fact, early on Storj relied on similar databases to account for the chunks stored, and that turned out to be an unreliable disaster. This approach is now only used to keep inconsequential statistics, and chunks are stored in a form of filesystem-based CAS. So I choose to ignore the dashboard.


It’s already on ZFS, so FS corruption shouldn’t be an issue. I’ll try turning compression on, but I know it won’t apply retroactively. The question is: can I safely go through each file and dd it onto itself to force ZFS to rewrite it? The usual “copy, delete original, move copy over original” isn’t a good idea if the data might be read or written in the meantime.

I restarted the node; that should be enough to get the filewalker to run, right?

Also, is the storagenode smart enough to take the actual filesystem space limit into account? My concern is that if I have a 3.6TiB drive and tell Storj to use 3TiB of it, but run into an issue like this, will I risk corruption when the 3TiB of logical data tries to use more than 3.6TiB of physical space?

Hello @mattventura ,
Welcome to the forum!

After checking the databases and the filesystem, you need to let the filewalker run. To force it to start, you may restart the node. Your node should also be online when the satellites send a bloom filter for the garbage collector; it moves unused pieces to the trash, and they are deleted after a week.
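For example, on a Docker setup (the container name storagenode is an assumption; use your own), a restart that gives the node time to shut down cleanly would look roughly like this:

docker restart -t 300 storagenode

The filewalker should then run on startup, unless you have disabled it.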

To check free space in SI units you need to use df --si. Note that this command shows free disk space, not free space in the allocation.
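For example, for the mount from the first post:

df -h /mnt/storj/a     # binary (IEC) units: 3.6T here means 3.6 TiB
df --si /mnt/storj/a   # SI (decimal) units: the same space shows as roughly 10% more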

Copying file by file will take ages.

When I added a special device and needed to “apply” the new configuration to existing data, I used an iterative zfs send | zfs receive approach: catch the new dataset up with the current state while the node keeps running, then stop the node, sync the small remaining difference, and continue the node from the new dataset. That kept the downtime to literally a few seconds.

From my notes:

Original dataset the node is running from: `source`
New dataset: `target`

- Create a recursive snapshot:
        zfs snapshot -r pool1/source@cloning1

- Send it to the new dataset with the node still running:
        zfs send -Rv pool1/source@cloning1 | zfs receive pool1/target

- Do it again to catch up on changes since the previous snapshot:
        zfs snapshot -r pool1/source@cloning2
        zfs send -Rvi pool1/source@cloning1 pool1/source@cloning2 | zfs receive pool1/target

- Stop the node:
        iocage stop storagenode

- Create the final snapshot:
        zfs snapshot -r pool1/source@cloning3

- Send the final incremental snapshot:
        zfs send -Rvi pool1/source@cloning2 pool1/source@cloning3 | zfs receive pool1/target

- Change the mount points in the jail:
        iocage fstab -e storagenode

- Start the node:
        iocage start storagenode

Yes, if you haven’t disabled it, it should run. You will notice it by high IO pressure on your cache device and CPU usage on the node. I would also still check the consistency of the databases (there was an article on how to do that).
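A minimal sketch of such a check, assuming the node is stopped, sqlite3 is installed, and the databases sit in a storage/ subdirectory under the mount from the first post (adjust the path to your layout):

for db in /mnt/storj/a/storage/*.db; do
    echo "$db"
    sqlite3 "$db" "PRAGMA integrity_check;"
done

Each database should answer “ok”; anything else points at a damaged file.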

Good question. I don’t know. But I personally don’t rely on that and always keep way more space than necessary (at least a few extra TB at all times).

Yes, it’s smart enough; you may check in the log that it considers both the free space in the allocation and the actual free space on the disk before accepting an offered piece. Another precaution: the node should stop accepting pieces if there is less than 500MB left in the allocation (or on the disk, whichever is smaller).
However, we could always introduce a bug, so it’s better not to allocate all of the free space on the disk.

You can take the amount of free space reported by df -h (which prints binary units) and specify that same number as the allocation in the node’s config, where it is interpreted in SI units; that gives you roughly a 10% safety margin.
So, for 3.6TiB of free disk space you may safely specify 3.6TB (i.e. SI units) as the allocation.
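That roughly 10% margin follows directly from the unit definitions; a quick sanity check with bc:

echo "3.6 * 2^40 / 10^12" | bc -l

This prints about 3.96, i.e. 3.6 TiB is about 3.96 TB, so allocating 3.6 TB (SI) leaves roughly 0.36 TB of headroom on the disk.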

To see the difference between the data usage and the actual disk usage, you may use these commands:
The data usage:

du -s --si --apparent-size /pool1/storj/storagenode/blobs

The actual disk usage:

du -s --si /pool1/storj/storagenode/blobs

Let me guess: you are using a zvol and RAIDZ1?

I doubt it’s a database inconsistency thing.

If I were to hazard a guess, your ashift is set to 11, which makes ZFS write 2K blocks while your HDD might only be able to write 4K blocks…

This would lead to write amplification… this or something similar would be my gut guess.

Please post your disk specs and your ZFS pool’s ashift configuration (commands to check are below).

It could also be something like a zvol using a 512B block size on a pool with an ashift higher than 9.

ashift is a power of two, so each step doubles the sector size:
ashift 9 = 512, 10 = 1024, 11 = 2048, 12 = 4096 (4Kn).
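To gather the ashift and sector-size details asked for above, something like the following should work (the pool name pool1 is a placeholder, and exact zdb behaviour differs a bit between platforms):

zpool get ashift pool1                 # pool-level ashift property on recent OpenZFS; 0 means auto-detected
zdb -C pool1 | grep ashift             # per-vdev ashift recorded in the pool configuration
lsblk -o NAME,PHY-SEC,LOG-SEC,MODEL    # physical vs logical sector sizes the drives report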

I see the issue with both a single-disk pool and a 4-disk RAIDZ1.

They’re both set to 4k ashift, but one of the zpools is made of 4kn drives while the other is 512e drives, so performance would be awful if I used 512 ashift.

A 4K ashift works fine with 512B drives… because ZFS simply writes across multiple sectors.

Write amplification usually results when it’s the other way around… like, say, attempting ashift 9 / 512B sectors on a 4Kn drive…

Then it would write 512B per 4K sector, leading to roughly 800% write amplification.

For the RAIDZ1 this is totally understandable. It has to do with the volblocksize and the parity cost of zvols.

Short explanation: use datasets instead of zvols; they don’t have a fixed block size (example commands after the link below).
Long explanation: RAIDZ — OpenZFS documentation
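To see which case you are in and what block size applies, you can query the properties directly (the dataset and zvol names are placeholders):

zfs get volblocksize pool1/some-zvol     # a zvol has a fixed block size, chosen at creation
zfs get recordsize pool1/some-dataset    # a dataset's recordsize is only an upper bound; small files use smaller records

On RAIDZ, a small fixed volblocksize is what drives the parity and padding overhead described in the linked page.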

If you also see this issue with a single-disk pool, you messed up the ashift.

Are you sure about that? I thought 512e drives were long gone, except for some niche enterprise drives?

Yeah, but the other way round it would lead to parity overhead and fragmentation.
Setting it too low is only a performance problem; setting it too high is a fragmentation, performance, and space-efficiency problem. That is why hypervisors like Proxmox use 8K as the default volblocksize and not 16K, even though there are lots of 16K+ reads and writes in a VM.


Well, it’s a somewhat old drive, HDN726040ALE614:

NAME   OPT-IO MIN-IO PHY-SEC LOG-SEC MODEL
sdg         0   4096    4096     512 HGST_HDN726040ALE614
├─sdg1      0   4096    4096     512
└─sdg9      0   4096    4096     512

The datasheet, on page 19, says that the physical sector size is 4K.
There is a 512 emulation mode for backwards compatibility and a native 4k mode.

I don’t know what ashift you have configured, but does the drive not present itself as 4k by default?

Update: I should have read your output before googling. Even your output shows a 4K physical sector.


512e drives are everywhere. Exos comes in 512e, Toshiba offers 512e or 4Kn drives, etc.


OT: They offer them, yes. But they are not the default.

To get back on topic: @mattventura, if you expect help from the forum, we need a lot more info.

Please describe your ashift settings, your pool settings, and your dataset or zvol settings…