Space used - significant discrepancy

I don’t think this is the usual “binary (TiB) vs decimal (TB)” units issue, because the discrepancy is too big.

One of my nodes is set to a 2TB allocation, and it has filled up most of that. First question, why does the graph on the left disagree with the pie chart to such a large degree? Even subtracting out trash, that doesn’t add up.

Secondly, the larger issue: df -h reports the usage as 2.9TB. Where is the extra usage coming from? I tried to run du, but there are so many files that du takes an astronomical amount of time to go through either the trash or blobs folder. 5 of my 6 nodes are like this. The other four affected nodes have 4-5TB allocations and 1.2-1.4TB used according to the dashboard (slightly lower on the graph), but df reports 1.8-2.1TB used for each one. This is a major problem for me, because the payouts page seems to imply that I’m getting paid for the lower value. It’s also a problem because I risk running out of space on the underlying disk, even if I under-allocate by a fair bit.

This particular one is on a single-disk zpool, and df reports it as:

$ df -h /mnt/storj/a
Filesystem      Size  Used Avail Use% Mounted on
storage/stj-a   3.6T  2.9T  691G  81% /mnt/storj/a

Edit: It’s not trash. The trash folder size is 30GB - less than reported on the dashboard.

A few things to consider:

  1. Check the databases for consistency; they may be corrupted.
  2. Check the filesystem for consistency.
  3. Get the filewalker to run, so it accounts for everything correctly.
  4. df is not a good measure here, because used space is quantized by the sector size and the vast majority of the files are very small, so on average each file wastes about half a sector (unless you use a filesystem like ZFS with compression enabled, which reclaims that slack). A node stores millions of files, most of them tiny, so the waste can be noticeable (gigabytes). I’ve just checked: the compression ratio on my 1TB node is 1.57. Given that Storj data is encrypted and incompressible, had the filesystem not supported compression, the space usage would have been that much higher. I strongly recommend using an appropriate filesystem for the job (see the example after this list).
  5. What the storagenode displays on the dashboard has no bearing on your payout (it would be ripe for abuse if it did). The satellites do all the accounting.
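For point 4, here is how you could check this on your own ZFS dataset (the dataset name pool1/storagenode is only a placeholder for whatever your node uses):

Check whether compression is enabled and what it is currently saving:

zfs get compression,compressratio pool1/storagenode

Enable lz4 if it is off (this only affects data written from now on; existing records are not rewritten retroactively):

zfs set compression=lz4 pool1/storagenode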

My personal approach is to only monitor the node’s availability at its port and not pay attention to anything on the dashboard. The dashboard is a counterproductive distraction. In fact, early on Storj relied on similar databases to account for the chunks stored, and that turned out to be an unreliable disaster. This approach is now only used to keep inconsequential statistics, and chunks are stored in a form of filesystem-based CAS. So I choose to ignore the dashboard.


It’s already on ZFS, so FS corruption shouldn’t be an issue. I’ll try turning compression on, but I know it won’t apply retroactively. The question is: can I safely go through each file and dd it onto itself to force ZFS to rewrite it? The usual “copy, delete original, move copy over original” isn’t a good idea if the data might be read or written in the meantime.

I restarted the node; that should be enough to get the filewalker to run, right?

Also, is the storagenode smart enough to take the actual filesystem space limit into account? My concern is that if I have a 3.6TiB drive and tell Storj to use 3TiB of it, but run into an issue like this, will I risk corruption when the 3TiB of logical data tries to use more than 3.6TiB of physical space?

Hello @mattventura ,
Welcome to the forum!

After checking the databases and the filesystem, you need to let the filewalker run. To force it to start, you may restart the node. Your node should also be online when the satellites send a bloom filter for the garbage collector; it moves unused pieces to the trash, and they are deleted after a week.
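For example, on a Docker setup (the container name storagenode is an assumption; use your own), a restart that gives the node time to shut down cleanly would look roughly like this:

docker restart -t 300 storagenode

The filewalker should then run on startup, unless you have disabled it.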

To check free space in SI units you need to use df --si. Note that this command shows free disk space, not free space in the allocation.
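For example, for the mount from the first post:

df -h /mnt/storj/a     # binary (IEC) units: 3.6T here means 3.6 TiB
df --si /mnt/storj/a   # SI (decimal) units: the same space shows as roughly 10% more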

Copying file by file will take ages.

When I added a special device and needed to “apply” the new configuration to existing data, I used an iterative zfs send | zfs receive approach: catch the new dataset up with the current state while the node keeps running, then stop the node, sync the small remaining difference, and continue the node from the new dataset. That kept the downtime to literally a few seconds.

From my notes:

Original dataset the node is running from: `source`
New dataset: `target`

- Create a recursive snapshot:
        zfs snapshot -r pool1/source@cloning1

- Send it to the new dataset with the node still running:
        zfs send -Rv pool1/source@cloning1 | zfs receive pool1/target

- Do it again to catch up on changes since the previous snapshot:
        zfs snapshot -r pool1/source@cloning2
        zfs send -Rvi pool1/source@cloning1 pool1/source@cloning2 | zfs receive pool1/target

- Stop the node:
        iocage stop storagenode

- Create the final snapshot:
        zfs snapshot -r pool1/source@cloning3

- Send the final incremental snapshot:
        zfs send -Rvi pool1/source@cloning2 pool1/source@cloning3 | zfs receive pool1/target

- Change the mount points in the jail:
        iocage fstab -e storagenode

- Start the node:
        iocage start storagenode

Yes, if you haven’t disabled it, it should run. You will notice it by high IO pressure on your cache device and CPU usage on the node. I would also still check the consistency of the databases (there was an article on how to do that).
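A minimal sketch of such a check, assuming the node is stopped, sqlite3 is installed, and the databases sit in a storage/ subdirectory under the mount from the first post (adjust the path to your layout):

for db in /mnt/storj/a/storage/*.db; do
    echo "$db"
    sqlite3 "$db" "PRAGMA integrity_check;"
done

Each database should answer “ok”; anything else points at a damaged file.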

Good question. I don’t know. But I personally don’t rely on that and always keep way more space than necessary (at least a few extra TB at all times).

Yes, it’s smart enough; you may check in the log that it considers both the free space in the allocation and the actual free space on the disk before accepting an offered piece. Another precaution: the node should stop accepting pieces if there is less than 500MB left in the allocation (or on the disk, whichever is smaller).
However, we could always introduce a bug, so it’s better not to allocate all of the free space on the disk.

You can take the amount of free space reported by df -h (which prints binary units) and specify that same number as the allocation in the node’s config, where it is interpreted in SI units; that gives you roughly a 10% safety margin.
So, for 3.6TiB of free disk space you may safely specify 3.6TB (i.e. SI units) as the allocation.
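That roughly 10% margin follows directly from the unit definitions; a quick sanity check with bc:

echo "3.6 * 2^40 / 10^12" | bc -l

This prints about 3.96, i.e. 3.6 TiB is about 3.96 TB, so allocating 3.6 TB (SI) leaves roughly 0.36 TB of headroom on the disk.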

To see the difference between the data usage and the actual disk usage, you may use these commands:
The data usage:

du -s --si --apparent-size /pool1/storj/storagenode/blobs

The actual disk usage:

du -s --si /pool1/storj/storagenode/blobs

Let me guess: you are using a zvol and RAIDZ1?

I doubt it’s a database inconsistency thing.

If I were to hazard a guess, your ashift is set to 11, which makes ZFS write 2K blocks while your HDD might only be able to write 4K blocks…

This would lead to write amplification… this or something similar would be my gut guess.

Please post your disk specs and your ZFS pool’s ashift configuration (commands to check are below).

It could also be something like a zvol using a 512B block size on a pool with an ashift higher than 9.

ashift is a power of two, so each step doubles the sector size:
ashift 9 = 512, 10 = 1024, 11 = 2048, 12 = 4096 (4Kn).
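To gather the ashift and sector-size details asked for above, something like the following should work (the pool name pool1 is a placeholder, and exact zdb behaviour differs a bit between platforms):

zpool get ashift pool1                 # pool-level ashift property on recent OpenZFS; 0 means auto-detected
zdb -C pool1 | grep ashift             # per-vdev ashift recorded in the pool configuration
lsblk -o NAME,PHY-SEC,LOG-SEC,MODEL    # physical vs logical sector sizes the drives report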

I see the issue with both a single-disk pool and a 4-disk RAIDZ1.

They’re both set to 4k ashift, but one of the zpools is made of 4kn drives while the other is 512e drives, so performance would be awful if I used 512 ashift.

A 4K ashift works fine with 512B drives… because ZFS simply writes across multiple sectors.

Write amplification usually results when it’s the other way around… like, say, attempting ashift 9 / 512B sectors on a 4Kn drive…

Then it would write 512B per 4K sector, leading to roughly 800% write amplification.

For the RAIDZ1 this is totally understandable. It has to do with the volblocksize and the parity cost of zvols.

Short explanation: use datasets instead of zvols; they don’t have a fixed block size (example commands after the link below).
Long explanation: RAIDZ — OpenZFS documentation
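To see which case you are in and what block size applies, you can query the properties directly (the dataset and zvol names are placeholders):

zfs get volblocksize pool1/some-zvol     # a zvol has a fixed block size, chosen at creation
zfs get recordsize pool1/some-dataset    # a dataset's recordsize is only an upper bound; small files use smaller records

On RAIDZ, a small fixed volblocksize is what drives the parity and padding overhead described in the linked page.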

If you also see this issue with a single-disk pool, you messed up the ashift.

Are you sure about that? I thought 512e drives were long gone, except for some niche enterprise drives?

Yeah, but the other way round it would lead to parity overhead and fragmentation.
Setting it too low is only a performance problem; setting it too high is a fragmentation, performance, and space-efficiency problem. That is why hypervisors like Proxmox use 8K as the default volblocksize and not 16K, even though there are lots of 16K+ reads and writes in a VM.


Well, it’s a somewhat old drive, HDN726040ALE614:

NAME   OPT-IO MIN-IO PHY-SEC LOG-SEC MODEL
sdg         0   4096    4096     512 HGST_HDN726040ALE614
├─sdg1      0   4096    4096     512
└─sdg9      0   4096    4096     512

The datasheet, on page 19, says that the physical sector size is 4K.
There is a 512 emulation mode for backwards compatibility and a native 4k mode.

I don’t know what ashift you have configured, but does the drive not present itself as 4k by default?

Update: I should have read your output before googling. Even your output shows a 4K physical sector.


512e drives are everywhere. Exos comes in 512e, Toshiba offers 512e or 4Kn drives, etc.


OT: They offer them, yes. But they are not the default.

To get back on topic: @mattventura, if you expect help from the forum, we need a lot more info.

Please describe your ashift settings, your pool settings, and your dataset or zvol settings…