I don’t think this is the usual “SI vs decimal” issue because the discrepancy is too big.
One of my nodes is set to a 2TB allocation, and it has filled up most of that. First question, why does the graph on the left disagree with the pie chart to such a large degree? Even subtracting out trash, that doesn’t add up.
Secondly, the larger issue: df -h reports the usage as 2.9TB. Where is the extra usage coming from? I tried to run du, but there are so many files that du takes an astronomical amount of time to go through either the trash or blobs folder. 5/6 of my nodes are like this. The other four have 4-5TB allocations, 1.2-1.4TB used according to the dashboard (slightly lower on the graph), but df reports 1.8-2.1TB used for each one. This is a major problem for me, because the payouts page seems to imply that I’m getting paid for the lower value. It’s also a problem because I risk running out of space on the underlying disk, even if I under-allocate by a fair bit.
This particular one is on a single-disk zpool, and df reports it as:
$ df -h /mnt/storj/a
Filesystem Size Used Avail Use% Mounted on
storage/stj-a 3.6T 2.9T 691G 81% /mnt/storj/a
Edit: It’s not trash. The trash folder size is 30GB - less than reported on the dashboard.
- Check the databases for consistency; they may be corrupted.
- Check the filesystem for consistency.
- Get the filewalker to run, so it accounts for everything correctly.
df is not a good measure here because used space is quantized by the sector size, and the vast majority of the files are very small, so each file on average wastes half a sector (unless you use a filesystem like ZFS with compression enabled). A node stores millions of files, most of them very small, so the amount of waste can be noticeable (gigabytes). I’ve just checked: the compression ratio on my 1TB node is 1.57. Given that Storj data is encrypted and incompressible, had the filesystem not supported compression, the space usage would have been that much higher. I strongly recommend using an appropriate filesystem for the job.
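A quick, self-contained way to see this allocation overhead (this uses a throwaway temp file, not node data, and assumes GNU coreutils `stat`; the flags differ on BSD):

```shell
# A 1-byte file still occupies at least one full allocation block on disk.
tmp=$(mktemp)
printf 'x' > "$tmp"
apparent=$(stat -c %s "$tmp")                 # logical (apparent) size in bytes
allocated=$(( $(stat -c %b "$tmp") * 512 ))   # st_blocks is counted in 512-byte units
echo "apparent=${apparent}B allocated=${allocated}B"
rm -f "$tmp"
```

Multiply that per-file gap by millions of small pieces and the difference between apparent and on-disk usage adds up to gigabytes.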
What the storagenode displays in the dashboard has no bearing on your payout (it would be ripe for abuse if it did). The satellite does all the accounting.
My personal approach is to only monitor the node’s availability at its port and not pay attention to anything on the dashboard; the dashboard is a counterproductive distraction. In fact, early on Storj relied on similar databases to account for the chunks stored, and this turned out to be an unreliable disaster. That approach is now only used to keep inconsequential statistics, and chunks are stored in a filesystem-based CAS. So I choose to ignore the dashboard.
It’s already on ZFS, so FS corruption shouldn’t be an issue. I’ll try turning compression on, but I know it won’t apply retroactively. Question is - can I go through each file, and dd it onto itself safely to force zfs to rewrite it? The usual “copy, delete original, move copy over original” isn’t a good idea if the data might be read/written during that.
I restarted the node, that should be enough to get filewalker to run, right?
Also, is the storagenode smart enough to also take the actual FS space limit into account? My concern is that if I have a 3.6TiB drive and tell storj to use 3TiB of it, but I run into an issue like this, will I risk corruption when the 3TiB of actual data tries to use >3.6TiB of physical space?
After checking the databases and the filesystem, you need to allow the filewalker to run. To force it to start, you may restart the node. Your node should be online when the satellites send a bloom filter for the Garbage Collector; it will move unused pieces to the trash and then delete them after a week.
To check free space in SI units, use df --si. Note that this command shows free disk space, not free space within the allocation.
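For example, using the mount point from earlier in the thread (any path works), the same filesystem shown in both unit systems:

```shell
# Same filesystem, binary units (-h: GiB/TiB) vs SI units (--si: GB/TB).
# The SI figures read ~7-10% larger for the same number of bytes.
df -h   /mnt/storj/a
df --si /mnt/storj/a
```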
When I added a special device and needed to “apply” the new configuration to existing data, I used an iterative zfs send | zfs receive approach: catch the new dataset up with the current state, then stop the node, sync the small remaining difference, and restart the node from the new dataset, keeping downtime to literally a few seconds.
From my notes:
Original dataset node is running from: `source`
New dataset: `target`
- Create recursive snapshot:
zfs snapshot -r pool1/source@cloning1
- Send to another new dataset with the node still running:
zfs send -Rv pool1/source@cloning1 | zfs receive pool1/target
- Do it again to catch up changes since previous snapshot:
zfs snapshot -r pool1/source@cloning2
zfs send -Rvi pool1/source@cloning1 pool1/source@cloning2 | zfs receive pool1/target
- Stop the node
iocage stop storagenode
- Create final snapshot
zfs snapshot -r pool1/source@cloning3
- Send incremental snapshot
zfs send -Rvi pool1/source@cloning2 pool1/source@cloning3 | zfs receive pool1/target
- Change mount points on the jail
iocage fstab -e storagenode
- Start the node.
iocage start storagenode
Yes, if you haven’t disabled it, it should run. You will notice it by high IO pressure on your cache device and CPU usage on the node. I would also still check the consistency of the databases (there was an article on how to do that).
Good question. I don’t know. But I personally don’t rely on that and always keep more space than necessary (at least a few extra TB at all times).
Yes, it’s smart enough. You may check in the log: it checks free space in the allocation and actual free space on the disk before accepting an offered piece. As another precaution, the node should stop accepting pieces if there is less than 500MB left in the allocation (or on the disk, whichever is smaller).
However, we could introduce a bug, so it’s better not to allocate all of the disk’s free space.
You may check the amount of free space using df -h (it prints binary units), but specify that number as the allocation in the node’s config in SI units; the two differ by roughly 10%.
So, for 3.6TiB of free disk space you may safely specify 3.6TB (i.e. in SI units) as the allocation.
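The arithmetic behind that ~10% margin (a quick sketch, nothing node-specific):

```shell
# 1 TiB = 2^40 bytes; 1 TB = 10^12 bytes. Specifying "3.6 TB" on a
# 3.6 TiB disk leaves the difference between the two as headroom.
awk 'BEGIN {
  tib = 3.6 * 1024^4          # 3.6 TiB in bytes
  tb  = 3.6 * 10^12           # 3.6 TB in bytes
  printf "3.6 TiB  = %.0f bytes\n", tib
  printf "3.6 TB   = %.0f bytes\n", tb
  printf "headroom = %.1f%%\n", (tib - tb) / tib * 100
}'
```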
To see the difference between the data (apparent) usage and the actual disk usage, you may use these commands:
The data usage:
du -s --si --apparent-size /pool1/storj/storagenode/blobs
The disk usage:
du -s --si /pool1/storj/storagenode/blobs
If you see this issue even with a single-disk pool, you’ve messed up ashift.
Sure about that? I thought 512e drives were long gone, except for some niche enterprise drives?
Yeah, but the other way round, it would lead to parity overhead and fragmentation.
Setting it too low is only a performance problem. Setting it too high is a fragmentation, performance, and space-efficiency problem. That is why hypervisors like Proxmox use 8k as the default and not 16k, even though there are a lot of 16k+ reads and writes on a VM.