Disk usage discrepancy?

Yeah, I believe this should be documented better. @Alexey, please consider the following text as public domain/CC0 and if you find the text below accurate (please verify!), feel free to edit it accordingly and incorporate it in storage node documentation.

Average Disk Space Used This Month is the sum of bytes of pieces that the satellite considers as stored. This is what you get paid for.

Used Disk Space is the sum of bytes of pieces that the node believes it is supposed to store. This includes pieces that were deleted from the satellite’s perspective, but not yet garbage-collected by your node. Garbage collection happens once a week per satellite. This number is updated during garbage collection and by the file walker that runs each time your node is started (unless you disabled it). Sometimes the node cannot update these numbers properly, for example when the I/O load is so high that the bandwidth.db file becomes locked, so this number may be imprecise. It may also include data from outdated satellites, though I admit I don’t know how true this is. If you see a large discrepancy here, please make sure the initial file walker is enabled (this is the default), restart your node, and wait until the file walker finishes. And if you see that your node still contains data for satellites that are no longer trusted, run the forget satellite procedure.

On my nodes the difference between the last Average Disk Space Used This Month value and the Used Disk Space number is usually on the order of low hundreds of gigabytes on an active node, though I have occasionally seen larger discrepancies. As such, I’m not really surprised that you see this kind of difference, but if it stays at this level for a month, then it would be suspicious.

Both of the above numbers are reported using SI units, i.e. 1 kB = 1000 bytes, 1 MB = 1000000 bytes, and so on.

Each piece is a separate file on disk. For each file your file system needs to store some metadata (file name, location on disk, attributes, permissions, and so on, usually between 150 bytes and ~1 kB per file). There is also some additional overhead involved: the space allocated to a file is usually rounded up to full sectors/clusters (usually 4 kB, sometimes as large as 2 MB though!), as in most file systems a single cluster cannot be shared between two files, even if the file contents do not fill it. The specific numbers depend on the file system used.
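
To make the rounding concrete, here is a minimal Python sketch of how much space a single file takes once it is rounded up to whole clusters. The function names are made up for illustration, and I am assuming clusters are powers of two (4096 bytes for a "4 kB" cluster), which is my assumption and not something taken from the node software.

```python
def allocated_bytes(file_size: int, cluster_size: int = 4096) -> int:
    """Space taken on disk when a file is rounded up to whole clusters.

    Hypothetical helper; ignores metadata and any file-system-specific
    tricks such as storing tiny files inline.
    """
    if file_size <= 0:
        return 0
    clusters = -(-file_size // cluster_size)  # ceiling division
    return clusters * cluster_size


def slack_bytes(file_size: int, cluster_size: int = 4096) -> int:
    """Bytes allocated to the file but not filled by its contents."""
    return allocated_bytes(file_size, cluster_size) - file_size


if __name__ == "__main__":
    # A 10,000-byte piece on 4 KiB clusters vs. 128 KiB clusters.
    for cluster in (4096, 131072):
        print(cluster, allocated_bytes(10_000, cluster), slack_bytes(10_000, cluster))
```

The same small piece wastes far more space on a large-cluster file system, which is where most of the overhead discussed here comes from.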

I would expect the total overhead of 10 TB worth of node data, assuming around 36 M files, to be:

  • around 85 GB for a typical ext4 file system,
  • at least 110 GB for NTFS with cluster size of 4 kB,
  • at least 340 GB for NTFS with cluster size of 16 kB,
  • at least 2.4 TB for default-formatted exFAT (cluster size of 128 kB; by the way, this is a nice description of the relation of cluster size to space efficiency in exFAT).

(Though I admit I do not know enough about NTFS/exFAT to accurately estimate their overhead.)
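
For anyone who wants to plug in their own numbers, below is a rough Python sketch of one way to get estimates of this magnitude. It assumes that, on average, each file wastes half a cluster and needs a fixed amount of metadata. Both the half-cluster rule and the per-file metadata sizes are assumptions for illustration only, not measured values and not necessarily how the figures above were derived.

```python
def estimated_overhead_bytes(n_files: int, cluster_size: int,
                             metadata_per_file: int) -> int:
    """Rough overhead estimate: average slack of half a cluster per file
    plus a fixed metadata cost per file (both are assumptions)."""
    avg_slack = cluster_size // 2
    return n_files * (avg_slack + metadata_per_file)


if __name__ == "__main__":
    n_files = 36_000_000
    # (label, cluster size in bytes, assumed metadata bytes per file)
    scenarios = [
        ("ext4, 4 KiB blocks", 4096, 256),
        ("NTFS, 4 KiB clusters", 4096, 1024),
        ("NTFS, 16 KiB clusters", 16384, 1024),
        ("exFAT, 128 KiB clusters", 131072, 128),
    ]
    for label, cluster, meta in scenarios:
        overhead = estimated_overhead_bytes(n_files, cluster, meta)
        print(f"{label}: about {overhead / 1e9:.0f} GB")  # SI gigabytes
```

Real file systems differ in details (directory entries, journals, small files stored inline), so treat the output as an order-of-magnitude check only.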

In addition to that, some operating systems report file system usage using IEC 60027-2 units, i.e. 1 KiB = 1024 bytes, 1 MiB = 1048576 bytes. Sometimes they even do that while using SI names (kB instead of KiB, MB instead of MiB, etc.), which adds to the confusion. Here I’m only using SI units.
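
A small Python sketch of the difference between the two unit systems, in case it helps when comparing the dashboard with what your operating system shows. The helper name is made up for illustration.

```python
SI_UNITS = ["B", "kB", "MB", "GB", "TB", "PB"]
IEC_UNITS = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]


def human_readable(n_bytes: int, binary: bool = False) -> str:
    """Format a byte count using SI (base 1000) or IEC (base 1024) units."""
    base = 1024 if binary else 1000
    units = IEC_UNITS if binary else SI_UNITS
    value = float(n_bytes)
    for unit in units:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < base or unit == units[-1]:
            return f"{value:.2f} {unit}"
        value /= base


if __name__ == "__main__":
    ten_tb = 10 * 10**12
    print(human_readable(ten_tb))               # SI view
    print(human_readable(ten_tb, binary=True))  # IEC view of the same bytes
```

The same 10 TB of data shows up as roughly 9.09 TiB in IEC units, so a gap of about 9% between two tools can come from units alone.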

The number for exFAT kinda matches your observations, so… are you using exFAT?