Real free disk space safety mechanism is not reliable

I had a look at the storage node source code to verify that the check against filling up a file system makes sense, and I have some concerns that it may not prevent the storage node from failing. For reference, I’m using the following pieces of code to explain: endpoint.go:Upload, monitor.go:AvailableSpace.

The formula for available space as computed by the monitor code is:

freeSpaceForStorj := min(disk free space, min(total disk space, allocated disk space) - space used by pieces)
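
To make the formula concrete, here is a minimal Go sketch of the computation (my own names and example numbers, not the actual monitor.go code), plugging in the worst-case figures discussed below:

```go
package main

import "fmt"

// availableSpace mirrors the formula above (a sketch only, not the real
// monitor.go code). min is the builtin available since Go 1.21.
func availableSpace(diskFree, diskTotal, allocated, usedByPieces int64) int64 {
	return min(diskFree, min(diskTotal, allocated)-usedByPieces)
}

func main() {
	const (
		MB int64 = 1_000_000
		GB int64 = 1_000 * MB
		TB int64 = 1_000 * GB
	)
	// Worst case discussed below: 1 MB really free, 20 TB allocated on a
	// 10 TB disk, 500 GB accounted to pieces.
	fmt.Println(availableSpace(1*MB, 10*TB, 20*TB, 500*GB)) // prints 1000000
}
```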

The upload code will not accept uploads if this value is smaller than the order limit’s size limit, and by default it will send a low disk space notification to the satellite (which should “quickly” prevent new uploads) when the value drops below 5 GB.

Let’s deal with the latter first. What happens if this notification fails? After all, one of the fallacies of distributed computing is assuming the network is reliable. It turns out we simply log an error and then wait until the cooldown period (default: 10 minutes) expires before sending another one. That may be too late given the recent tests, especially if the notification fails precisely because the network is already saturated by those same tests! So we cannot fully depend on the satellite to prevent uploads, and for the purposes of a worst-case analysis we can ignore the notification altogether.
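
For illustration, here is a rough Go sketch of that worst case. The names (spaceReporter, maybeNotify) and the structure are mine, not the storagenode’s; the point is only that a failed notification is merely logged and nothing further happens until the cooldown expires:

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"
)

const lowSpaceThreshold = 5_000_000_000 // 5 GB

// spaceReporter is a hypothetical stand-in for the node-side reporting logic.
type spaceReporter struct {
	lastAttempt time.Time
	cooldown    time.Duration // the real node defaults to 10 minutes
}

func (r *spaceReporter) maybeNotify(ctx context.Context, free int64, notify func(context.Context, int64) error) {
	if free >= lowSpaceThreshold || time.Since(r.lastAttempt) < r.cooldown {
		return // either not low on space yet, or still inside the cooldown window
	}
	r.lastAttempt = time.Now()
	if err := notify(ctx, free); err != nil {
		// Worst case: the network is saturated, this call fails too, and the
		// satellite keeps scheduling uploads for another full cooldown period.
		log.Printf("error notifying satellites: %v", err)
	}
}

func main() {
	r := &spaceReporter{cooldown: 10 * time.Minute}
	failing := func(context.Context, int64) error { return errors.New("network unreachable") }
	r.maybeNotify(context.Background(), 1_000_000, failing) // logs the error and gives up
	r.maybeNotify(context.Background(), 1_000_000, failing) // silently skipped: cooldown
}
```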

Assume a scenario where we are within 1 MB of completely filling the disk, we have overallocated, and the space used by pieces is severely underestimated:

freeSpaceForStorj := min(1 MB, min(10 TB, 20 TB) - 500 GB) == 1 MB.

So far, so good: we won’t accept any uploads bigger than 1 MB.

Let’s consider two scenarios:

Scenario 1: What happens though if there is a request to upload a piece sized exactly 1 MB?

We need to write 1 MB for the data + 512 bytes for the piece header + at least 60 bytes for the direntry (depends on the file system) + potentially some additional space for file metadata such as an MFT record or inode (depends on the file system).

Oops. Failure. The piece upload will fail.
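
A quick back-of-the-envelope check in Go (the 512-byte header comes from the piece format; the direntry and per-file metadata overheads are file-system-dependent assumptions):

```go
package main

import "fmt"

func main() {
	const (
		reportedFree = 1_000_000 // what freeSpaceForStorj says we may accept
		pieceData    = 1_000_000 // the piece itself
		pieceHeader  = 512       // header stored alongside the piece data
		direntry     = 60        // minimum directory entry size (fs-dependent)
	)
	needed := pieceData + pieceHeader + direntry // plus inode/MFT space on top
	fmt.Printf("need at least %d bytes, have %d: short by %d\n",
		needed, reportedFree, needed-reportedFree)
}
```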

Scenario 2: A bloom filter arrives. The node wants to create a new date-named directory, then 1024 subdirectories inside it. We need to allocate at least one block to hold the direntries of each subdirectory, which amounts to 4 KiB × 1024 = 4 MiB.

Oops. Failure. We cannot even garbage-collect existing pieces to free up disk space!
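
The same kind of estimate for that directory layout, assuming a 4 KiB allocation unit per newly created directory (both the block size and the exact layout vary between file systems):

```go
package main

import "fmt"

func main() {
	const (
		blockSize = 4 * 1024 // assumed allocation unit per new directory
		subdirs   = 1024     // two-character piece-prefix subdirectories
	)
	needed := blockSize * (1 + subdirs) // the date-named directory plus its subdirectories
	fmt.Printf("at least %d bytes (~%d MiB) just to start garbage collection\n",
		needed, needed/(1<<20))
}
```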

There are other types of data that need to be written during normal node operation:

  • Databases and orders (if they have not been moved to separate storage).
  • Moving a file to trash requires adding an entry to the target trash directory, and maybe even creating one. The source directory does not shrink in return (e.g., ext4 never reduces directory size), so this is not just reassigning those 60 bytes of direntry storage from one directory to another.
  • Log-structured file systems require a certain amount of free disk space even to delete a file directly, e.g., exactly when trying to remove an upload that filled up the disk!

My thoughts:

  • My understanding is that a real free disk space safety mechanism should hard-stop uploads on the node side when there is less than some tens of megabytes of actual free disk space. I would keep it safe and stop at 5 GB, the same amount at which the satellite stops sending uploads (see the sketch after this list).
  • Node operators who have observed a disk filled to the brim by a node: please check whether you have a log entry with one of the following messages close to the last successful uploads: error during updating node information or error notifying satellites.
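
Here is what such a node-side hard stop could look like. This is only a sketch under my own assumptions: the names checkHardStop and minFreeReserve, the example path, and the use of golang.org/x/sys/unix (Linux-specific) are mine, not existing storagenode code or configuration:

```go
package main

import (
	"errors"
	"fmt"

	"golang.org/x/sys/unix"
)

// minFreeReserve is the proposed hard floor, matching the 5 GB figure the
// satellite already uses (in practice this would likely be configurable).
const minFreeReserve int64 = 5_000_000_000

var errDiskNearlyFull = errors.New("refusing upload: free disk space below reserve")

// checkHardStop asks the OS how much space is really free on the volume
// holding storageDir and refuses once it drops below the reserve, regardless
// of what the allocation math or the satellite believes.
func checkHardStop(storageDir string) error {
	var st unix.Statfs_t
	if err := unix.Statfs(storageDir, &st); err != nil {
		return err
	}
	free := int64(st.Bavail) * int64(st.Bsize)
	if free < minFreeReserve {
		return errDiskNearlyFull
	}
	return nil
}

func main() {
	// Example path only; a real node would use its configured storage directory.
	if err := checkHardStop("/mnt/storagenode/storage"); err != nil {
		fmt.Println(err)
	}
}
```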

Is that 5GB stop something tunable? I’ve been fiddling with STORAGE flags to fine-tune free space… but I’d really just like a way to say “use as much space as you want as long as the OS says there’s 100GB free”.


It is… if you have access to the satellite’s configuration :stuck_out_tongue:


Wow (thanks for looking at the code), I actually thought the 5 GB threshold was a thing on the node side, where it would just reject new files.

But if it relies on communication with the satellites then… well… yeah…


I also thought that 5GB was a kill-switch on the node side.