On a statistical approach for the used space file walker

We’ve recently had a large number of problems that require a full run of the used space file walker, and we have a large number of posts complaining about its duration. We have also seen that the duration itself is a source of problems: not all changes to storage that happen while the file walker is running are accounted for.

One key observation is that the measured values of disk space do not have to be fully accurate; a decent approximation should be acceptable. This suggests that a statistical approach could be of use here.

I would like to propose the following procedure to replace the current used space file walker (a rough code sketch follows the list):

  1. Let min_pieces_to_collect be an operator-configurable value, default 1 million.
  2. Iterate over the two-letter subdirectories of the blobs directory in a random order. For each subdirectory:
    a. Collect the total size of the pieces and the number of pieces in this directory.
    b. If the total number of pieces visited across all subdirectories so far has reached min_pieces_to_collect, stop iterating.
  3. Let subdirectories_visited be the number of subdirectories visited before stopping the loop in step 2.
  4. Let total_size_of_pieces_visited be the total size of all pieces visited in all subdirectories before stopping the loop in step 2.
  5. Estimate the disk space used as: total_size_of_pieces_visited × 1024 / subdirectories_visited
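
A minimal sketch of this procedure in Go (names, layout assumptions, and error handling are illustrative only, not the actual storagenode implementation; blobsDir is assumed to be the directory whose children are the 1024 two-letter prefix directories):

package estimate

import (
    "math/rand"
    "os"
    "path/filepath"
)

// EstimateUsedSpace visits randomly ordered prefix directories until at least
// minPiecesToCollect pieces have been seen, then extrapolates to all 1024 prefixes.
func EstimateUsedSpace(blobsDir string, minPiecesToCollect int) (int64, error) {
    prefixes, err := os.ReadDir(blobsDir)
    if err != nil {
        return 0, err
    }
    // Step 2: visit the prefix directories in a random order ("stir the soup").
    rand.Shuffle(len(prefixes), func(i, j int) { prefixes[i], prefixes[j] = prefixes[j], prefixes[i] })

    var piecesVisited, dirsVisited int
    var sizeVisited int64
    for _, p := range prefixes {
        if !p.IsDir() {
            continue
        }
        // Step 2a: collect the number and total size of pieces in this prefix directory.
        err := filepath.WalkDir(filepath.Join(blobsDir, p.Name()), func(path string, d os.DirEntry, walkErr error) error {
            if walkErr != nil || d.IsDir() {
                return walkErr
            }
            info, err := d.Info()
            if err != nil {
                return err
            }
            piecesVisited++
            sizeVisited += info.Size()
            return nil
        })
        if err != nil {
            return 0, err
        }
        dirsVisited++
        // Step 2b: stop once enough pieces have been collected.
        if piecesVisited >= minPiecesToCollect {
            break
        }
    }
    if dirsVisited == 0 {
        return 0, nil
    }
    // Step 5: extrapolate from the visited subdirectories to all 1024.
    return sizeVisited * 1024 / int64(dirsVisited), nil
}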

This algorithm has the following properties:

  1. It limits the number of pieces visited. In the worst case of extremely large nodes (we’re not there yet) we would visit only a single directory, i.e. 1/1024 of the current work. But for nodes which store between min_pieces_to_collect and 1024 × min_pieces_to_collect pieces (with the default, up to ~1 billion pieces!) the workload will stay roughly the same regardless of the number of pieces.
  2. The relative estimation error does not change much with the number of pieces that the node actually stores. The error depends only on the number of pieces visited, which will be between min_pieces_to_collect and 2 × min_pieces_to_collect (see the note right after this list). A useful analogy is that if you cook a soup, a single spoon is enough to check the taste regardless of the size of the pot, as long as the soup is stirred well. Randomness of piece IDs is a good way to stir the soup, and picking directories at random should also help in the unlikely case of a single directory being affected by some data loss.
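
For a rough intuition on why the error depends mainly on the number of pieces visited, here is a standard sampling argument (a simplification: it treats the visited pieces as a simple random sample, which the random piece IDs should roughly give us):

relative error of the estimate ≈ (σ / μ) / √number_of_pieces_visited

where μ and σ are the mean and standard deviation of the piece size distribution. The total number of pieces on the node does not appear in this formula; only the number of pieces visited does, which is why a fixed-size sample works for nodes of very different sizes.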

I did some simulations to estimate the mean absolute percentage error (MAPE). I took the sizes of the pieces on one of my nodes to get a proper empirical distribution of piece sizes, then simulated 1000 nodes of each size storing these pieces. The numbers show quantiles of MAPE for different min_pieces_to_collect values:

For simulated nodes of 500 GB:

min_pieces_to_collect \ quantile      90%      99%    99.5%
10,000                              4.44%    6.73%    7.25%
100,000                             1.36%    2.06%    2.27%
1,000,000                           0.35%    0.55%    0.62%
10,000,000                              0        0        0

(note: there were fewer than 10M pieces, so all pieces were visited in the last row)

For simulated nodes of 1 TB:

min_pieces_to_collect \ quantile      90%      99%    99.5%
10,000                              4.63%    6.98%    7.33%
100,000                             1.39%    2.20%    2.32%
1,000,000                           0.41%    0.63%    0.67%
10,000,000                              0        0        0

(note: there were fewer than 10M pieces, so all pieces were visited in the last row)

For simulated nodes of 2 TB:

min_pieces_to_collect \ quantile      90%      99%    99.5%
10,000                              4.46%    7.09%    7.45%
100,000                             1.43%    2.48%    2.66%
1,000,000                           0.42%    0.67%    0.77%
10,000,000                          0.03%    0.05%    0.05%

For simulated nodes of 5 TB:

min_pieces_to_collect \ quantile      90%      99%    99.5%
10,000                              2.84%    4.46%    4.74%
100,000                             1.41%    2.20%    2.41%
1,000,000                           0.45%    0.72%    0.81%
10,000,000                          0.11%    0.17%    0.18%

For simulated nodes of 10 TB:

min_pieces_to_collect \ quantile      90%      99%    99.5%
10,000                              2.03%    3.04%    3.38%
100,000                             1.34%    2.24%    2.34%
1,000,000                           0.44%    0.70%    0.75%
10,000,000                          0.13%    0.20%    0.22%

For simulated nodes of 20 TB:

min_pieces_to_collect \ quantile      90%      99%    99.5%
10,000                              1.40%    2.21%    2.43%
100,000                             1.40%    2.21%    2.43%
1,000,000                           0.45%    0.69%    0.74%
10,000,000                          0.14%    0.20%    0.22%

(note: a single subdirectory contains more than 100k pieces, so the numbers for 10k and 100k min_pieces_to_collect will be the same)

For simulated nodes of 50 TB:

min_pieces_to_collect \ quantile      90%      99%    99.5%
10,000                              0.88%    1.36%    1.42%
100,000                             0.88%    1.36%    1.42%
1,000,000                           0.45%    0.70%    0.76%
10,000,000                          0.14%    0.22%    0.24%

(note: a single subdirectory contains more than 100k pieces, so the numbers for 10k and 100k min_pieces_to_collect will be the same)

For simulated nodes of 100 TB:

min_pieces_to_collect \ quantile      90%      99%    99.5%
10,000                              0.64%    0.95%    1.08%
100,000                             0.64%    0.95%    1.08%
1,000,000                           0.45%    0.69%    0.76%
10,000,000                          0.14%    0.21%    0.24%

(note: a single subdirectory contains more than 100k pieces, so the numbers for 10k and 100k min_pieces_to_collect will be the same)

I opted to evaluate an approach that would be the simplest to implement right now. Some potential refinements are:

  1. Pick at least N directories, for some N > 10. This would help with potential single-directory losses, but it would require scanning too much data for large nodes.
  2. Scan only a specific fraction of each directory. This would compensate for point 1 and still reduce the workload for very large nodes, but it would make the estimates more complex (a rough formula is sketched right after this list).
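
For refinement 2, a rough sketch of how the estimate could be adjusted, assuming a uniformly random fraction of pieces is measured in each visited directory (the simulations above cover only the whole-directory approach):

estimated disk space used ≈ (1024 / subdirectories_visited) × Σ (size_of_pieces_sampled_in_directory / fraction_sampled_in_directory)

with the sum taken over the visited subdirectories. One practical note: selecting a random subset of files still requires listing the whole directory, so the saving would likely come mostly from fewer stat calls rather than from fewer directory reads.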

Will this correctly estimate occupied disk space? The main problem is the time it takes to iterate over, say, 20-50 million pieces of sizes from 1 KB to 2 MB. (Just iterating over all of that can take something like 6-10 days for a full 14 TB disk under Windows, for example.)

Can that system help here?
Please forgive all my gaps in understanding,
I just still don’t get how statistics can predict the size of all the remaining files from a sample of just 1/100 of the soup, but I would love it to work!

I would like to run some simulations on my real Storj folder; if you have some scripts to run, I would love to try them.

1 Like

With the default value of min_pieces_to_collect I’m suggesting, you will get an estimate that in 99.5% of cases will be less than 14 TB × 0.7% ≈ 100 GB away from the true value. I believe this will be good enough as a number to be displayed on the dashboard.

No scripts, sorry, unless you want to learn the R programming language.

2 Likes

So at the cost of just one run over only 1M pieces, it can estimate the total used space of a 10M-piece node to 99.3% accuracy (in 99.5% of cases)?

That would be 1/10 of the work needed to establish a fairly correct figure for the node’s total used space.
Combined with a potential raise of the emergency disk cap, which is currently at 5 GB, to 100 GB, this could keep nodes from accepting more data than they have space for, and cut the time of the used-space filewalker by 90%, which would be a HUGE improvement.

We don’t need 100% accurate used disk space on the dashboard for the node to function properly. An inaccurate value only blocks the node from taking new files when in fact it has free space, and the penalty to unlock ingress is a full filewalker run, which ties up I/O for 6-10 days that could be used for real work! Or for the Garbage Collector! YES, we need that I/O for GC, not for the used-space filewalker!
@elek

1 Like

Good idea, but it’s very similar to what I would do:

I assume we still need to run the space walker at a regular cadence. It’s probably a good idea to calculate a different 10% every time.

That’s exactly what I proposed: create a database record for each prefix directory (we have 256) and save the space usage per prefix.

  1. It supports stop / resume (you can continue from any prefix)
  2. It can have a freshness field, so you can skip re-checking directories if they are fresh enough (sketched below)
  3. It is almost the same as what you propose, just that we can cache the historical data and slowly recalculate the percentages…
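
A minimal sketch of what such a per-prefix record and freshness check could look like, in Go (the names, fields, and freshness window are hypothetical, not the actual implementation):

package prefixcache

import "time"

// PrefixUsage is one cached record per prefix directory.
type PrefixUsage struct {
    Prefix     string    // e.g. "aa"
    TotalBytes int64     // total size of pieces under this prefix
    PieceCount int64     // number of pieces under this prefix
    UpdatedAt  time.Time // freshness field
}

// NeedsRescan reports whether the cached record is stale and the
// prefix directory should be walked again.
func NeedsRescan(rec PrefixUsage, maxAge time.Duration) bool {
    return time.Since(rec.UpdatedAt) > maxAge
}

// TotalUsed sums the cached records. Stale prefixes can be rescanned one at a
// time, which also gives stop/resume for free: you can continue from any prefix.
func TotalUsed(records []PrefixUsage) (total int64) {
    for _, r := range records {
        total += r.TotalBytes
    }
    return total
}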
3 Likes

You will underestimate during heavy uploads and overestimate during quiet periods. Yet the I/O will still be there; even a perfectly stable node will have to visit all pieces.

2 Likes

The current frequency of running the space calculation is excessive. Ideally, it should be executed only once, and then updates should be made based on uploads and deletions that take place.

Let’s say a full filewalker scan takes 5 days; I think SNOs would be happy to accept this as long as it did not have to run again for the next 5 months.

Main issues are:

  1. The frequency of runs is too high.
  2. The time it takes to produce results is too long. To address this, an approach that provides projections while the calculation is in progress would be a significant improvement, eliminating the need to wait for the entire process to complete.
3 Likes

The frequency is matched to the frequency of restarts. If your node doesn’t crash, the usual default frequency is once per 2 weeks; however, it can be disabled.
All other operations are supposed to update the databases and should not require the scan on start at all. However, we may have a bug (several… :man_facepalming: ), or you may have side usage on the disk (like, you know, using only what you already have, which is likely used by your services too…), in which case the node cannot update the database to account for that side usage.

You may disable the lazy mode and it will finish faster.

1 Like

If the OS can keep track of all files on the drive, with the already implemented filesystem and metadata/inodes, why don’t we use that to account for all pieces? This is still a big enigma to me. You could also store the piece size in the metadata, if that is not stored during normal OS operation. So you would only need to read the filesystem and metadata, without going through all the pieces on the drive.
You could also store some hash/verification code in there for each file, to verify its authenticity.

2 Likes

Because it is slower/less efficient than the proposed solutions. Right now, the filewalker does use the OS as you said. It goes over the directories and files, calculates the used space, etc. The problem is that it takes a long time, especially on a large node (it took about 2 days to run on my node with 50 TB of data in it). I don’t know how many millions of files that is, but it just takes a long time to count them all.

Running du on the directories would likely take the same amount of time.

However, the OS is designed to be universal; the way the filesystem stores data has to be good enough for pretty much any use. If you know what your use is, you can optimize things. For example, as the OP showed, because of the way the node stores pieces in the directories, you can check a small portion of them and get a pretty reasonable estimate of the used space.
Another possibility is to have a database with the directory sizes in it; if a directory was not modified since the last check, you can just take its size from the database.
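
A rough sketch of that check in Go (the cache entry and rescan function are hypothetical; on most filesystems a directory’s modification time changes when entries are added or removed inside it, which matches how pieces are created and deleted):

package dircache

import (
    "os"
    "time"
)

// CachedDirSize is a hypothetical cached entry for one directory.
type CachedDirSize struct {
    SizeBytes int64
    ModTime   time.Time // directory mtime observed at the last full count
}

// SizeOrRescan returns the cached size if the directory appears unchanged,
// otherwise it calls rescan to recount the directory.
func SizeOrRescan(dir string, cached CachedDirSize, rescan func(string) (int64, error)) (int64, error) {
    info, err := os.Stat(dir)
    if err != nil {
        return 0, err
    }
    if info.ModTime().Equal(cached.ModTime) {
        return cached.SizeBytes, nil // unchanged since the last check
    }
    return rescan(dir)
}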

4 Likes

In Linux (and maybe other POSIX OSes) you can get the filesystem’s used/free space at ZERO cost using df. This info is updated in real time by the OS.

When the whole filesystem is used for a storage node, maybe this method could be used, subtracting from the total the space used by the other dirs outside the blobs path, which usually contain only a few files.

A flag in the config file could be used to indicate that the whole FS is only used by the node and switch to this method.

$ df --block-size=1 /mount/point

Filesystem          1B-blocks          Used     Available Use% Mounted on
/dev/device      11974883188736 8398533345280 3576333066240  71% /mount/point
3 Likes

Yes, I know that. But if the node gets restarted every 2 weeks and needs 5 days for a full run, this is not a good idea and not acceptable.

I know changes are still in progress: make used-space calculation frequency configurable
and I hope we will see many configuration options as suggested:

2 Likes

This is applicable only to the case when you use the whole drive for the node.
But it’s not always so. For example, I have three nodes; they each run on their own disk, but I have my own files there too, and the disks are used by other services which run there as well.
I just allocated a free part of the disk, not the whole disk. So for my nodes the df stat would be wrong.

1 Like

What would be required is a working and resilient ledger that counts the bytes that get uploaded and the ones that get moved and deleted. If this worked in a reliable way, I don’t see the need for walking all files regularly for the sole purpose of calculating space. At least not as frequently as it is done currently.

1 Like

If you do not have database or filewalker errors and the data is showing correctly on the dashboard, you may disable the scan on startup.
All remaining operations should update the databases regularly (first caching in memory and then flushing to the databases every hour by default).

1 Like

Oh, and does it flush to the database on restart/stop?

1 Like

In the case of a graceful stop/restart it should.

1 Like

Yes, that is what I said; nevertheless I think this would be useful for the majority of SNOs.

This code snippet shows how to get the info in C++ using ONLY the standard library, without any dependencies, so it probably works even on Windows (C++ being portable and all that).

Something similar should be possible in golang (a rough equivalent is sketched after the C++ snippet).

#include <iostream>
#include <filesystem>

int main()
{
    std::filesystem::space_info tmp = std::filesystem::space("/mount/point/");

    std::cout << "Capacity: " << tmp.capacity << '\n'
              << "Free space: " << tmp.free << '\n'
              // available to a non-privileged process
              << "Available space: " << tmp.available << '\n';
}
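
A similar sketch in Go, using only the standard syscall package (Linux-only as written; Windows would need a different API, and the function name here is illustrative):

package fsinfo

import (
    "fmt"
    "syscall"
)

// PrintSpace prints the capacity, free, and available bytes of the
// filesystem containing path, similar to the C++ snippet above.
func PrintSpace(path string) error {
    var st syscall.Statfs_t
    if err := syscall.Statfs(path, &st); err != nil {
        return err
    }
    bsize := uint64(st.Bsize)
    fmt.Println("Capacity:        ", st.Blocks*bsize)
    fmt.Println("Free space:      ", st.Bfree*bsize)
    // available to a non-privileged process
    fmt.Println("Available space: ", st.Bavail*bsize)
    return nil
}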
2 Likes

Which is what bandwidth.db was supposed to be. It isn’t, and I sort of think this might not actually be possible under hardware constraints. Resiliency requires IOPS, which is scarce on many setups.

2 Likes

Yes, but then we would need to change the requirements and require providing the whole drive or using quotas. However, in the case of Windows it would be required to run the storagenode service under a different user (I do not think it’s a good idea to set quotas on the SYSTEM user, which is currently used by the storagenode service and many others).
Or, even worse, use partitions (which would reduce available IOPS too) or virtual disks.

1 Like