We have recently seen a large number of problems that require a full run of the used-space file walker, and a large number of posts complaining about its duration. That duration is itself a source of problems: not all changes to storage that happen while the file walker is running are accounted for.
One key observation is that the measured disk-space values do not have to be fully accurate; a decent approximation should be acceptable. This suggests that a statistical approach could be of use here.
I would like to propose the following procedure to replace the current used space file walker:
1. Let `min_pieces_to_collect` be an operator-configurable value, with a default of 1 million.
2. Iterate over the two-letter subdirectories of the `blobs` directory in a random order. For each subdirectory:
   a. Collect the total size of pieces and the number of pieces in this directory.
   b. If the total number of pieces visited across all subdirectories so far has reached `min_pieces_to_collect`, stop iterating.
3. Let `subdirectories_visited` be the number of subdirectories visited before the loop in step 2 stopped.
4. Let `total_size_of_pieces_visited` be the total size of all pieces visited in all subdirectories before the loop in step 2 stopped.
5. Estimate the disk space used as `total_size_of_pieces_visited × 1024 / subdirectories_visited` (a rough sketch in Go follows this list).
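To make the procedure concrete, here is a minimal sketch in Go. This is my own illustration under simplifying assumptions, not existing storagenode code: it assumes a flat layout of two-letter subdirectories directly under `blobsDir`, ignores per-satellite subdirectories and trash, and skips the error handling a real implementation would need. `EstimateUsedSpace` and `scanSubdir` are hypothetical names.

```go
// A minimal sketch of the proposed estimator (illustration only).
package estimator

import (
	"math/rand"
	"os"
	"path/filepath"
)

const totalSubdirs = 1024 // number of two-letter subdirectories

// EstimateUsedSpace visits randomly ordered subdirectories until at least
// minPiecesToCollect pieces have been seen, then extrapolates the total.
func EstimateUsedSpace(blobsDir string, minPiecesToCollect int) (int64, error) {
	entries, err := os.ReadDir(blobsDir)
	if err != nil {
		return 0, err
	}
	// Visit subdirectories in a random order so the sample is unbiased.
	rand.Shuffle(len(entries), func(i, j int) { entries[i], entries[j] = entries[j], entries[i] })

	var totalSize int64
	var piecesVisited, subdirsVisited int
	for _, e := range entries {
		if !e.IsDir() {
			continue
		}
		size, count, err := scanSubdir(filepath.Join(blobsDir, e.Name()))
		if err != nil {
			return 0, err
		}
		totalSize += size
		piecesVisited += count
		subdirsVisited++
		if piecesVisited >= minPiecesToCollect {
			break
		}
	}
	if subdirsVisited == 0 {
		return 0, nil
	}
	// total_size_of_pieces_visited × 1024 / subdirectories_visited
	return totalSize * totalSubdirs / int64(subdirsVisited), nil
}

// scanSubdir returns the combined size and count of regular files (pieces)
// in a single two-letter subdirectory.
func scanSubdir(dir string) (int64, int, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return 0, 0, err
	}
	var size int64
	var count int
	for _, e := range entries {
		info, err := e.Info()
		if err != nil {
			return 0, 0, err
		}
		if info.Mode().IsRegular() {
			size += info.Size()
			count++
		}
	}
	return size, count, nil
}
```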
This algorithm has the following properties:
- It limits the number of pieces visited. In the worst case of extremely large nodes (we’re not there yet) we would visit only a single subdirectory, i.e. 1/1024 of the current work. For nodes which store between `min_pieces_to_collect` and `1024 × min_pieces_to_collect` pieces (with the default: up to ~1 billion pieces!) the workload will stay roughly the same regardless of the number of pieces.
- The relative estimation error does not change much with the number of pieces that the node actually stores. The error depends only on the number of pieces visited, which will be between `min_pieces_to_collect` and `2 × min_pieces_to_collect` (see the short error model after this list). A useful analogy: if you cook a soup, a single spoon is enough to check the taste regardless of the size of the pot, as long as the soup is stirred well. The randomness of piece IDs is a good way to stir the soup, and picking directories at random should also help in the unlikely case of a single directory being affected by some data loss.
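To make the second point slightly more concrete, here is a back-of-the-envelope error model. It is my simplification, not part of the proposal: it treats the piece-to-subdirectory assignment as uniform and ignores the adaptive stopping rule, so it should only be read as an indication of why the error scales with the sample size rather than with the node size.

```latex
% Simplified error model (my addition, not part of the proposal). Assume each
% of the N pieces, with size mean \mu and standard deviation \sigma, lands in
% a uniformly random subdirectory, and that roughly n pieces end up in the
% visited subdirectories. Ignoring the adaptive stopping rule, the relative
% standard error of the estimate \widehat{S} of the true total S is roughly
\[
  \frac{\operatorname{SE}\bigl(\widehat{S}\bigr)}{S}
  \;\approx\; \sqrt{\frac{1 + (\sigma/\mu)^2}{n}},
\]
% i.e. it depends on the number of pieces visited and on the shape of the
% piece-size distribution, but not on the node's total piece count N.
```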
I did some simulations to estimate the mean absolute percentage error (MAPE). I took the piece sizes from one of my nodes to get a realistic empirical distribution of piece sizes, then simulated 1000 nodes of each size storing such pieces (a rough sketch of the simulation is given after the tables). The numbers show quantiles of MAPE for different `min_pieces_to_collect` values:
For simulated nodes of 500 GB:
| min_pieces_to_collect \ quantile | 90% | 99% | 99.5% |
|---|---|---|---|
| 10,000 | 4.44% | 6.73% | 7.25% |
| 100,000 | 1.36% | 2.06% | 2.27% |
| 1,000,000 | 0.35% | 0.55% | 0.62% |
| 10,000,000 | 0 | 0 | 0 |

(note: there were fewer than 10M pieces, so all pieces were visited in the last row)
For simulated nodes of 1 TB:
| min_pieces_to_collect \ quantile | 90% | 99% | 99.5% |
|---|---|---|---|
| 10,000 | 4.63% | 6.98% | 7.33% |
| 100,000 | 1.39% | 2.20% | 2.32% |
| 1,000,000 | 0.41% | 0.63% | 0.67% |
| 10,000,000 | 0 | 0 | 0 |

(note: there were fewer than 10M pieces, so all pieces were visited in the last row)
For simulated nodes of 2 TB:
| min_pieces_to_collect \ quantile | 90% | 99% | 99.5% |
|---|---|---|---|
| 10,000 | 4.46% | 7.09% | 7.45% |
| 100,000 | 1.43% | 2.48% | 2.66% |
| 1,000,000 | 0.42% | 0.67% | 0.77% |
| 10,000,000 | 0.03% | 0.05% | 0.05% |
For simulated nodes of 5 TB:
| min_pieces_to_collect \ quantile | 90% | 99% | 99.5% |
|---|---|---|---|
| 10,000 | 2.84% | 4.46% | 4.74% |
| 100,000 | 1.41% | 2.20% | 2.41% |
| 1,000,000 | 0.45% | 0.72% | 0.81% |
| 10,000,000 | 0.11% | 0.17% | 0.18% |
For simulated nodes of 10 TB:
| min_pieces_to_collect \ quantile | 90% | 99% | 99.5% |
|---|---|---|---|
| 10,000 | 2.03% | 3.04% | 3.38% |
| 100,000 | 1.34% | 2.24% | 2.34% |
| 1,000,000 | 0.44% | 0.70% | 0.75% |
| 10,000,000 | 0.13% | 0.20% | 0.22% |
For simulated nodes of 20 TB:
| min_pieces_to_collect \ quantile | 90% | 99% | 99.5% |
|---|---|---|---|
| 10,000 | 1.40% | 2.21% | 2.43% |
| 100,000 | 1.40% | 2.21% | 2.43% |
| 1,000,000 | 0.45% | 0.69% | 0.74% |
| 10,000,000 | 0.14% | 0.20% | 0.22% |

(note: a single subdirectory contains more than 100k pieces, so the numbers for 10k and 100k `min_pieces_to_collect` will be the same)
For simulated nodes of 50 TB:
| min_pieces_to_collect \ quantile | 90% | 99% | 99.5% |
|---|---|---|---|
| 10,000 | 0.88% | 1.36% | 1.42% |
| 100,000 | 0.88% | 1.36% | 1.42% |
| 1,000,000 | 0.45% | 0.70% | 0.76% |
| 10,000,000 | 0.14% | 0.22% | 0.24% |

(note: a single subdirectory contains more than 100k pieces, so the numbers for 10k and 100k `min_pieces_to_collect` will be the same)
For simulated nodes of 100 TB:
| min_pieces_to_collect \ quantile | 90% | 99% | 99.5% |
|---|---|---|---|
| 10,000 | 0.64% | 0.95% | 1.08% |
| 100,000 | 0.64% | 0.95% | 1.08% |
| 1,000,000 | 0.45% | 0.69% | 0.76% |
| 10,000,000 | 0.14% | 0.21% | 0.24% |

(note: a single subdirectory contains more than 100k pieces, so the numbers for 10k and 100k `min_pieces_to_collect` will be the same)
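For reference, here is a rough sketch of the kind of simulation described above, reconstructed by me rather than taken from the original script. It assumes `pieceSizes` holds an empirical sample of piece sizes from a real node, assigns pieces to 1024 subdirectories uniformly at random, runs the proposed estimator on each synthetic node, and reports quantiles of the absolute percentage error.

```go
// Rough sketch of the MAPE simulation (my reconstruction, illustration only).
package estimator

import (
	"math/rand"
	"sort"
)

// SimulateMAPE builds `runs` synthetic nodes with `piecesPerNode` pieces drawn
// from the empirical piece-size sample, estimates each node's used space with
// the proposed subdirectory-sampling procedure, and returns the requested
// quantiles of the absolute percentage error.
func SimulateMAPE(pieceSizes []int64, piecesPerNode, minPiecesToCollect, runs int, quantiles []float64) []float64 {
	const numSubdirs = 1024
	errs := make([]float64, 0, runs)
	for r := 0; r < runs; r++ {
		// Assign pieces to subdirectories uniformly at random, mimicking
		// the distribution of random piece IDs.
		dirSize := make([]int64, numSubdirs)
		dirCount := make([]int, numSubdirs)
		var trueTotal int64
		for i := 0; i < piecesPerNode; i++ {
			size := pieceSizes[rand.Intn(len(pieceSizes))]
			d := rand.Intn(numSubdirs)
			dirSize[d] += size
			dirCount[d]++
			trueTotal += size
		}

		// Proposed estimator: visit subdirectories in random order until at
		// least minPiecesToCollect pieces have been seen, then extrapolate.
		var sampledSize int64
		var sampledPieces, visited int
		for _, d := range rand.Perm(numSubdirs) {
			sampledSize += dirSize[d]
			sampledPieces += dirCount[d]
			visited++
			if sampledPieces >= minPiecesToCollect {
				break
			}
		}
		estimate := sampledSize * numSubdirs / int64(visited)

		pctErr := 100 * float64(estimate-trueTotal) / float64(trueTotal)
		if pctErr < 0 {
			pctErr = -pctErr
		}
		errs = append(errs, pctErr)
	}

	// Report the requested quantiles of the absolute percentage error.
	sort.Float64s(errs)
	out := make([]float64, len(quantiles))
	for i, q := range quantiles {
		out[i] = errs[int(q*float64(len(errs)-1))]
	}
	return out
}
```

Running this with the 90%, 99%, and 99.5% quantiles for different `min_pieces_to_collect` values produces tables of the shape shown above; the exact numbers depend on the piece-size sample used.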
I opted to evaluate the approach that would be the simplest to implement right now. Some potential refinements are:
1. Pick at least N directories, for some N > 10. This would help with potential single-directory losses, but it would require scanning too much data for large nodes.
2. Scan only a specific fraction of each directory. This would compensate for point 1 and still reduce the workload for very large nodes, but it would make the estimates more complex (a rough sketch of how the extrapolation would change is given below).
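A hypothetical sketch of refinement 2, only to show how the extrapolation factor would change; the names and the filename-hash sampling rule are my own illustration, not anything that exists in the codebase:

```go
// Hypothetical sketch of refinement 2 (illustrative only, not existing code):
// within each visited subdirectory, keep only pieces whose filename hashes
// into the first `fraction` of the hash space, and correct the extrapolation
// factor accordingly.
package estimator

import "hash/fnv"

// inSample decides whether a piece file belongs to the sampled fraction.
// Hashing the name gives a deterministic, roughly uniform selection.
func inSample(pieceName string, fraction float64) bool {
	h := fnv.New32a()
	h.Write([]byte(pieceName))
	return float64(h.Sum32()) < fraction*float64(^uint32(0))
}

// extrapolateFractional adjusts the estimate when only `fraction` of each of
// the `visited` subdirectories (out of 1024) was scanned:
// estimate = sampled_size × 1024 / (visited × fraction)
func extrapolateFractional(sampledSize int64, visited int, fraction float64) int64 {
	return int64(float64(sampledSize) * 1024 / (float64(visited) * fraction))
}
```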