Update used space using garbage collection

Ambifacient · July 11, 2024, 3:37am

When a node obtains a bloom filter from a satellite, it walks over all the pieces it stores for that satellite, checks each piece’s modification time and its membership in the bloom filter, and moves the piece to trash if those checks mark it as garbage. In fact, the retain process also returns the number of pieces for the satellite, as evidenced by log entries on successful completion.

GC is pretty much non-negotiable for node operators, whereas some disable the used-space filewalker to avoid the IOPS, since it is entirely possible to run a node without accurate used-space information.

GC also usually runs about once a week, whereas the used-space filewalker is triggered only on node startup, which for long-running nodes means only on every update.

Therefore I propose that the retain process also be used to update the used space for the satellite it is collecting garbage for. There are definitely some implementation details to discuss, such as handling cases where retain resumes from saved progress, but the idea is to leverage work that the node is already doing to provide quicker information updates.

This may be less important if a stable implementation of the file-stat-cache becomes the default for nodes.

Also, a side question, if anyone knows: does the used-space filewalker just ignore the increase in used space from ingress if those pieces are stored in prefixes that the filewalker has already visited?

Alexey · July 11, 2024, 4:43am

It already updates the used space after the retain process finishes successfully, provided there were no database-related errors during that time.

Yes, it does, because the upload process has already updated the database with the space used by this piece.
It also ignores the deletion of a piece, because the collector should update the database too (the currently released version has a bug, though: the collector doesn’t update the database; this is fixed only in 1.108.x).
It also ignores pieces moved to the trash by the retain process, because that process updates the databases on successful finish too.
It also ignores pieces removed from the trash, because the trash filewalker updates the databases on successful finish too.
However, this information is not written to the databases immediately but every 1h by default:

      --storage2.cache-sync-interval duration                    how often the space used cache is synced to persistent storage (default 1h0m0s)
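To illustrate what a periodic sync like this implies, here is a minimal Python sketch. It is not the storagenode code: the class `SpaceUsedCache` and all of its fields and methods are hypothetical names, and the only detail taken from the thread is the 1h default interval. It assumes each process (upload, collector, retain, trash filewalker) records a size delta in memory, and that the deltas reach persistent storage only when the interval has elapsed.

```python
from dataclasses import dataclass, field


@dataclass
class SpaceUsedCache:
    """Hypothetical in-memory used-space cache that is flushed periodically."""

    sync_interval_s: float = 3600.0                  # mirrors the 1h0m0s default
    persisted: dict = field(default_factory=dict)    # stand-in for the database
    pending: dict = field(default_factory=dict)      # deltas not yet written
    last_sync_s: float = 0.0

    def add(self, satellite: str, delta_bytes: int) -> None:
        """Record a change in used space (negative for deletes or trash)."""
        self.pending[satellite] = self.pending.get(satellite, 0) + delta_bytes

    def current(self, satellite: str) -> int:
        """In-memory view: persisted value plus deltas not yet synced."""
        return self.persisted.get(satellite, 0) + self.pending.get(satellite, 0)

    def maybe_sync(self, now_s: float) -> bool:
        """Write pending deltas to storage only if the interval has elapsed."""
        if now_s - self.last_sync_s < self.sync_interval_s:
            return False
        for satellite, delta in self.pending.items():
            self.persisted[satellite] = self.persisted.get(satellite, 0) + delta
        self.pending.clear()
        self.last_sync_s = now_s
        return True
```

Under these assumptions, a delta recorded just after a sync stays only in memory for up to one interval, which is why the persisted numbers can lag behind what the node has actually stored.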

jammerdan · July 11, 2024, 5:41am

--storage2.cache-sync-interval duration

Does this control the database updates for all the processes you have mentioned, including the upload process? So with the default setting, we see changes from uploads only once every hour?

Alexey · July 11, 2024, 6:19am

Likely yes, but I didn’t check the code.

littleskunk · July 11, 2024, 8:37am

The order is slightly different. The membership check comes first. Only for pieces that are not members of the bloom filter will checking the modification time and piece size consume additional IOPS.

This means that garbage collection can track the size of the pieces it has moved, but not the size of the remaining pieces that have not been moved into the trash folder. We could change that, but it would consume more IOPS and slow down garbage collection.
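
The trade-off can be sketched in Python as follows. This is a simplified model, not the storagenode implementation: `Piece`, `retain`, `created_before`, and `track_all_sizes` are hypothetical names, the bloom filter is stood in for by a plain set (so no false positives), each read of a piece's size or modification time is counted as one stat call, and the rule that pieces newer than the cutoff are kept is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Piece:
    piece_id: str
    size: int      # bytes; reading it is modelled as a stat call
    mtime: float   # modification time; same stat call as size


def retain(pieces, bloom_filter, created_before, track_all_sizes=False):
    """Membership check first; stat only pieces that fail it, unless
    track_all_sizes asks for the size of every piece."""
    stat_calls = 0
    trashed = []
    trashed_bytes = 0
    kept_bytes = 0 if track_all_sizes else None  # unknown unless tracked

    for piece in pieces:
        in_filter = piece.piece_id in bloom_filter  # no disk access needed
        if in_filter and not track_all_sizes:
            continue  # kept, and its size is never read

        stat_calls += 1  # size and mtime are read from disk here
        if not in_filter and piece.mtime < created_before:
            trashed.append(piece.piece_id)
            trashed_bytes += piece.size
        elif track_all_sizes:
            kept_bytes += piece.size

    return {
        "stat_calls": stat_calls,
        "trashed": trashed,
        "trashed_bytes": trashed_bytes,
        "kept_bytes": kept_bytes,
    }
```

In this model, the default mode stats only the pieces that miss the filter and can report the bytes moved to trash, while `track_all_sizes=True` yields a full used-space figure at the cost of one stat call per piece.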