Dear Storj and Mods - this is probably in the wrong place, so feel free to move / delete as required.
I’ve been following a few threads about disk allocation / GC and the ongoing fixes. I didn’t want to derail those, but I’ve got another issue I’ve been monitoring on my nodes that has become more apparent over the last 3 months, impacting disk I/O and SQLite DB access…
I’ve been watching the use of piece expirations on client uploads, and there has been a marked increase on the US1 satellite, both in the number of pieces being recorded in piece_expiration.db and in the load generated by the (currently default) hourly chore that purges those files…
As an example, my 10 TB node has this in piece_expiration.db, grouped by expiration year (a query along the lines of the one shown after the listings):
2022|46
2023|42019
2024|4427639
9999|37912
My 3 TB node has:
2023|1197
2024|378510
9999|13847
My 1 TB node has:
2024|144590
9999|6572
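For anyone wanting to compare their own nodes: those counts are just piece_expiration.db grouped by expiration year. Something along these lines produces the year|count output above - run it against a copy of the DB (or with the node stopped), and note the table and column names are what they appear to be on my nodes:

```
sqlite3 piece_expiration.db \
  "SELECT strftime('%Y', piece_expiration) AS year, count(*) FROM piece_expirations GROUP BY year ORDER BY year;"
```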
As an aside, the 2022 / 2023 rows have NULL for satellite_id and deletion_failed_at - and no, there are no delete errors; I don’t believe this condition is handled in the code at all, although it might be tidy to capture it.
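If anyone wants to check for similar long-expired leftovers on their own node, something like this counts rows that expired more than a week ago but are still sitting in the DB (same table / column assumptions as above):

```
sqlite3 piece_expiration.db \
  "SELECT count(*) FROM piece_expirations WHERE piece_expiration < datetime('now', '-7 days');"
```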
But my main point is the increased disk I/O I’m seeing on the larger nodes from the hourly chore that purges the expired pieces.
# how frequently expired pieces are collected
# collector.interval: 1h0m0s
Monitoring the chore on the large node, the per-run stats (approximate) are:
Minimum files deleted (not to trash) ~ 5,000
Average files deleted ~ 34,000
Peak files deleted ~ 201,000
…
In the code there are some hard-coded limits, which I assume are there to prevent the storage node from falling over. The default limits look off, as they would allow 2.4 million files to be purged as expired in a 24-hour period, roughly 1.5 TB of storage, and this can be problematic for poorly performing disks or systems, causing I/O issues. Obviously, from the numbers above we aren’t even close to that “yet”, but even at current levels the I/O is starting to become visible.
Also, the collector code isn’t very optimised. I understand there are other issues to look at, but currently it is a sequential process: every piece is a metadata lookup, a disk operation, then an update to SQLite. I guess this is why SNOs are having to move their DBs to SSD or NVMe; this single DB gets hit very hard the larger a node gets.
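To make the point concrete, here is a rough sketch (in Go, since that’s what the node is written in) of the shape I mean. This is not the actual storagenode collector; the table / column names and the blob path layout are my assumptions for illustration. The file deletes stay sequential, but the SQLite side is read once and cleared in a single transaction rather than one write per piece:

```go
// Sketch only: not the real storagenode code. Table/column names and the
// on-disk blob layout below are assumptions. The shape is: read the expired
// entries once, remove the files, then clear the DB rows in one transaction.
package main

import (
	"database/sql"
	"encoding/hex"
	"log"
	"os"
	"path/filepath"

	_ "github.com/mattn/go-sqlite3" // SQLite driver
)

type entry struct {
	satelliteID []byte
	pieceID     []byte
}

// removePieceFile stands in for the real blob deletion; the real storage
// directory layout is not what is shown here.
func removePieceFile(storageDir string, e entry) error {
	path := filepath.Join(storageDir, "blobs",
		hex.EncodeToString(e.satelliteID), hex.EncodeToString(e.pieceID))
	err := os.Remove(path)
	if os.IsNotExist(err) {
		return nil // already gone, nothing to do
	}
	return err
}

func main() {
	db, err := sql.Open("sqlite3", "piece_expiration.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// 1. One read for everything that has expired.
	rows, err := db.Query(`SELECT satellite_id, piece_id FROM piece_expirations
	                       WHERE piece_expiration < datetime('now')`)
	if err != nil {
		log.Fatal(err)
	}
	var expired []entry
	for rows.Next() {
		var e entry
		if err := rows.Scan(&e.satelliteID, &e.pieceID); err != nil {
			log.Fatal(err)
		}
		expired = append(expired, e)
	}
	rows.Close()

	// 2. The file deletes are still one operation per piece - unavoidable.
	var removed []entry
	for _, e := range expired {
		if err := removePieceFile("/mnt/storagenode/storage", e); err != nil {
			log.Printf("delete failed, leaving DB row in place: %v", err)
			continue
		}
		removed = append(removed, e)
	}

	// 3. One transaction to clear all the DB rows, instead of one write per piece.
	tx, err := db.Begin()
	if err != nil {
		log.Fatal(err)
	}
	stmt, err := tx.Prepare(`DELETE FROM piece_expirations
	                         WHERE satellite_id = ? AND piece_id = ?`)
	if err != nil {
		log.Fatal(err)
	}
	for _, e := range removed {
		if _, err := stmt.Exec(e.satelliteID, e.pieceID); err != nil {
			log.Fatal(err)
		}
	}
	stmt.Close()
	if err := tx.Commit(); err != nil {
		log.Fatal(err)
	}
	log.Printf("removed %d expired pieces, cleared their rows in one transaction", len(removed))
}
```

Even just wrapping the per-piece DB writes in a single transaction should noticeably cut the write traffic on piece_expiration.db, which I suspect is part of why SNOs end up moving the DBs to SSD / NVMe in the first place.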
There is also a 24-hour unpaid discrepancy in the code. Obviously not a big deal, but an SNO will be storing an expired piece for 24 hours after it expired, to cover time-zone anomalies, while the satellite only pays up to the expiry time (US1 time?).
Also, if the customer decides to delete the piece before the expiration, the entry is still left in the SQLite DB and still gets processed by the chore on the node. This just feels weird; expiration is usually used for legal retention, so being able to delete a piece before it expires feels like a missing feature?
Also, what about minimum retention? There feels like a very real exploit in setting the expiration very low and the upload volume very high. There should maybe be a minimum charge period for a piece with an expiration set, perhaps 7 days of minimum charge and retention. I really can’t think of a use case where a piece needs to expire 24-48 hours after upload.
and really finally…
Why does this collector chore even exist at all, with its own DB, the impact on upload speeds from writing to SQLite, and the I/O on deletes…
It seems that there will always be a percentage of an SNO’s HDD that is unpaid, given how garbage collection currently works, with no scope to make it real-time.
Why isn’t the whole expired-pieces collector code deleted, and the removal of expired pieces left to GC? I understand there’s a lot of rework going on there currently, but it seems logical to bundle the expired deletes into that process.
From an SNO viewpoint, it would be better than having hourly deletes for expired pieces happening all the time, along with the file walker (lazy or not) after every node update and garbage collection kicking in, all at the same time!
Also, when you start rewriting the GC code, it would be amazing if SNOs could specify a cron-style schedule in the config for when we want GC to run after it receives the bloom filter (maybe cache that to disk or a DB), and maybe even a maintenance toggle: a file created in the storage directory called “maintenance” which, when the node sees it, stops new ingress being accepted, with a nice graphic in the web UI. A rough sketch of the sentinel-file idea is below.
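To be clear about what I mean by the maintenance toggle, something as simple as this would do. The names, paths and functions here are purely illustrative; nothing like this exists in the node today:

```go
// Sketch of the "maintenance file" idea: before accepting a new upload, the
// node checks for a sentinel file in the storage directory and, if present,
// refuses new ingress so GC / file walker can run without competing I/O.
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// maintenanceMode reports whether the operator has dropped a file called
// "maintenance" into the storage directory.
func maintenanceMode(storageDir string) bool {
	_, err := os.Stat(filepath.Join(storageDir, "maintenance"))
	return err == nil
}

// acceptUpload is a stand-in for the upload handler: reject new ingress while
// the sentinel file exists.
func acceptUpload(storageDir string) error {
	if maintenanceMode(storageDir) {
		return errors.New("node is in maintenance mode, not accepting new pieces")
	}
	// ... normal upload path would go here ...
	return nil
}

func main() {
	if err := acceptUpload("/mnt/storagenode/storage"); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("upload accepted")
}
```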
Anyhow, this isn’t a complaint. My nodes are working fine; I’m just reporting a new trend in expired pieces that might have a bigger impact on underperforming nodes.
CP