Piece_expiration.db and collector service: increased use of self-expiring pieces?

Dear Storj, and Mods - this is probably in the wrong place so feel free to move / delete as required.

I’ve been monitoring a few threads about disk allocation / GC and ongoing fixes - I didn’t want to de-rail those, but I’ve got another issue I’ve been monitoring on my nodes that has become more apparent the last 3 months, impacting disk I/O and SQLite DB access…

I’ve been monitoring the usage of expired pieces when a client uploads, and there has been a marked increase on the US1 Sat with regard to the number of pieces being tagged in piece_expiration.db, and the load being generated by the (currently default) hourly chore to purge those files…

As an example, my 10 TB node has this breakdown (expiration year | piece count) in the piece_expiration DB - the query I used is sketched after the numbers;

2022|46
2023|42019
2024|4427639
9999|37912

My 3TB node has;

2023|1197
2024|378510
9999|13847

My 1TB node has;
2024|144590
9999|6572
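
For anyone wanting to reproduce those numbers: it’s just a per-year count of rows in piece_expiration.db. Here is a minimal Go sketch of the query - the path is an example, the table/column names (piece_expirations / piece_expiration) are what I see in current storagenode builds, and it uses the mattn/go-sqlite3 driver; run it against a copy of the DB or with the node stopped:

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3" // SQLite driver
)

func main() {
	// Example path; point this at your own node's storage directory.
	db, err := sql.Open("sqlite3", "/mnt/storagenode/storage/piece_expiration.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Count rows per expiration year, matching the year|count output above.
	rows, err := db.Query(`SELECT strftime('%Y', piece_expiration), COUNT(*)
		FROM piece_expirations GROUP BY 1 ORDER BY 1`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var year string
		var count int64
		if err := rows.Scan(&year, &count); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s|%d\n", year, count)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}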

As an aside, the 2022 / 2023 rows have NULL for satellite_id and deletion_failed_at - and no, there are no delete errors, as I don’t believe this condition is handled in the code - although it might be tidy to capture it :slight_smile:

But my main point is the increased disk I/O I’m seeing on the larger nodes running the hourly chore to purge the expired pieces.

# how frequently expired pieces are collected
# collector.interval: 1h0m0s

Monitoring the chore on the large node, the approximate per-run stats are:

Minimum files deleted (not to trash) ~ 5,000
Average files deleted ~ 34,000
Peak files deleted ~ 201,000

In the code there are some hard-coded limits, which I assume are there to prevent the storage node from falling over. The defaults look off, as they would allow 2.4 million files to be purged as expired in a 24-hour period (i.e. roughly 100,000 per hourly run), around 1.5 TB of storage, and this can be problematic for poorly performing disks or systems, causing I/O issues. Obviously, from the numbers above we aren’t even close to this “yet”, but even at current levels the I/O is starting to become visible.

Also, the collector code isn’t very optimized - I understand there are other issues to look at, but currently it is a sequential process: every piece is a meta lookup, a disk scan, then an update to SQLite. I guess this is why SNOs are having to move DBs to SSD or NVMe - the larger a node gets, the harder this single DB gets hit.
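
To make that concrete, here is a rough sketch of the shape of that loop - illustrative only, with made-up type and helper names, not a copy of the real collector code:

// Illustrative only: the shape of a sequential expired-piece collector -
// one metadata lookup, one unlink and one SQLite round trip per piece.
package collector

import (
	"context"
	"database/sql"
	"os"
)

// expiredPiece is a hypothetical stand-in for the collector's piece record.
type expiredPiece struct {
	satelliteID []byte
	pieceID     []byte
	path        string // resolved blob path on disk
}

func collectExpired(ctx context.Context, db *sql.DB, expired []expiredPiece) error {
	for _, p := range expired {
		if _, err := os.Stat(p.path); err == nil { // meta lookup
			if err := os.Remove(p.path); err != nil { // disk delete
				continue // e.g. record deletion_failed_at and move on
			}
		}
		// one SQLite write per piece - this is where the DB gets hit hard
		if _, err := db.ExecContext(ctx,
			`DELETE FROM piece_expirations WHERE satellite_id = ? AND piece_id = ?`,
			p.satelliteID, p.pieceID); err != nil {
			return err
		}
	}
	return nil
}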

There is also a 24hr unpaid discrepancy in the code - obviously not a big deal, but an SNO will be storing an expired piece for 24hrs after it was expired, to cover time zone anomalies, while the satellite only pays up to the expiry time - US1 time?

Also, if the customer decides to delete the piece before the expiration, the entry is still left in the SQLite DB and still gets processed by the node - this just feels weird. Usually expiration is used for legal retention, so being able to delete the piece before it expires feels like a missing feature?

Also, minimum retention? There feels like a very real exploit available in setting the expiration very low and the upload volume very high. Maybe there should be a minimum charge period for a piece with an expiration set - say 7 days minimum charge and retention. I really can’t think of a use case where a piece needs to expire 24 - 48hrs after upload?

and really finally…

Why does this collector chore even exist, with its DB, the impact on upload speeds from writing to SQLite, and the I/O on deletes…

It seems there will always be a percentage of an SNO’s HDD that goes unpaid, given how garbage collection currently works and that there is no scope to make it real time.

Why isn’t the whole expired-pieces collector code deleted, and removing the expired pieces left to GC? I understand there’s a lot of rework there currently :smiley: but it seems logical to bundle the expired deletes into that process.

From an SNO viewpoint, it would be better than having hourly deletes happening all the time for expired pieces, along with the file walker (lazy or not) after every node update and garbage collection kicking in, all at the same time!

Also, when you start rewriting the GC code, it would be amazing if SNOs could specify a cron-style time format in the config to say when we want GC to run after it receives the bloom filter (maybe cache that to disk or a DB), and maybe even a maintenance toggle - a file created in the storage directory called maintenance which, when the node sees it, stops new ingress being accepted, with a nice graphic in the web UI :sunglasses:
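
Something like this is all I mean - a purely illustrative sketch, nothing here exists in the node today; it borrows the third-party github.com/robfig/cron/v3 package for the cron expression, and the sentinel file name is made up:

// Illustrative sketch of the two ideas above: a cron-style GC schedule and a
// "maintenance" sentinel file. Not existing storagenode code.
package main

import (
	"log"
	"os"
	"path/filepath"

	"github.com/robfig/cron/v3"
)

const storageDir = "/mnt/storagenode/storage" // example path

// maintenanceMode reports whether the hypothetical sentinel file exists;
// the upload handler could refuse new ingress while it does.
func maintenanceMode() bool {
	_, err := os.Stat(filepath.Join(storageDir, "maintenance"))
	return err == nil
}

func main() {
	c := cron.New()
	// Hypothetical config value, e.g. "0 3 * * *": run GC at 03:00 local time
	// against the most recently cached bloom filter.
	if _, err := c.AddFunc("0 3 * * *", func() {
		log.Println("running GC against cached bloom filter; maintenance:", maintenanceMode())
		// runGC() would go here in a real node.
	}); err != nil {
		log.Fatal(err)
	}
	c.Start()
	select {} // block forever; a real node has its own lifecycle management
}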

Anyhow, this isn’t a complaint - my nodes are working fine, I’m just reporting a new trend on expired pieces that might have a bigger impact on underperforming nodes.

CP

:heart:

7 Likes

Thanks, there is a lot to unpack here. I can say that there are some changes going in soon for trash management to reduce significant I/O.

As for why someone would upload something with a 24hr delete period: it really depends on the use case. I could see a monitoring system keeping 24-hour file storage daily, and then having weekly, monthly, or yearly storage. It all depends on how they want to back up their data.

As to why some decisions are made on how deletes are handled, I will ask an engineer to take a look at your questions, but typically the choices are made to give the end user maximum performance, and the nodes can sometimes have less optimal processes because of that. This is stuff that the team looks at and discusses to improve overall node performance as it becomes apparent we have an issue. Like anything, though, major changes to how something is done take a lot of discussion and planning. Anyway, I will see if someone can answer your other questions.

On the topic of this change, which I cannot comment on there for lack of a Gerrit account :cry: I believe this might be unnecessary complexity. I’m running my nodes with a much smaller, simpler change that achieves roughly the same effect. I wanted to submit it once I’ve made sure it works on my nodes, but I guess I can just as well submit a draft PR… storagenode/{blobstore,pieces}: remember the earliest timestamp of a trash piece by liori · Pull Request #6789 · storj/storj · GitHub

Feel free to discard it though, the change in Gerrit also has its merits.

5 Likes

I think that’s entirely reasonable. For my own part, I wasn’t aware that the expired pieces collector was causing any issues for anyone. (Maybe some of us were aware; I’m just saying I wasn’t.) The collector only exists so that nodes can opportunistically remove pieces that it knows it doesn’t need to store anymore, without waiting for the next GC bloom filter. If that isn’t helping, let’s get rid of it.

Maybe we could even get rid of piece_expiration.db entirely, also saving time for the trash emptying filewalker, which has to try and remove expiration times for every piece it deletes.

I would love to be able to show maintenance tasks like this in the web UI, and allow (some of) them to be started or stopped manually. It’s just a matter of having resources allocated for doing the work. I think we’ll get there at some point. A cron-style task definition seems less probable, as we would generally prefer to avoid large sections of Unix-specific code.

4 Likes

I don’t see any problems either on my nodes. If anything, piece expiration is a more efficient mechanism. GC would need to first move pieces to trash, only then remove them. Piece expiration saves I/O time.

And so, well, if a node gets a lot of pieces with expiration date set and complains about I/O, using GC to clean them up will make the I/O worse.

If there is an I/O problem with piece expiration, I would seek to solve it, not remove the mechanism altogether. Quickly looking at the code, there’s one potential improvement to be tried: running a single DELETE statement to remove entries of multiple pieces in a single transaction, as opposed to removing them one by one.
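
Roughly this shape, as an untested sketch - the table and column names are assumptions based on piece_expiration.db, and the chunking is just there to stay under SQLite’s bound-parameter limit:

// Untested sketch of the batching idea: one DELETE ... IN (...) per chunk of
// piece IDs instead of one statement per piece.
package collector

import (
	"context"
	"database/sql"
	"strings"
)

func deleteExpiredBatch(ctx context.Context, db *sql.DB, pieceIDs [][]byte) error {
	const chunkSize = 1000 // stay well below SQLite's bound-parameter limit

	for start := 0; start < len(pieceIDs); start += chunkSize {
		end := start + chunkSize
		if end > len(pieceIDs) {
			end = len(pieceIDs)
		}
		chunk := pieceIDs[start:end]

		placeholders := strings.TrimSuffix(strings.Repeat("?,", len(chunk)), ",")
		args := make([]any, len(chunk))
		for i, id := range chunk {
			args[i] = id
		}

		// A single statement is a single implicit transaction in SQLite, so
		// this commits once per chunk instead of once per piece.
		if _, err := db.ExecContext(ctx,
			`DELETE FROM piece_expirations WHERE piece_id IN (`+placeholders+`)`,
			args...); err != nil {
			return err
		}
	}
	return nil
}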

2 Likes

I agree, and would say all these different types of deletes the node can do only add to code complexity. There should be one and only one way to delete files from the node, and GC looks like a reasonable way to do that, given it can now run as a best-effort job.
In the future maybe there will be an option to change the expiration time of an already uploaded file (if this isn’t a thing already), and then you would have to implement this on the satellite and also in the node code.
I was also reading that some nodes are running with quite a big clock skew, which would mean pieces of the same file are expired at different times.
Just keep it simple: delete everything using the bloom filters and GC, and get rid of everything else.

And to add, you can’t recover files expired by the node, as those aren’t moved to trash but deleted directly, as I understand it. Maybe in the future there will be an offering with the ability to recover files within the next N days while they are still in the trash. In such a case you wouldn’t be able to recover the ones expired by the node, which would be another reason to unify how files are deleted on the node.

The expiration date is part of the object’s metadata, so to alter it you need to re-upload the object with a new expiration date.
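
For illustration, a minimal client-side sketch with the Go uplink library (the access grant, bucket and key are placeholders): download the object and upload it again with a new Expires value.

// Re-upload an object to change its expiration, since the expiration is
// immutable object metadata. Placeholders: access grant, bucket, key.
package main

import (
	"context"
	"io"
	"log"
	"time"

	"storj.io/uplink"
)

func main() {
	ctx := context.Background()

	access, err := uplink.ParseAccess("...access grant...")
	if err != nil {
		log.Fatal(err)
	}
	project, err := uplink.OpenProject(ctx, access)
	if err != nil {
		log.Fatal(err)
	}
	defer project.Close()

	// Read the existing object back...
	download, err := project.DownloadObject(ctx, "my-bucket", "my-key", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer download.Close()

	// ...and write it again with an expiration one week out.
	upload, err := project.UploadObject(ctx, "my-bucket", "my-key",
		&uplink.UploadOptions{Expires: time.Now().Add(7 * 24 * time.Hour)})
	if err != nil {
		log.Fatal(err)
	}
	if _, err := io.Copy(upload, download); err != nil {
		_ = upload.Abort()
		log.Fatal(err)
	}
	if err := upload.Commit(); err != nil {
		log.Fatal(err)
	}
}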

1 Like

I thought about this too, and even mentioned it on the forum, but then I realised that from an SNO point of view this would be OK - stopping ingress while some intensive walkers run - but from the network’s point of view it would be unacceptable. Imagine sending a bloom filter and suddenly the entire available space on the network is gone, because the nodes have started collecting garbage and stopped accepting ingress.
The clients would be locked out of uploads.

1 Like

By the way, it was supposed to be that anyone could log into our Gerrit using an existing Github account. If that’s not working, let us know.

So, what I wrote is technically wrong, sorry. I do have an account, but I do not have permissions.

The wiki on GitHub says (emphasis mine):

Before requesting Gerrit access in Slack, you will need to log in at https://review.dev.storj.io/ with your Github account. Once you have done this, share your Github email and username, and ask for review and submit permissions in the #gerrit Slack channel.

Interesting, it allows you to view the change even without a GitHub account. I believe this message means it’s needed if you want to contribute; it’s not required for viewing.

If you just want to comment, then you need to sign in to Gerrit using GitHub. You can then comment on any part of the code.

Here’s how it looks.

2 Likes