Dear Storj and Mods - this is probably in the wrong place, so feel free to move / delete as required.
I’ve been following a few threads about disk allocation / GC and the ongoing fixes. I didn’t want to derail those, but I’ve got another issue I’ve been monitoring on my nodes that has become more apparent over the last 3 months, impacting disk I/O and SQLite DB access…
I’ve been watching the use of piece expirations on client uploads, and there has been a marked increase on the US1 satellite, both in the number of pieces being recorded in piece_expiration.db and in the load generated by the (currently default) hourly chore that purges those files…
As an example, my 10 TB node has this in piece_expiration.db, grouped by expiration year (a query along the lines of the one shown after the listings):
2022|46
2023|42019
2024|4427639
9999|37912
My 3 TB node has:
2023|1197
2024|378510
9999|13847
My 1 TB node has:
2024|144590
9999|6572
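For anyone wanting to compare their own nodes: those counts are just piece_expiration.db grouped by expiration year. Something along these lines produces the year|count output above - run it against a copy of the DB (or with the node stopped), and note the table and column names are what they appear to be on my nodes:

```
sqlite3 piece_expiration.db \
  "SELECT strftime('%Y', piece_expiration) AS year, count(*) FROM piece_expirations GROUP BY year ORDER BY year;"
```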
As an aside, the 2022 / 2023 rows have NULL for satellite_id and deletion_failed_at - and no, there are no delete errors; I don’t believe this condition is handled in the code at all, although it might be tidy to capture it.
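If anyone wants to check for similar long-expired leftovers on their own node, something like this counts rows that expired more than a week ago but are still sitting in the DB (same table / column assumptions as above):

```
sqlite3 piece_expiration.db \
  "SELECT count(*) FROM piece_expirations WHERE piece_expiration < datetime('now', '-7 days');"
```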
But my main point is the increased disk I/O I’m seeing on the larger nodes from the hourly chore that purges the expired pieces.
# how frequently expired pieces are collected
# collector.interval: 1h0m0s
Monitoring the chore on the large node, the per-run stats (approximate) are:
Minimum files deleted (not to trash) ~ 5,000
Average files deleted ~ 34,000
Peak files deleted ~ 201,000
…
In the code there are some hard-coded limits, which I assume are there to prevent the storage node from falling over. The default limits look off, as they would allow 2.4 million files to be purged as expired in a 24-hour period, roughly 1.5 TB of storage, and this can be problematic for poorly performing disks or systems, causing I/O issues. Obviously, from the numbers above we aren’t even close to that “yet”, but even at current levels the I/O is starting to become visible.
Also, the collector code isn’t very optimised. I understand there are other issues to look at, but currently it is a sequential process: every piece is a metadata lookup, a disk operation, then an update to SQLite. I guess this is why SNOs are having to move their DBs to SSD or NVMe; this single DB gets hit very hard the larger a node gets.
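To make the point concrete, here is a rough sketch (in Go, since that’s what the node is written in) of the shape I mean. This is not the actual storagenode collector; the table / column names and the blob path layout are my assumptions for illustration. The file deletes stay sequential, but the SQLite side is read once and cleared in a single transaction rather than one write per piece:

```go
// Sketch only: not the real storagenode code. Table/column names and the
// on-disk blob layout below are assumptions. The shape is: read the expired
// entries once, remove the files, then clear the DB rows in one transaction.
package main

import (
	"database/sql"
	"encoding/hex"
	"log"
	"os"
	"path/filepath"

	_ "github.com/mattn/go-sqlite3" // SQLite driver
)

type entry struct {
	satelliteID []byte
	pieceID     []byte
}

// removePieceFile stands in for the real blob deletion; the real storage
// directory layout is not what is shown here.
func removePieceFile(storageDir string, e entry) error {
	path := filepath.Join(storageDir, "blobs",
		hex.EncodeToString(e.satelliteID), hex.EncodeToString(e.pieceID))
	err := os.Remove(path)
	if os.IsNotExist(err) {
		return nil // already gone, nothing to do
	}
	return err
}

func main() {
	db, err := sql.Open("sqlite3", "piece_expiration.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// 1. One read for everything that has expired.
	rows, err := db.Query(`SELECT satellite_id, piece_id FROM piece_expirations
	                       WHERE piece_expiration < datetime('now')`)
	if err != nil {
		log.Fatal(err)
	}
	var expired []entry
	for rows.Next() {
		var e entry
		if err := rows.Scan(&e.satelliteID, &e.pieceID); err != nil {
			log.Fatal(err)
		}
		expired = append(expired, e)
	}
	rows.Close()

	// 2. The file deletes are still one operation per piece - unavoidable.
	var removed []entry
	for _, e := range expired {
		if err := removePieceFile("/mnt/storagenode/storage", e); err != nil {
			log.Printf("delete failed, leaving DB row in place: %v", err)
			continue
		}
		removed = append(removed, e)
	}

	// 3. One transaction to clear all the DB rows, instead of one write per piece.
	tx, err := db.Begin()
	if err != nil {
		log.Fatal(err)
	}
	stmt, err := tx.Prepare(`DELETE FROM piece_expirations
	                         WHERE satellite_id = ? AND piece_id = ?`)
	if err != nil {
		log.Fatal(err)
	}
	for _, e := range removed {
		if _, err := stmt.Exec(e.satelliteID, e.pieceID); err != nil {
			log.Fatal(err)
		}
	}
	stmt.Close()
	if err := tx.Commit(); err != nil {
		log.Fatal(err)
	}
	log.Printf("removed %d expired pieces, cleared their rows in one transaction", len(removed))
}
```

Even just wrapping the per-piece DB writes in a single transaction should noticeably cut the write traffic on piece_expiration.db, which I suspect is part of why SNOs end up moving the DBs to SSD / NVMe in the first place.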
There is also a 24-hour unpaid discrepancy in the code. Obviously not a big deal, but an SNO will be storing an expired piece for 24 hours after it expired, to cover time-zone anomalies, while the satellite only pays up to the expiry time (US1 time?).
Also, if the customer decides to delete the piece before the expiration, the entry is still left in the SQLite DB and still gets processed by the chore on the node. This just feels weird; expiration is usually used for legal retention, so being able to delete a piece before it expires feels like a missing feature?
Also, what about minimum retention? There feels like a very real exploit in setting the expiration very low and the upload volume very high. There should maybe be a minimum charge period for a piece with an expiration set, perhaps 7 days of minimum charge and retention. I really can’t think of a use case where a piece needs to expire 24-48 hours after upload.
and really finally…
Why does this collector chore even exist at all, with its own DB, the impact on upload speeds from writing to SQLite, and the I/O on deletes…
It seems that there will always be a percentage of an SNO’s HDD that is unpaid, given how garbage collection currently works, with no scope to make it real-time.
Why isn’t the whole expired-pieces collector code deleted, and the removal of expired pieces left to GC? I understand there’s a lot of rework going on there currently, but it seems logical to bundle the expired deletes into that process.
From an SNO viewpoint, it would be better than having hourly deletes for expired pieces happening all the time, along with the file walker (lazy or not) after every node update and garbage collection kicking in, all at the same time!
Also, when you start rewriting the GC code, it would be amazing if SNOs could specify a cron-style schedule in the config for when we want GC to run after it receives the bloom filter (maybe cache that to disk or a DB), and maybe even a maintenance toggle: a file created in the storage directory called “maintenance” which, when the node sees it, stops new ingress being accepted, with a nice graphic in the web UI. A rough sketch of the sentinel-file idea is below.
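To be clear about what I mean by the maintenance toggle, something as simple as this would do. The names, paths and functions here are purely illustrative; nothing like this exists in the node today:

```go
// Sketch of the "maintenance file" idea: before accepting a new upload, the
// node checks for a sentinel file in the storage directory and, if present,
// refuses new ingress so GC / file walker can run without competing I/O.
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// maintenanceMode reports whether the operator has dropped a file called
// "maintenance" into the storage directory.
func maintenanceMode(storageDir string) bool {
	_, err := os.Stat(filepath.Join(storageDir, "maintenance"))
	return err == nil
}

// acceptUpload is a stand-in for the upload handler: reject new ingress while
// the sentinel file exists.
func acceptUpload(storageDir string) error {
	if maintenanceMode(storageDir) {
		return errors.New("node is in maintenance mode, not accepting new pieces")
	}
	// ... normal upload path would go here ...
	return nil
}

func main() {
	if err := acceptUpload("/mnt/storagenode/storage"); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("upload accepted")
}
```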
Anyhow, this isn’t a complaint. My nodes are working fine; I’m just reporting a new trend in expired pieces that might have a bigger impact on underperforming nodes.
CP