Large temporary folder

There is a ticket for it, but it hasn’t been scheduled or roadmapped by project management yet. I will let them know that some people are very interested in its progress; maybe that will get it scheduled sooner.

Yes, I agree. Those are valid issues. Node start, though, is an exceptionally good time to do this task; we could schedule a tempfile cleanup task every N days, but during normal runtime it may be very hard to know whether a tempfile is still open and expected to be present by some other part of the system. When the node is starting up, there is an easy guarantee: any file with mtime < starttime can be thrown away.
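
Below is a minimal sketch in Go of what that startup cleanup could look like. The function and directory names (`cleanupTempDir`, `/storage/temp`) are hypothetical, not the actual storagenode code.

```go
package main

import (
	"log"
	"os"
	"path/filepath"
	"time"
)

// cleanupTempDir removes every regular file in dir whose mtime is older
// than cutoff. At node start, cutoff can safely be the process start
// time: no running component can still be using those files.
func cleanupTempDir(dir string, cutoff time.Time) error {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return err
	}
	for _, entry := range entries {
		if entry.IsDir() {
			continue
		}
		info, err := entry.Info()
		if err != nil {
			continue // the file may have been removed concurrently
		}
		if info.ModTime().Before(cutoff) {
			if err := os.Remove(filepath.Join(dir, entry.Name())); err != nil {
				log.Printf("could not remove %s: %v", entry.Name(), err)
			}
		}
	}
	return nil
}

func main() {
	startTime := time.Now()
	if err := cleanupTempDir("/storage/temp", startTime); err != nil {
		log.Fatal(err)
	}
}
```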

So there are also good reasons to do this at start. However, maybe we can make everything work without adding another on-start task. We’d need to do a quick audit of all code touching files in the temporary directory to be sure, but maybe there is some age A such that all tempfiles of age >= A can be safely deleted. If so, we would be better off using a regularly scheduled task instead of doing it on start.
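
If the audit does find such an age A, the scheduled variant is a small extension of the sketch above (same package and imports): the same hypothetical `cleanupTempDir`, run on a ticker with a cutoff of now minus A. The interval and age values would presumably come from config; everything here is illustrative.

```go
// runScheduledCleanup deletes tempfiles older than maxAge once per
// interval, reusing cleanupTempDir from the previous sketch. It stops
// when the stop channel is closed.
func runScheduledCleanup(dir string, maxAge, interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			// Anything older than maxAge is assumed abandoned. This is
			// only safe if the audit confirms that no code path keeps a
			// tempfile open (or expects it to exist) for that long.
			if err := cleanupTempDir(dir, time.Now().Add(-maxAge)); err != nil {
				log.Printf("tempfile cleanup failed: %v", err)
			}
		case <-stop:
			return
		}
	}
}
```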

No, garbage collection normally happens when a Retain request is received from a satellite. It doesn’t have anything to do with node start.

It’s not entirely clear what you mean: are you using hyperbole, or is there really something that gets run 8 times in a row when you stop and start the node? Because yes, certainly, nothing should need to run 8 times in a row when that happens.

If you mean that you stopped and started the node 8 times in a row, and the directory traversal happened once each time, then yes, that’s a thing that happens. The reason is that the software needs to know how much space it has used on the drive (not how much space is used on the volume, but how many bytes have been used for file storage inside the node’s blobs directory). Without that, we wouldn’t be able to provide the “don’t use more than X bytes” feature.
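
For concreteness, here is roughly what that traversal has to do, as a hedged Go sketch (the path and function name are illustrative): walk the blobs directory and total the size of every file in it. On a node holding millions of blobs, this is the slow, i/o-heavy step being discussed.

```go
package main

import (
	"fmt"
	"io/fs"
	"path/filepath"
)

// computeSpaceUsed walks blobsDir and totals the bytes of every file
// underneath it.
func computeSpaceUsed(blobsDir string) (int64, error) {
	var total int64
	err := filepath.WalkDir(blobsDir, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			return nil
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		total += info.Size()
		return nil
	})
	return total, err
}

func main() {
	used, err := computeSpaceUsed("/storage/blobs")
	if err != nil {
		panic(err)
	}
	fmt.Printf("space used: %d bytes\n", used)
}
```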

We keep track of changes to the space-used value when writing or deleting blobs, but since the filesystem is of course not transactional, there is always a chance of our cached space-used count getting out of sync with what’s actually on disk. And when the service is newly started, there is always a chance that a previous invocation crashed without being able to persist the space-used value to disk, so the risk of exceeding our data allowance is even higher.
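
Schematically, the bookkeeping looks something like this (all names are illustrative, not the actual implementation): an in-memory counter adjusted on every blob write or delete, persisted to disk from time to time. The gap between the last persist and a crash is exactly the window where the saved value goes stale.

```go
package node

import (
	"os"
	"strconv"
	"sync/atomic"
)

// spaceUsedCache tracks bytes stored in the blobs directory.
type spaceUsedCache struct {
	bytes int64  // accessed atomically
	path  string // file where the count is persisted
}

func (c *spaceUsedCache) addBlob(size int64)    { atomic.AddInt64(&c.bytes, size) }
func (c *spaceUsedCache) removeBlob(size int64) { atomic.AddInt64(&c.bytes, -size) }

// persist writes the current count to disk. If the process crashes
// after some adds/removes but before the next persist, the saved value
// no longer matches the filesystem, which is why a full traversal at
// start is the safe default.
func (c *spaceUsedCache) persist() error {
	v := atomic.LoadInt64(&c.bytes)
	return os.WriteFile(c.path, []byte(strconv.FormatInt(v, 10)), 0o644)
}
```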

There are some possible mitigations we could employ, though:

  1. introduce a scheduling layer for blob i/o and make the space-used update traversal be low priority, so that it would only make progress when there is no other i/o traffic going on
  2. add a config item indicating that there is no limit on the amount of space used, other than the size of the volume in which the storage directory lives. In this case, the node could use the volume’s filesystem stats to determine space used without ever doing a dir traversal (see the sketch after this list)
  3. add a config item which says explicitly “don’t do a directory traversal to update space-used; just trust the last value that you saved instead”. You could use this when tinkering to avoid incurring the extra load.
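
As a sketch of option 2: when the volume is dedicated to the node, filesystem stats answer the space question in constant time, with no traversal at all. This example uses `golang.org/x/sys/unix` and is Unix-only; the path is a placeholder, and nothing here reflects an actual config option.

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// volumeSpace returns total and available bytes for the filesystem
// containing path, straight from statfs. No directory walk needed.
func volumeSpace(path string) (total, avail uint64, err error) {
	var st unix.Statfs_t
	if err = unix.Statfs(path, &st); err != nil {
		return 0, 0, err
	}
	total = st.Blocks * uint64(st.Bsize)
	avail = st.Bavail * uint64(st.Bsize)
	return total, avail, nil
}

func main() {
	total, avail, err := volumeSpace("/storage")
	if err != nil {
		panic(err)
	}
	fmt.Printf("volume: %d bytes total, %d bytes available\n", total, avail)
}
```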