Current situation with garbage collection

littleskunk · April 2, 2024, 4:47pm

Some of you might have noticed a higher amount of garbage. This has multiple reasons.

1. Satellites didn’t send any bloom filters for a week. The job that generates the bloom filters was running but couldn’t upload them to Storj. The shared project we are using for this was hitting its limits. We fixed this issue. Last weekend all nodes should have received a bloom filter to work with. That bloom filter will just have to clean up a bit more garbage than usual.
2. Storagenode v1.100 and v1.101 have problems running GC with the lazy filewalker enabled. If you are running v1.100 or v1.101 for whatever reason, you can either downgrade to v1.99 or disable the lazy filewalker. Since disabling the lazy filewalker can consume a lot of resources, I would recommend downgrading.
3. Bloom filter size too small for bigger storage nodes → higher false positive rate. Bloom filters have a chance of a false positive. That would be a garbage piece that is a false match against the bloom filter, so the storage node keeps it on disk instead of deleting it. The false positive rate depends on the size of the bloom filter. The more pieces a storage node is storing, the bigger the bloom filter needs to be to keep the false positive rate low enough. The issue is that we are currently limited by the DRPC message size. We have a fix ready but need to wait until all storage nodes are updated. Full history of this ticket can be found here: ☂ {storagenode,satellite}/gc: bloom filters are ineffective with large storage nodes · Issue #6686 · storj/storj · GitHub
4. Deletion of suspended accounts → more garbage. Over the years we suspended many accounts for different reasons. For example, if a customer doesn’t pay an invoice, the satellite will warn the customer first and later suspend the account. If the customer still doesn’t pay the outstanding invoices, the account ends up in a pending deletion state. For a long time that was the end of most account suspensions. We locked the account and marked it for later deletion but kept the data in the network. Last week we ran our new deletion procedure for the first time and cleaned up a lot of free tier abuse data.
5. Time delay between data deletion on the satellite side and the storage node receiving the corresponding bloom filter. I believe number 4 wasn’t reflected in the last bloom filter this weekend. So even if you are running a smaller node that isn’t affected by most of the above points, you should see a higher amount of garbage that hasn’t been moved into the trash folder yet. The bloom filter creation runs from a DB backup that is a few days old. I would expect the next bloom filter to move more data into the trash folder, to the point that the used space reported by the satellite gets closer to the used space reported by the storage node itself.
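
To illustrate the point about filter size and false positives, here is a minimal sketch using the textbook bloom filter approximation p ≈ (1 − e^(−kn/m))^k. The 2 MB filter size and the piece counts are illustrative assumptions, not the actual DRPC limit or the parameters the satellite uses.

```python
import math

def false_positive_rate(n_pieces: int, filter_bytes: int) -> float:
    """Approximate false positive rate of a bloom filter of a fixed size.

    Uses the standard approximation (1 - e^(-k*n/m))^k with the number of
    hash functions k chosen close to the optimum (m/n) * ln 2, at least 1.
    """
    m_bits = filter_bytes * 8
    k = max(1, round(m_bits / n_pieces * math.log(2)))
    return (1.0 - math.exp(-k * n_pieces / m_bits)) ** k

# Hypothetical fixed filter size; the real cap is not stated in this thread.
FILTER_BYTES = 2_000_000

for pieces in (1_000_000, 10_000_000, 50_000_000):
    rate = false_positive_rate(pieces, FILTER_BYTES)
    print(f"{pieces:>11,} pieces -> false positive rate {rate:.4f}")
```

With the filter size held constant, the rate climbs quickly as the piece count grows, which is why a size cap hurts large nodes the most.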

Ruskiem · April 2, 2024, 8:38pm

What? NOICE…
i mean…
Thx for honesty and … generosity.
Great You are doing the spring cleaning as well!

Once again: is version 1.95 (Windows) also fully capable of benefiting from the latest technology achievements in bloom filters, or do I have to update to something else?

littleskunk · April 2, 2024, 8:45pm

Depends on which improvements you mean. The lazy filewalker has been in the code for many releases. Even an older version would still have those improvements.

d4rk4 · April 2, 2024, 9:51pm

When do you intend to roll out fixes for GC for the 1.101.x series? I manage significantly large nodes and am interested in utilizing on-disk caching for Bloom filters.

littleskunk · April 2, 2024, 10:47pm

We don’t have a fix yet. It will take at least a few more days.

Toyoo · April 3, 2024, 12:28am

This is also interesting. It seems that segments that got axed were predominantly of the smaller size (source):
[Screenshot]
Some legend, because Grafana is so bad at it: this is the last 7 days, and the “flat line” actually starts at 8.69 MB and is very slowly going down, as it has been for the last months.

As a node operator I like this change, as it means faster file walkers per TB of data stored in the future.

Though this also means that the immediately incoming GC/trash file walker runs will be truly monstrous due to the absolute number of files trashed/removed. Enjoy the ride, because those will be the biggest file walker runs ever!

Toyoo · April 3, 2024, 8:29pm

Party’s canceled.
[Screenshot]

Ruskiem · April 4, 2024, 5:44am

Oh boy, You mean an Earthquake Alert?

Okay, so when is the BIG bloom filter expected to arrive, please?
Because, for example, my nodes are constantly running the used-space filewalker ("storage2.piece-scan-on-startup: true") in the hope that it will fix the discrepancy. But from what You said, it seems it doesn’t fix any, so it seems wise to turn off the used-space FW entirely, so the disk is calm and 100% ready to take what will probably fix it, the big GC wave, right?

If You could tell the date the bloom filter will arrive, that would be nice.
Also maybe an official tsunami alert, if You like the idea.
I really think the used-space FW is pointless right now. Much more important is to “make space” for the next trashman cleaning filewalker, and everyone should be aware to get their disk ready for that in the first place so it won’t get interrupted.

littleskunk · April 4, 2024, 9:10am

That is a different story. To my knowledge you can keep the used space filewalker disabled for a long time without any discrepancy. The used space filewalker calculates how much space you are using locally on disk. If your storage node dashboard shows the same used space as your OS is reporting, there is no need to run the filewalker at all. All it would do is finish again with the same result.

The next GC run will be about as expensive as any other GC run. The expensive part is to check all the piece IDs against the bloom filter. The more pieces your node has the longer this part takes regardless of how many pieces will be moved into the trash folder. On my nodes I can see that up to 20% of my used space is garbage. That means 80% of the pieces will still cause the same amount of GC runtime and just 20% will be a bit more expensive.

Ruskiem · April 4, 2024, 9:39am

Thx!
it starts like this friday right? (or saturday)

tankmann · April 4, 2024, 7:05pm

Mine has also been collecting a lot for weeks. Is that normal in regard to this post?

Feels that this is WAY too much?
Thanks

pangolin · April 4, 2024, 8:00pm

This is looking very much like my nodes. The older the node the bigger the loss in percent.

Prepare to see more carnage within the next weeks.

Toyoo · April 4, 2024, 8:00pm

To add potentially useful information: the one case I know of where the cached used space diverges from actual usage is when there’s a significant number of “failed to add bandwidth usage” errors. If a node operator sees them, this signifies that at some point the node may severely misreport the amount of disk space, and a used space file walker run will be needed. However, the operator should probably first fix the root cause of these errors, usually slow I/O, and only then enable the file walker for one full run, preferably while reducing the allocation to disable uploads for the duration of the run.

As for the next GC run being about as expensive as any other: this is likely not true unless you have tons of RAM. Moving a file to trash is an expensive operation, and the more files there are to be moved, the more time it will take. Granted, in the usual case where the number of trashed files is small compared to the total, this is insignificant. However, I wouldn’t be surprised to see the file walker taking several times longer than usual if a GC trashes 10% of all files, and this ratio is what I’m observing on my nodes.

To expand on this. We have three potential outcomes for each file.

1. The file is in the bloom filter. This is verified by just checking the file name, and as such a single directory read covers hundreds of such files, maybe even all of them. Very cheap, negligible for any larger node.
2. The file is not in the bloom filter, but it’s too young to be covered by it. This requires verifying the file’s metadata, which is one random read per file.
3. The file is not in the bloom filter, and it is actually old enough to be covered by it. This requires verifying the file’s metadata, then three additional writes: to remove the file from the blobs subdirectory, to add the file to the trash subdirectory, and to update the file’s metadata with a new mtime. That is three random writes.

(estimates for a case where per-file metadata do not fit in memory, but higher-level structures do, which is what I believe is a target to optimize for; besides, there’s probably additional I/O amortized over a large number of directory operations, like growing a directory htree).
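
The three outcomes and their rough I/O cost can be sketched as a small decision function. This is only an illustration of the cost model described above; the names `Cost` and `classify` are hypothetical and are not taken from the storagenode code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cost:
    reads: int   # random reads charged to this file
    writes: int  # random writes charged to this file

def classify(in_filter: bool, piece_mtime: float, filter_created: float) -> tuple[str, Cost]:
    """Return the GC decision for one file and its estimated random I/O.

    in_filter is decided from the file name alone, so outcome 1 is treated
    as free; the mtime is only needed (one metadata read) when the file is
    not in the filter.
    """
    if in_filter:
        return "keep", Cost(reads=0, writes=0)            # outcome 1
    if piece_mtime >= filter_created:
        return "keep-too-young", Cost(reads=1, writes=0)  # outcome 2
    # Outcome 3: remove from blobs dir, add to trash dir, update mtime.
    return "trash", Cost(reads=1, writes=3)

print(classify(True, 100.0, 200.0))
print(classify(False, 300.0, 200.0))
print(classify(False, 100.0, 200.0))
```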

Consider a node with 10M files, where 1% of them were removed (100k), and some number of added files, let’s say 200k. Case by case, assuming that 10% of the removed files are false positives:

1. 9.9M files that truly should stay + 10k files that are in the bloom filter, but were actually deleted (false positives) → negligible I/O.
2. 200k files → 200k random reads.
3. 100k - 10k = 90k files → 90k random reads, 270k random writes.

Total: 560k random I/O operations. At 120 IOPS this is 1.3 hours.

Same thing for the ratio I’m observing: 10M files, 10% removed (1M), again 200k added.

1. 9M + 100k = 9.1M files → negligible I/O.
2. 200k files → 200k random reads.
3. 1M - 100k = 900k files → 900k random reads, 2.7M random writes.

Total: 3.8M random I/O operations, or 8.8 hours.
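
The arithmetic for both scenarios can be reproduced with a short script. It sums the per-case terms exactly as listed: one read per added file, and one read plus three writes per trashed file. The 10% false positive share and the 120 IOPS figure are the assumptions used in the examples above, and the function name is made up for illustration.

```python
def gc_io_estimate(removed: int, added: int,
                   false_positive_share: float = 0.10,
                   iops: float = 120.0) -> tuple[int, float]:
    """Estimate random I/O operations and runtime in hours for one GC run."""
    false_positives = round(removed * false_positive_share)
    trashed = removed - false_positives  # files actually moved to trash
    reads = added + trashed              # one metadata read per such file
    writes = 3 * trashed                 # unlink, link into trash, mtime update
    total_ops = reads + writes
    hours = total_ops / iops / 3600
    return total_ops, hours

for removed in (100_000, 1_000_000):
    ops, hours = gc_io_estimate(removed=removed, added=200_000)
    print(f"{removed:>9,} removed -> {ops:,} random I/O ops, {hours:.1f} h")
```

Summing the terms this way gives 560k operations (about 1.3 hours) for the 1% case and 3.8M operations (about 8.8 hours) for the 10% case.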

Mitsos · April 4, 2024, 8:14pm

As far as I know moving doesn’t actually create a new file, it just changes the original file’s directory.

Toyoo · April 4, 2024, 8:54pm

Yep. This means removing the file from the original directory and adding it to the new directory. Both operations require a separate write.

Mitsos · April 4, 2024, 9:07pm

The file remains at its old position. The filesystem’s table entry that said “directory abc contains file xyz which is at sectors 10,000 - 12,000” now says “directory def contains file xyz which is at sectors 10,000 - 12,000”.

Toyoo · April 4, 2024, 9:18pm

At least in ext4 and NTFS, the contents of directories are stored in separate data structures for each directory. This means there’s no single data structure that maps all files to all directories. Instead there’s one data structure (usually an htree, which makes finding a single file within a directory by name fast) that lists files for directory foo, and one data structure that lists files for directory bar. Otherwise, if you wanted to list the files in a directory, you’d have to scan this whole table for all files stored in the whole file system, which would be inefficient.

I would like to find out how this is stored in ZFS, but I couldn’t find any good high-level description of its data structures.

When you move a file from foo to bar, you need to update both data structures, hence two writes.

Mitsos · April 4, 2024, 9:28pm

On ext4 (if memory serves right, I’m getting old and senile) data structures (technically metadata) are stored in inodes. These inodes point to the parent directory. An mv on a file within the same disk (filesystem) calls rename() to update this parent directory. The original file contents remain untouched. You are only updating the metadata of that file. You can test this by creating a 100GB file and moving it within a single filesystem (i.e. from /mnt/foo/bar to /mnt/foo/cat).

If you are moving files across filesystems (ie to a different disk) then yes, data (+metadata) needs to be copied over and then the original is removed.
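
A small experiment in the same spirit, as a sketch: it moves a file between two directories on the same filesystem and checks that the inode number is unchanged, which shows the file contents were not copied. It says nothing about how many directory blocks were written, which is the point under dispute. The directory and file names are made up, and it assumes a POSIX-style filesystem.

```python
import os
import tempfile

def rename_keeps_inode() -> bool:
    """Move a file between two directories and compare inode numbers."""
    with tempfile.TemporaryDirectory() as root:
        src_dir = os.path.join(root, "blobs")   # hypothetical names
        dst_dir = os.path.join(root, "trash")
        os.mkdir(src_dir)
        os.mkdir(dst_dir)
        src = os.path.join(src_dir, "piece.bin")
        dst = os.path.join(dst_dir, "piece.bin")
        with open(src, "wb") as f:
            f.write(b"x" * 4096)
        inode_before = os.stat(src).st_ino
        os.rename(src, dst)                      # same filesystem: no data copy
        inode_after = os.stat(dst).st_ino
        return inode_before == inode_after and not os.path.exists(src)

print("same inode after move:", rename_keeps_inode())
```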

Toyoo · April 4, 2024, 9:41pm

This is wrong. An inode can’t point to a single “parent” directory, because an inode can be in multiple directories at the same time (this is called a hard link). Instead, directories point to files.
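
The hard link case is easy to try out. The sketch below creates one file, links it into a second directory, and checks that both names resolve to the same inode with a link count of 2. File and directory names are made up, and it assumes a filesystem that supports hard links.

```python
import os
import tempfile

def hard_link_demo() -> tuple[bool, int]:
    """Link one file into two directories; return (same inode?, link count)."""
    with tempfile.TemporaryDirectory() as root:
        dir_a = os.path.join(root, "foo")
        dir_b = os.path.join(root, "bar")
        os.mkdir(dir_a)
        os.mkdir(dir_b)
        original = os.path.join(dir_a, "file.txt")
        alias = os.path.join(dir_b, "alias.txt")
        with open(original, "w") as f:
            f.write("hello")
        os.link(original, alias)  # second directory entry, same inode
        st_a = os.stat(original)
        st_b = os.stat(alias)
        return st_a.st_ino == st_b.st_ino, st_a.st_nlink

same_inode, link_count = hard_link_demo()
print("same inode:", same_inode, "link count:", link_count)
```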

This is the ext4 inode: 4. Dynamic Structures — The Linux Kernel documentation

This is the ext4 htree directory data structure: 4. Dynamic Structures — The Linux Kernel documentation

Mitsos · April 4, 2024, 9:48pm

The first four bytes of i_block are the inode number of the parent directory. Following that is a 56-byte space for an array of directory entries; see struct ext4_dir_entry. If there is a “system.data” attribute in the inode body, the EA value is an array of struct ext4_dir_entry as well. Note that for inline directories, the i_block and EA space are treated as separate dirent blocks; directory entries cannot span the two.

Edit: from your link