New Garbage Collection enabled on almost all satellites

With the deployment later today we are going to enable a new garbage collection process on SLC (satellite ID 1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE). The new garbage collection process uses ranged loops: instead of one worker iterating over all segments, multiple workers split the work into multiple ranges. Hopefully, the results of each worker get combined correctly and sent over to all storage nodes.
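For anyone curious what that splitting looks like, here is a minimal sketch of the ranged loop idea (illustrative only, not the actual satellite code; the names segment and processRange are made up for the example):

```go
package main

import (
	"fmt"
	"sync"
)

// segment stands in for a satellite segment; the real loop iterates the
// segments table, but plain ints are enough to show the idea.
type segment = int

// processRange is a stand-in worker: it only sees its own sub-range of the
// segment space and produces a partial result (here just a count; for
// garbage collection it would be per-node bloom filters).
func processRange(segments []segment) int {
	return len(segments)
}

func main() {
	all := make([]segment, 100) // pretend these are all segments
	const workers = 4

	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		total int
	)

	// Split the full range into sub-ranges, process them in parallel,
	// then combine the partial results into one final result.
	chunk := (len(all) + workers - 1) / workers
	for start := 0; start < len(all); start += chunk {
		end := start + chunk
		if end > len(all) {
			end = len(all)
		}
		wg.Add(1)
		go func(part []segment) {
			defer wg.Done()
			partial := processRange(part)
			mu.Lock()
			total += partial
			mu.Unlock()
		}(all[start:end])
	}
	wg.Wait()
	fmt.Println("combined result:", total) // 100, same as a single worker would produce
}
```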

Please keep an eye on the storage node logs and let us know if the new garbage collection process deletes any files that it shouldn’t delete. (I don’t expect that to be visible immediately. It usually takes a few days before the first audit or repair request errors out with a “piece not found” message.)

SLC is also running the new ranged loop for a bunch of other services: the repair checker, audit reservoir sampling, storage node tally, and graceful exit also use the ranged loop. If you notice any strange behavior in these areas, please also drop a message.

All other satellites are unaffected. We want to limit the risk to SLC for now.

Any questions while we are waiting for the first garbage collection?

9 Likes

Please shed more light on the “offset” and “remote address” entries in the log. What do they mean and how are they beneficial?

The remote address is the customer’s IP address (or the satellite audit worker / repair worker, or gateway-mt) that is uploading or downloading data from your storage node.

The offset shows which range the customer wants to download. Offset 0 means the customer is downloading the piece from the first byte. An offset of 1024 means the customer skips the first 1024 bytes and starts the stream from there. Video streaming is one use case that would do that.
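In code terms, serving an offset is just a seek before reading. A minimal sketch using the plain Go standard library (illustrative only, not the actual storage node code; readAtOffset and the sample data are made up):

```go
package main

import (
	"fmt"
	"io"
	"log"
	"strings"
)

// readAtOffset returns up to `length` bytes of the piece starting at
// `offset`, which is what a download with a non-zero offset boils down to.
func readAtOffset(piece io.ReadSeeker, offset, length int64) ([]byte, error) {
	if _, err := piece.Seek(offset, io.SeekStart); err != nil {
		return nil, err
	}
	buf := make([]byte, length)
	n, err := io.ReadFull(piece, buf)
	if err != nil && err != io.ErrUnexpectedEOF {
		return nil, err
	}
	return buf[:n], nil
}

func main() {
	// Stand-in for a piece on disk.
	piece := strings.NewReader("0123456789abcdefghij")

	// Offset 0: start of the piece. Offset 8: skip the first 8 bytes.
	for _, off := range []int64{0, 8} {
		data, err := readAtOffset(piece, off, 5)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("offset %d -> %q\n", off, data)
	}
}
```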

I don’t know if they are beneficial to the storage nodes. They will help in case we need to fix a bug: in that situation it is useful to understand whether all downloads are failing or just downloads with a specific offset or from a specific remote address. So this additional information should be beneficial to the developers that need to find out what is causing the bug.

1 Like

For me they all just point to 172.17.0.1, which is the docker gateway… so not really useful at all. Though I’m sure it’s different for non-docker setups. Anyway… This is kind of off topic.

Will keep an eye on logs! Thanks for the heads up!

Just wondering how we can check that?

Which is without a doubt personal data under the GDPR. So AFAIK logging and storing it requires consent or a legitimate interest.

4 Likes

Please file a GitHub ticket. I will make sure it gets addressed.

2 Likes

With some delay, we enabled the new garbage collection process on SLC yesterday.

4 Likes

Received the bloom filter about an hour ago but there was no garbage to delete on my nodes:

2023-03-02T16:25:37.944+0100    INFO    retain  Prepared to run a Retain request.       {"Process": "storagenode", "Created Before": "2023-01-22T22:43:04.901Z", "Filter Size": 239182, "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2023-03-02T16:25:38.251+0100    INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "num deleted": 0, "Retain Status": "enabled"}

First instance seen on 17th Feb:

2023-02-17T00:21:02.782Z        INFO    retain  Prepared to run a Retain request.       {"Process": "storagenode", "Created Before": "2023-01-22T22:43:04.901Z", "Filter Size": 534744, "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2023-02-17T00:22:19.732Z        INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "num deleted": 178, "Retain Status": "enabled"}
2023-02-18T14:48:39.975Z        INFO    retain  Prepared to run a Retain request.       {"Process": "storagenode", "Created Before": "2023-02-14T11:59:59.932Z", "Filter Size": 1189922, "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2023-02-18T16:01:51.170Z        INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "num deleted": 60456, "Retain Status": "enabled"}
2023-02-20T10:41:09.389Z        INFO    retain  Prepared to run a Retain request.       {"Process": "storagenode", "Created Before": "2023-02-15T11:59:59.997Z", "Filter Size": 674931, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2023-02-21T05:35:38.064Z        INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "num deleted": 1301969, "Retain Status": "enabled"}
2023-02-22T20:17:46.442Z        INFO    retain  Prepared to run a Retain request.       {"Process": "storagenode", "Created Before": "2023-02-19T05:11:40.904Z", "Filter Size": 11586, "Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo"}
2023-02-22T20:18:11.387Z        INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "num deleted": 194, "Retain Status": "enabled"}
2023-02-23T16:11:07.822Z        INFO    retain  Prepared to run a Retain request.       {"Process": "storagenode", "Created Before": "2023-01-22T22:43:04.901Z", "Filter Size": 534744, "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2023-02-23T16:12:27.723Z        INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "num deleted": 19, "Retain Status": "enabled"}
2023-02-27T12:25:12.335Z        INFO    retain  Prepared to run a Retain request.       {"Process": "storagenode", "Created Before": "2023-02-22T11:59:59.995Z", "Filter Size": 674931, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2023-02-28T03:56:45.477Z        INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "num deleted": 717214, "Retain Status": "enabled"}
2023-02-28T22:41:33.963Z        INFO    retain  Prepared to run a Retain request.       {"Process": "storagenode", "Created Before": "2023-01-22T22:43:04.901Z", "Filter Size": 534744, "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2023-02-28T22:48:49.290Z        INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "num deleted": 4, "Retain Status": "enabled"}
2023-03-02T05:03:22.483Z        INFO    retain  Prepared to run a Retain request.       {"Process": "storagenode", "Created Before": "2023-02-26T03:09:49.692Z", "Filter Size": 11586, "Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo"}
2023-03-02T05:12:26.264Z        INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "num deleted": 79, "Retain Status": "enabled"}
2023-03-02T16:01:27.548Z        INFO    retain  Prepared to run a Retain request.       {"Process": "storagenode", "Created Before": "2023-01-22T22:43:04.901Z", "Filter Size": 534744, "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2023-03-02T16:03:15.783Z        INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "num deleted": 0, "Retain Status": "enabled"}

Is it enabled on all satellites?

The new garbage collection process is only on SLC for now. In your logs you also had no garbage in the last execution. There was one more run a few days earlier, but I am not sure if that was already the new process or still the old one.

I think garbage mostly happens when there is incoming data, which I believe hasn’t really happened on that satellite for a while. Maybe it’s a good idea to rotate some test data?

It didn’t wipe the entire satellite, so that’s good. Instead of rotating any test data, we could simply enable it on the next satellite and see what happens. Basically, no garbage to delete is the kind of result we hoped for. I would be worried if it had deleted more garbage than usual.

Fair enough, but I think there aren’t really any test satellites that get ingress these days, right?

The team is now double-checking whether the timestamp inside the bloom filter, "Created Before": "2023-01-22T22:43:04.901Z", is correct.
This timestamp can come about in two different ways: the machine that creates the bloom filter might be working with an outdated DB snapshot, or there simply was no file upload in the meantime. Just to be safe, the team will rule out an outdated DB snapshot before we continue.

Up next would be EUN (12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB).
Most ranged loop services are already enabled on EUN. Today we plan to enable garbage collection on EUN and the audit reservoir sampling on AP1.

4 Likes

Could GC be more demanding on HDD IO now? In the last few days I have seen a few occasions where HDD IO increases a lot, then RAM usage goes up, then the OOM killer steps in. No big deletions though… Do you know of anything I can try to debug this?

Out of memory: Killed process 2118398 (storagenode) total-vm:3772392kB, anon-rss:1530248kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:6180kB oom_score_adj:0

I can confirm. But I don’t think it was only the new garbage collection, which seems to be enabled on only 2 satellites now.
Maybe this:

[screenshot: Screenshot_2023-03-08_093717]

There was/is some “massive” ingress happening. One of my nodes nearly collapsed. Maybe it was everything at the same time: Update, filewalker, deletions, GC and massive ingress. :scream:

No. We didn’t touch the storage node code. It will still execute the same steps after receiving the bloom filter. The only way this could cause additional stress is if the bloom filter were missing a lot of piece IDs, so that your storage node would start to delete half its disk space or so. Given the current results with 0 pieces deleted in our logs, I don’t think this is happening.
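For anyone curious what those “same steps” look like, here is a rough sketch of the node-side retain logic (illustrative only, not the actual storage node code; a plain map stands in for the bloom filter, which in reality can return false positives, so at worst some extra garbage is kept, never live data deleted):

```go
package main

import (
	"fmt"
	"time"
)

type pieceInfo struct {
	id      string
	created time.Time
}

// retain mimics the node-side step after a bloom filter arrives: every piece
// created before the cutoff that is NOT in the filter gets moved to trash.
func retain(pieces []pieceInfo, keep map[string]bool, createdBefore time.Time) (trashed int) {
	for _, p := range pieces {
		if p.created.Before(createdBefore) && !keep[p.id] {
			// real code: move the blob file into the trash directory
			trashed++
		}
	}
	return trashed
}

func main() {
	cutoff := time.Date(2023, 1, 22, 22, 43, 4, 0, time.UTC)
	pieces := []pieceInfo{
		{"piece-a", cutoff.Add(-48 * time.Hour)}, // old, still referenced
		{"piece-b", cutoff.Add(-24 * time.Hour)}, // old, garbage
		{"piece-c", cutoff.Add(24 * time.Hour)},  // too new, always kept
	}
	keep := map[string]bool{"piece-a": true}

	fmt.Println("num deleted:", retain(pieces, keep, cutoff)) // -> 1
}
```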

1 Like

So high memory usage indicates a slow disk subsystem: SMR, a network filesystem (NFS, SMB/CIFS, SSHFS, etc.), BTRFS, ZFS without tuning and/or a cache, any kind of RAID, a USB connection, or a dying disk.
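The mechanism behind that rule of thumb is back-pressure: data arrives faster than the disk can commit it, so the difference piles up in memory. A tiny sketch with made-up numbers (purely illustrative, no real I/O):

```go
package main

import "fmt"

func main() {
	// Hypothetical rates just to show the arithmetic behind the claim:
	// whatever the disk cannot absorb has to wait in RAM (write buffers,
	// goroutines holding piece data, OS caches) until ingress slows down
	// or the OOM killer steps in.
	const (
		ingressPerSec = int64(20 << 20) // assume 20 MiB/s arriving from the network
		diskPerSec    = int64(5 << 20)  // assume 5 MiB/s a struggling disk can write
	)

	var buffered int64
	for second := 1; second <= 60; second++ {
		buffered += ingressPerSec - diskPerSec
		if second%20 == 0 {
			fmt.Printf("after %2ds: ~%d MiB waiting in memory\n", second, buffered>>20)
		}
	}
}
```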