GC successful, bloomfilter not deleted, new bloomfilter not processed

While a (resumed) bloomfilter for 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs was being processed, a new bloomfilter for that satellite was received.

config.yaml
# how many concurrent retain requests can be processed at the same time.
# retain.concurrency: 5

Processing has finished:
2024-05-12T11:14:12Z INFO lazyfilewalker.gc-filewalker.subprocess gc-filewalker completed {"Process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"

But:

/config/retain
May  6 10:56 ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa-1714499999998475000
May  6 17:03 v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa-1714759199976870000
May 11 14:59 v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa-1715363999948993000

The bloomfilter for 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs dated the 6th has been processed successfully but has not been deleted from the retain folder.
There is no indication that the new bloomfilter dated the 11th is being processed. There is no log entry about a new GC process for that satellite and there is no new date folder in the trash for that bloomfilter:

ls /storage/trash/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/
2024-05-06  2024-05-07

AFAIK a new GC start would create a new date folder named with the start date, so I would expect to see a date folder for the 11th or later.


Do you also have a succeeded retain for this satellite and this BF?
Is the lazy mode on or off?
Do the permissions match the other data in the config folder?


Do you mean this?
2024-05-12T11:14:27Z INFO retain Moved pieces to trash during retain {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 980751, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 2589893, "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Duration": "121h34m16.533753733s", "Retain Status": "enabled"}

It looks finished to me, I don’t see an error.
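As a side note, the Duration field in that log line works out to just over five days, which helps explain how a new bloomfilter could arrive while the old one was still being processed. A quick check in plain Go (nothing storagenode-specific, just parsing the value quoted above):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Duration taken verbatim from the retain log line above.
	d, err := time.ParseDuration("121h34m16.533753733s")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%.2f days\n", d.Hours()/24) // prints roughly 5.07 days
}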

--storage2.piece-scan-on-startup=false \
--pieces.enable-lazy-filewalker=true \
/config/retain
-rw-r--r-- 1 root root 2445764 May  6 10:56 ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa-1714499999998475000
-rw-r--r-- 1 root root 1224954 May  6 17:03 v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa-1714759199976870000
-rw-r--r-- 1 root root 1167648 May 11 14:59 v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa-1715363999948993000
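As an aside, the numeric suffix on those retain file names looks like a Unix timestamp in nanoseconds. That is only my assumption (I have not checked the storagenode source), but if it holds, the three suffixes decode to roughly 2024-04-30, 2024-05-03 and 2024-05-10 (UTC), which fits two separate bloomfilters for the same satellite. A small Go sketch of that decoding:

package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// Assumption: the part after the last '-' in a retain cache file name is a
// Unix timestamp in nanoseconds. The real naming scheme may differ.
func decodeRetainName(name string) (time.Time, error) {
	i := strings.LastIndex(name, "-")
	if i < 0 {
		return time.Time{}, fmt.Errorf("no '-' in %q", name)
	}
	ns, err := strconv.ParseInt(name[i+1:], 10, 64)
	if err != nil {
		return time.Time{}, err
	}
	return time.Unix(0, ns).UTC(), nil
}

func main() {
	names := []string{
		"ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa-1714499999998475000",
		"v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa-1714759199976870000",
		"v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa-1715363999948993000",
	}
	for _, n := range names {
		t, err := decodeRetainName(n)
		if err != nil {
			fmt.Println(n, err)
			continue
		}
		fmt.Printf("%s...  %s\n", n[:12], t.Format(time.RFC3339))
	}
}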

The whole process seemed to be working, as I think I even saw a bloomfilter for the Saltlake satellite that was received, processed, and deleted. I see a date folder there for the 9th, so that one had no issue in being deleted after processing.

It appears as if the problem exists because there are 2 files for the same satellite: it looks like the node does not know which one to delete and which one to process now.

Does all other data have the same permissions (root:root)?
Do you use --user option in your docker run command?

Which data do you mean?

No.


Everything else in the /app/config folder inside the container: the storage subfolder and its databases, blobs, trash, etc.

Yes. I mean I have GC runs that completed successfully for Saltlake and AP-1.
US-1 is currently running, so I cannot tell the outcome yet.
But this satellite, where I have 2 bloomfilters, is the only one that shows this issue.

Looks similar to this one:

I don’t see especially high CPU usage though.

What’s the version of the storagenode?

It is version 1.102.3

Why not set retain.concurrency to 1 by default?

I believe the default value has changed. I am not sure in which version it was. The generated config file would still contain the old default value commented out.


It seems that this would only fix the CPU issue, which I cannot see on my hardware.

High CPU usage by node caused by receiving a new bloom filter while still processing the old one. Also new gc won't start after processing old bloom filter · Issue #6946 · storj/storj · GitHub

As a temporary workaround you can try reducing --retain.concurrency from the default 5 to something lower to reduce CPU load.

Even in version 1.104.1, the default value remains the same: 5. Release versions don’t have the fix yet.

I’m pretty sure you also have a CPU load problem (like in High CPU usage by nodes caused by receiving a new bloom filter while still processing the old one), just don’t expect a single node to load all cores on your CPU. In this case, the node would load one core to 100%. If you use Docker, please check the output of “sudo docker stats”; you’ll see the CPU load caused by every node on the server.

And I believe that setting retain.concurrency to 1 should fix both problems: the CPU load and the processing of the new bloomfilter.
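To illustrate what a limit like retain.concurrency conceptually does (this is only a generic bounded-worker sketch, not the actual storagenode code): with a limit of 1, a newly received bloomfilter simply waits in line instead of being processed in parallel with the one already running, so only one core is kept busy by this work.

package main

import (
	"fmt"
	"sync"
	"time"
)

// Generic bounded-worker sketch; NOT the storagenode implementation.
// `concurrency` plays the role of a setting like retain.concurrency.
func processRetain(requests []string, concurrency int) {
	sem := make(chan struct{}, concurrency) // at most `concurrency` jobs at once
	var wg sync.WaitGroup
	for _, r := range requests {
		wg.Add(1)
		sem <- struct{}{} // blocks when the limit is reached
		go func(req string) {
			defer wg.Done()
			defer func() { <-sem }()
			fmt.Println("processing bloomfilter", req)
			time.Sleep(100 * time.Millisecond) // stand-in for the real piece walk
		}(r)
	}
	wg.Wait()
}

func main() {
	// With concurrency = 1 the second bloomfilter is only started after the
	// first one has finished.
	processRetain([]string{"us1-received-2024-05-06", "us1-received-2024-05-11"}, 1)
}

In practice that would mean adding --retain.concurrency=1 to the run flags or uncommenting retain.concurrency in config.yaml, as shown earlier in this thread.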

Too late. The node had restarted.
But I can confirm this issue now again:

The node is still processing the bloomfilter from the 6th for US-1. It has now been interrupted and resumed for the second time, meaning it created a new folder with today's date as the name.
Still the same old bloomfilter.
So this data could be up to 3 weeks old when it finally gets deleted. If it had been collected in the initial date folder, it would get deleted with the very next trash chore run.
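Rough arithmetic, assuming trash is kept for 7 days counted from the date in the folder name (that retention period is my assumption; the exact trash chore rules may differ): a sketch that turns a trash date folder name into the earliest date its contents could be permanently removed.

package main

import (
	"fmt"
	"time"
)

// Assumption: trash is retained for 7 days after the date encoded in the
// folder name. The real trash chore may apply different rules.
const trashRetention = 7 * 24 * time.Hour

func earliestRemoval(folder string) (time.Time, error) {
	day, err := time.Parse("2006-01-02", folder)
	if err != nil {
		return time.Time{}, err
	}
	return day.Add(trashRetention), nil
}

func main() {
	// A folder created on a later restart/resume can only be removed
	// correspondingly later than the original date folder.
	for _, f := range []string{"2024-05-06", "2024-05-12"} {
		t, err := earliestRemoval(f)
		if err != nil {
			fmt.Println(f, err)
			continue
		}
		fmt.Printf("%s -> not removed before %s\n", f, t.Format("2006-01-02"))
	}
}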

That doesn’t matter in my opinion. As far as I know, the trash-filewalker wouldn’t start cleaning a directory until the garbage collector has finished filling it. So it’s not guaranteed in which case the trash will be stored longer: if the garbage collector uses a single directory for all trash collected by one bloomfilter, or if it creates a few directories with different dates on every restart. I would prefer that it create a new directory every day, because the gc-lazyfilewalker may run for 7 days or more on a single bloomfilter. I still have a few directories from April 22 (which are being cleaned right now by the trash filewalker) because the garbage collecting took a very long time on those nodes.

Ok. Then this could take a looooooong time until the data finally gets deleted from the node. And from the date name you could never be sure when this would be.

I don’t remember any code that would prevent both types of file walkers from running together. Too tired to check that now, sorry, but I think I would notice a guard like that…


On my nodes I’ve definitely observed gc-filewalker and trash-cleanup-filewalker running at the same time.


For the same satellite and the same date folder?
So a case where GC takes very long and the trash deletion kicks in: the GC would still be moving pieces into the date folder while trash collection tries to delete the pieces and the folder.
I don’t know how that would work.
