Yep, thanks for the help. We will update the forum when we are ready for testing.
First we need to generate the 10 MB files. The current generator uses ~100 GB of memory and takes more than 5 days to complete. We need to find out the right resource settings… so it takes some time.
It can run only on the satellite. Well, not exactly on the same machine; the point is that it has access to sensitive information and requires a trusted environment.
We can split the work over multiple servers and scale it that way.
You’ve nerdsniped me. Please admit that this was your goal.
Some sample code for computing bloom filters with a negligible amount of RAM and in a fraction of the time: GitHub - liori/bloom_filters_for_nodes. generator creates a sample dataset, and bloomfilter computes the bloom filters. For a dataset of us1.storj.io’s size you need ~16 TB of storage, but you probably know that.
Can’t guarantee it’s correct because I’m drunk. But I’ve run it on my i3-9100f and I estimate that I could generate bloom filters for us1.storj.io in 20 hours with a single CPU thread.
The lousy attempts at concurrency didn’t pay off for me, so just set thread_count to 1. An i3.16xlarge is not much faster; I/O is the bottleneck, but you probably know that. I guess I would have to get a machine like this one to actually take advantage of concurrency.
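To give a flavor of how little machinery this needs: the core is just one bit array per node, with k bits set per piece. A minimal sketch in C++, assuming classic double hashing over the raw piece ID bytes (the names and parameters here are illustrative, not necessarily what the satellite or my repo uses):

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// One filter per node: a bit array with k bits set per piece ID.
struct BloomFilter {
    std::vector<std::uint8_t> bits;
    int k;

    BloomFilter(std::size_t size_bytes, int hashes) : bits(size_bytes, 0), k(hashes) {}

    // Piece IDs are effectively uniform random bytes already, so two 8-byte
    // slices of the ID can serve directly as the pair of hashes needed for
    // double hashing; no extra hash function required.
    void add(const std::uint8_t piece_id[32]) {
        std::uint64_t h1, h2;
        std::memcpy(&h1, piece_id, 8);
        std::memcpy(&h2, piece_id + 8, 8);
        for (int i = 0; i < k; ++i) {
            std::uint64_t bit = (h1 + std::uint64_t(i) * h2) % (bits.size() * 8);
            bits[bit / 8] |= std::uint8_t(1) << (bit % 8);
        }
    }
};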
I was wondering why garbage collected more than a week ago was still in the trash, but then I remembered reading that the migration to the new trash structure would initially put all trash into a single folder stamped with the current date. The timing of the upgrade to v1.101.3 was a bit unfortunate: garbage that was about to be deleted got moved into the new structure and will now remain there for an additional 7 days. It’s a one-time issue, but a little unfortunate now that there is so much data in there. I figured I would post this in case someone else runs into it.
Updated my code with a more complex (better, but uglier) approach to concurrency. Tested it on AWS’s i3en.24xlarge and managed to run the computations for a simulated 25k nodes and 20400631382 pieces (approx. 10% of what us1.storj.io has now). Runtime was 21 minutes. Memory use stayed below 5 GB for the whole process, though the rest of the ~700 GB being used for caches and buffers likely helped. I/O peaked at ~7 GB/s, suggesting a single good consumer NVMe drive would do.
As such, I think <10 hours on this specific VM type for the full dataset at the current scale seems possible; possibly even quite a bit less than that on a more tailored setup. I don’t want to spend more money on AWS right now, and I do not have enough space myself to run a test on a full simulated dataset, so I can’t test the hypothesis.
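If you don’t feel like reading the ugly code, the parallel part of this kind of tool can be embarrassingly simple: each node’s filter depends only on that node’s pieces, so workers can claim whole per-node inputs from a shared counter and never contend on a filter. A rough sketch of that shape (not literally what the repo does; build_filter_for is a hypothetical stand-in for the real per-node work):

#include <atomic>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical per-node worker: read every piece ID in the file and add it
// to that node's bloom filter (definition not shown here).
void build_filter_for(const std::string& node_file);

// Workers claim per-node files from a shared atomic counter. Since no two
// threads ever touch the same node's filter, no locking is needed.
void process_all(const std::vector<std::string>& node_files, int thread_count) {
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < thread_count; ++t) {
        workers.emplace_back([&] {
            for (std::size_t i = next.fetch_add(1); i < node_files.size();
                 i = next.fetch_add(1)) {
                build_filter_for(node_files[i]);
            }
        });
    }
    for (auto& w : workers) w.join();
}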
That was a fun exercise! I admit the code is terrible (don’t drink and code!), but I hope it will be useful for you. Test setup, starting from a Debian image:
sudo -i
apt -y install cmake g++ git
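# Stripe the eight local NVMe drives into a single RAID0 device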
mdadm --create --verbose /dev/md0 --level=0 --raid-devices=8 /dev/nvme[1-8]n1
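# No journal, plus a few more options trading safety for throughput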
mkfs.ext4 -O ^has_journal,sparse_super2,^uninit_bg,^resize_inode,fast_commit -E lazy_itable_init=0 /dev/md0
mkdir mp
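# noatime and delayed allocation cut per-write overhead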
mount -o noatime,delalloc /dev/md0 mp
git clone https://github.com/liori/bloom_filters_for_nodes
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ../bloom_filters_for_nodes
make -j
cd ../mp
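# Raise the open-file limit before running the tools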
ulimit -n 1000000
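# Simulate 25000 nodes / 20400631382 pieces, then compute all the filters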
time ../build/generator 25000 20400631382
time ../build/bloomfilter *.dat
I only understand about 25% of what you’re talking about: but is bloom filter creation ever something that could be done in memory only? Like, 5+ year-old servers with 1.5 TB of RAM can look pretty affordable if they get things done fast. Or is the working data so large that you’re always going to need to sustain massive I/O?
Indeed, and right now I think it is in-memory. The fact that it takes 5 days must come from a different bottleneck.
Right now the bloom filters for all nodes probably take around 100 GB, at least that is how I interpret elek’s message. However, you still need to transfer probably ~15 TB of node ID-piece ID pairs to that server, and at 1 Gbps that is about 33 hours (15 TB × 8 bits/byte ÷ 1 Gbps ≈ 120,000 s). It’s faster if it’s close to the database itself, i.e. in the cloud; an example VM with 128 GB of RAM and “up to 10 Gbps” networking (x2gd.2xlarge) costs 0.33 USD/hour now.
Yep, after some code optimization to make the generation faster, we started generating 10 MB bloom filters. The EU1 filters were sent out last night, and the AP1 ones are being sent now… Big node operators receive them only on nodes >= v1.100.
The next round will be generating a 4 MB US1 filter, just to make sure that everybody receives at least one filter (even those who use older nodes).
After the 4 MB US1 filter, we will start generating 10 MB ones. The exact testing method (sending them out one by one, or in one batch?) is still under discussion.
ps: yes, we use one big machine for BF generation; yes, everything is in memory; yes, we added more resources
So if the oldest/largest nodes are likely to be holding the most forever-free data…
…and if the oldest/largest nodes are most likely to be holding surplus data because the old bloom filter size was inadequate to identify enough trash…
…then with the new 10MB BFs… and continued forever-free deletions… those oldest/largest nodes may see fairly large increases in their local unused disk space?
It sounds like it may not be a great time to be an old-timer SNO for the next couple of months. But then again, they’ve been paid for that forever-free data for years…
TBH, I don’t fully understand your comment; I think you misunderstand something (and sorry if I was not clear enough).
You can be an old SNO (like somebody who started 4 years ago) with a lot of pieces. Just be sure that your software (storagenode) is new enough (v1.101+).
This full thread is about supporting “old” SNOs (== storage nodes with a lot of pieces).
It is the other way around. Let’s say a big node is sitting on 20 TB of used space, but only 10 TB is paid by the satellite. The bigger bloom filters don’t change anything about the payout, but they will remove a lot of the unpaid garbage and free up space for more paid data.
Watching the problems these days with walkers and garbage collection, I wonder if Storj can really scale up to the exabyte range. The main bottleneck I see for now is the satellites: if generating bloom filters for 30 PB of customer data takes this many resources, I can’t imagine how it will work for 1000 PB.
I still consider Storj to be in a beta phase, and I really do hope that we will figure things out along the way, but I am still somewhat worried…