Yep, thanks for the help. We will update the forum when we are ready for testing.
First we need to generate the 10 MB files. The current generator uses ~100 GB of memory and takes more than 5 days to complete. We need to find out the right resource settings… so it takes some time.
It can run only on the satellite. Well, not exactly on the same machine; the point is that it has access to sensitive information and requires a trusted environment.
We can split the work over multiple servers and scale it that way.
You’ve nerdsniped me. Please admit that this was your goal.
Some sample code for computing bloom filters with a negligible amount of RAM and in a fraction of the time: GitHub - liori/bloom_filters_for_nodes. generator creates a sample dataset, and bloomfilter computes the bloom filters. For a dataset of us1.storj.io’s size you need ~16 TB of storage, but you probably know that.
Can’t guarantee it’s correct because I’m drunk. But I’ve run it on my i3-9100f and I estimate that I could generate bloom filters for us1.storj.io in 20 hours with a single CPU thread.
The lousy attempts at concurrency didn’t pay off for me, so just set thread_count to 1. An i3.16xlarge is not much faster; I/O is the bottleneck, but you probably know that. I guess I would have to get a machine like this one to actually take advantage of concurrency.
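To give a flavor of how little machinery this needs: the core is just one bit array per node, with k bits set per piece. A minimal sketch in C++, assuming classic double hashing over the raw piece ID bytes (the names and parameters here are illustrative, not necessarily what the satellite or my repo uses):

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// One filter per node: a bit array with k bits set per piece ID.
struct BloomFilter {
    std::vector<std::uint8_t> bits;
    int k;

    BloomFilter(std::size_t size_bytes, int hashes) : bits(size_bytes, 0), k(hashes) {}

    // Piece IDs are effectively uniform random bytes already, so two 8-byte
    // slices of the ID can serve directly as the pair of hashes needed for
    // double hashing; no extra hash function required.
    void add(const std::uint8_t piece_id[32]) {
        std::uint64_t h1, h2;
        std::memcpy(&h1, piece_id, 8);
        std::memcpy(&h2, piece_id + 8, 8);
        for (int i = 0; i < k; ++i) {
            std::uint64_t bit = (h1 + std::uint64_t(i) * h2) % (bits.size() * 8);
            bits[bit / 8] |= std::uint8_t(1) << (bit % 8);
        }
    }
};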
I was wondering why garbage collected more than a week ago was still in the trash, but then I remembered reading that the migration to the new trash structure would initially put all trash into a single folder stamped with the current date. The timing of the upgrade to v1.101.3 was a bit unfortunate: garbage that was about to be deleted got moved into the new structure and will now remain there for an additional 7 days. It’s a one-time issue, but a little unfortunate now that there is so much data in there. I figured I would post this in case someone else runs into it.
Updated my code with a more complex (better, but uglier) approach to concurrency. Tested it on AWS’s i3en.24xlarge and managed to run the computations for a simulated 25k nodes and 20400631382 pieces (approx. 10% of what us1.storj.io has now). Runtime was 21 minutes. Memory use stayed below 5 GB for the whole process, though the rest of the ~700 GB being used for caches and buffers likely helped. I/O peaked at ~7 GB/s, suggesting a single good consumer NVMe drive would do.
As such, I think <10 hours on this specific VM type for the full dataset at the current scale seems possible; possibly even quite a bit less than that on a more tailored setup. I don’t want to spend more money on AWS right now, and I do not have enough space myself to run a test on a full simulated dataset, so I can’t test the hypothesis.
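If you don’t feel like reading the ugly code, the parallel part of this kind of tool can be embarrassingly simple: each node’s filter depends only on that node’s pieces, so workers can claim whole per-node inputs from a shared counter and never contend on a filter. A rough sketch of that shape (not literally what the repo does; build_filter_for is a hypothetical stand-in for the real per-node work):

#include <atomic>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical per-node worker: read every piece ID in the file and add it
// to that node's bloom filter (definition not shown here).
void build_filter_for(const std::string& node_file);

// Workers claim per-node files from a shared atomic counter. Since no two
// threads ever touch the same node's filter, no locking is needed.
void process_all(const std::vector<std::string>& node_files, int thread_count) {
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < thread_count; ++t) {
        workers.emplace_back([&] {
            for (std::size_t i = next.fetch_add(1); i < node_files.size();
                 i = next.fetch_add(1)) {
                build_filter_for(node_files[i]);
            }
        });
    }
    for (auto& w : workers) w.join();
}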
That was a fun exercise! I admit the code is terrible (don’t drink and code!), but I hope it will be useful for you. Test setup, starting from a Debian image:
sudo -i
apt -y install cmake g++ git
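# Stripe the eight local NVMe drives into a single RAID0 device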
mdadm --create --verbose /dev/md0 --level=0 --raid-devices=8 /dev/nvme[1-8]n1
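# No journal, plus a few more options trading safety for throughput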
mkfs.ext4 -O ^has_journal,sparse_super2,^uninit_bg,^resize_inode,fast_commit -E lazy_itable_init=0 /dev/md0
mkdir mp
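# noatime and delayed allocation cut per-write overhead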
mount -o noatime,delalloc /dev/md0 mp
git clone https://github.com/liori/bloom_filters_for_nodes
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ../bloom_filters_for_nodes
make -j
cd ../mp
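# Raise the open-file limit before running the tools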
ulimit -n 1000000
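# Simulate 25000 nodes / 20400631382 pieces, then compute all the filters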
time ../build/generator 25000 20400631382
time ../build/bloomfilter *.dat
I only understand about 25% of what you’re talking about: but is bloom filter creation ever something that could be done in memory only? Like, 5+ year-old servers with 1.5 TB of RAM can look pretty affordable if they get things done fast. Or is the working data so large that you’re always going to need to sustain massive I/O?
Indeed, and right now I think it is in-memory. The fact that it takes 5 days must come from a different bottleneck.
Right now the bloom filters for all nodes probably take around 100 GB, at least that is how I interpret elek’s message. However, you still need to transfer probably ~15 TB of node ID-piece ID pairs to that server, and at 1 Gbps that is about 33 hours (15 TB × 8 bits/byte ÷ 1 Gbps ≈ 120,000 s). It’s faster if it’s close to the database itself, i.e. in the cloud; an example VM with 128 GB of RAM and “up to 10 Gbps” networking (x2gd.2xlarge) costs 0.33 USD/hour now.
Yep, after some code optimization to make the generation faster, we started generating 10 MB bloom filters. The EU1 filters were sent out last night, and the AP1 ones are being sent now… Big node operators receive them only on nodes >= v1.100.
The next round will be generating a 4 MB US1 filter, just to make sure that everybody receives at least one filter (even those who use older nodes).
After the 4 MB US1 filter, we will start generating 10 MB ones. The exact testing method (sending them out one by one, or in one batch?) is still under discussion.
ps: yes, we use one big machine for BF generation; yes, everything is in memory; yes, we added more resources
So if the oldest/largest nodes are likely to be holding the most forever-free data…
…and if the oldest/largest nodes are most likely to be holding surplus data because the old bloom filter size was inadequate to identify enough trash…
…then with the new 10MB BFs… and continued forever-free deletions… those oldest/largest nodes may see fairly large increases in their local unused disk space?
It sounds like it may not be a great time to be an old-timer SNO for the next couple of months. But then again, they’ve been paid for that forever-free data for years…
TBH, I don’t fully understand your comment; I think you misunderstand something (and sorry if I was not clear enough).
You can be an old SNO (like somebody who started 4 years ago) with a lot of pieces. Just be sure that your software (storagenode) is new enough (v1.101+).
This full thread is about supporting “old” SNOs (== storage nodes with a lot of pieces).
It is the other way around. Let’s say a big node is sitting on 20 TB of used space, but only 10 TB is paid by the satellite. The bigger bloom filters don’t change anything about the payout, but they will remove a lot of the unpaid garbage and free up space for more paid data.
Watching the problems these days with walkers and garbage collection, I wonder if Storj can really scale up to the exabyte range. The main bottleneck I see for now is the satellites: if generating bloom filters for 30 PB of customer data takes this many resources, I can’t imagine how it will work for 1000 PB.
I still consider Storj to be in a beta phase, and I really do hope that we will figure things out along the way, but I am still somewhat worried…