Tuning the filewalker

It’s worse… the space is reported by the storagenode.

Yes, that’s true, in some cases. The space-used filewalker output determines how much space the node considers available. The space-available value gets sent to the satellite, and the satellite uses that to decide whether to consider your node for new pieces. So it has an effect on the choice between “node has enough space for more pieces” and “node does not have enough space for more pieces”.

Yeah, you’ve got it. The space reported by the node is used in that one important way, but you could think of it as a boolean flag sent to the satellite: “please send me more data” versus “don’t send me more data”.

2 Likes

If anyone uses docker compose and doesn’t want to change the file every time:

command:
  - "--storage2.piece-scan-on-startup=${NODEX_WALKER:-false}"

It will be false by default every time you create the container, unless you set the NODEX_WALKER variable to true.
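For context, a minimal compose sketch around that command override (the image is the standard storjlabs/storagenode one; the service name, mounts and other options are placeholders to replace with your own setup):

# docker-compose.yml (sketch - keep your existing volumes, ports and environment settings)
services:
  storagenode:
    image: storjlabs/storagenode:latest
    command:
      # scan-on-startup stays off unless NODEX_WALKER=true is set in the shell or an .env file
      - "--storage2.piece-scan-on-startup=${NODEX_WALKER:-false}"

To trigger one scan, recreate the container with the variable set, e.g. NODEX_WALKER=true docker compose up -d --force-recreate storagenode; the next recreation without the variable falls back to false.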

Just for statistics…
I restarted 2 nodes in 2 machines (Synology DS220+, 18GB, Exos drives, ext4, noatime, logs on fatal, no ssd), same capacity, same space occupied, same setups; both machines are running 2 nodes, but only the test ones are restarted:

  • node11, sys1 - 13.1TiB from DSM, ~14.2TB used on Dboard, LAZY ON, FW runs for 57 hours.
  • node21, sys2 - 13.2TiB from DSM, ~14.5TB used on Dboard, LAZY OFF, FW runs for 39.5 hours.

In Lazy mode, the FW run takes more time (+50%), but the system is more responsive and you don’t lose online score.
In non-Lazy mode, the FW runs quicker, but the system is less responsive and I lost ~3% of online score, from 99.8 to 97.2.

To enable or disable Filewalker and Lazy mode, you can set these parameters in config.yaml:

pieces.enable-lazy-filewalker: true/false
storage2.piece-scan-on-startup: true/false
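If you run docker without editing config.yaml, the same two options can presumably be appended as flags after the image name in the run command; a rough sketch, with the mounts and environment left as a placeholder:

# append the flags after the image name in your usual docker run command
docker run -d --restart unless-stopped --stop-timeout 300 \
  <your usual mounts, ports, identity and -e options> \
  storjlabs/storagenode:latest \
  --pieces.enable-lazy-filewalker=false \
  --storage2.piece-scan-on-startup=true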
5 Likes

It might be interesting to know whether the filewalker also includes the databases in its scan, or tries to keep track of incoming files?

These filewalkers update the DBs after a successful scan, and they also run periodically to keep the databases updated.

1 Like

During the run, does the FW send info to the satellites? Does the node need uninterruptible internet access? Or, if the internet access is lost during the run or when the FW finishes, is all its progress useless, forcing a new run?

Maybe there is an answer somewhere, but after 3 years, I still don’t understand how all the pieces work together…
Is it like this?

  • satellites send bloom filters, aka lists of pieces that should be deleted, to nodes that have those pieces;
  • these lists are stored in piece_expiration.db;
  • the GC checks this db periodically and then sends the pieces to Trash;
  • after 7 days, GC deletes those pieces for good;
  • the FW reads some data from all the pieces on the drive and tells whom what?
  • why does GC need the FW to delete those pieces, if the node knows what it stored and the bloom filters tell GC what to delete?
  • if the node itself takes note of all the pieces that are stored, moved to trash and deleted, why is there a need for the Filewalker?

Maybe there should be a post dedicated to all the services and all the parts of the storagenode and satellites, how they interact and how they take care of pieces, because I am certain many of us are confused and have wrong impressions.

No.

File walker is not affected by being offline.

Yes.

No, a bloom filter is not a list of pieces that should be deleted. See my text on what a bloom filter is.

No, this database is used to store information on pieces where the uploader decided, at the time of upload, that they are to be deleted at a specific time. Most pieces do not have this information.

No, this database is checked by another chore.

No, this is a separate chore.

Not sure what you are asking about.

The node does not actually keep information on what it stores, and it doesn’t have information on what it is expected to store. All it knows is that there is a directory with files to serve. If you manually put a piece file there, it can be served as well (assuming all necessary cryptographic signatures match what the satellite/customer expects).

2 Likes

No. It works locally and, when it finishes, it updates the databases. Before that, your node reported what was in the databases.
The databases are a cache; they are not used by the FW, because if they are corrupted they will not have enough information about the pieces and their location. The node doesn’t know what it should store and what it shouldn’t; this information exists only on the satellites, and even there it is accounted not per node, but per segment.
The FW updates the databases when it finishes, to show the correct information on the dashboard and to send it to the satellites on check-in.

We did try to use databases in the past to keep information about pieces and orders. However, if those databases were corrupted, the node might not find a requested piece and would eventually be disqualified, even with all pieces in place. So we do not use databases to store crucial information about pieces or orders. The current implementation is the trade-off between durability and the speed of accounting for pieces.

1 Like

I was trying to explain to myself part of the used-space discrepancies, because in some locations the internet is not so stable; but now it makes sense, since the locations with stable internet show the same problem.

It has already been discovered:

  1. Failed FW because the disk is too slow (context canceled errors)
  2. Restarted FW because the disk is too slow (FATAL errors on read/write timeouts)
  3. Too many pieces on the node (the Bloom Filter is too small to cover all of them) - the fix is
3 Likes

More stats… I started with FW=ON, lazy=OFF, on a Synology DS220+, 18GB, with 2 nodes, ext4, (1) Exos 16TB - 12TB filled, (2) Exos 22TB - 1TB filled. Today, for a few hours, my internet was down, so no egress or ingress, and therefore no writes, only reads. As you can see below, when there are no WRITES, the READS increase drastically, x3, from 500 IOPS to 1500 IOPS, and throughput increases slightly.
So, if you want to increase the speed when moving a node to another disk, or if you want to accelerate the Filewalker run, you can put the ingress on hold - by reducing the allocated space below the occupied space, or by blocking the internet access :slight_smile: . It really makes a difference. … or use some sort of cache.
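A minimal sketch of the “allocated space below occupied space” trick in config.yaml; the 500 GB figure is just an example (anything below the currently occupied space stops ingress), and the key name is the one I recall from the default config.yaml - docker users can achieve the same with the STORAGE environment variable in the run command:

# config.yaml - temporarily allocate less than the node already stores, then restart the node
storage.allocated-disk-space: 500.00 GB

Once the filewalker (or the migration) is done, set it back to the real allocation and restart again.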

5 Likes

But being offline will likely remove more pieces from your node in the next GC run.

Yes, that’s the bad part. Maybe if you have a very big node, like 15-20TB, and you don’t want to keep the FW on all the time, you can use the allocated-space method to run it from time to time, say once every month or two:

  1. Remove the allocated-space and filewalker parameters from the run command.
  2. Modify config.yaml to reduce the allocated space, enable the FW and disable the Lazy FW.
  3. Stop and remove the container, then start it with the new run command (a rough command sketch is below this list).
  4. Let the FW finish.
  5. Modify config.yaml to set the correct allocated space and disable the FW.
  6. Restart the node.
    Now you don’t need to modify the run command anymore, or remove the container. You just modify the config and restart the node with the recommended command from the docs, with -t 300.
    I think I’ll do this myself. I’ll just wait to see how moving the DBs to a USB stick helps.
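A rough sketch of steps 3 and 6 as docker commands, assuming the container is named storagenode and following the usual stop/remove/run pattern from the docs (mounts and options left as a placeholder):

# step 3: stop and remove the old container, then run it again without the old parameters
docker stop -t 300 storagenode
docker rm storagenode
docker run -d --restart unless-stopped --stop-timeout 300 \
  <your usual mounts, ports, identity and -e options> \
  storjlabs/storagenode:latest

# step 6 (and any later config.yaml change): a simple restart is enough
docker restart -t 300 storagenode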

Moving the DBs is a separate task, not related to the filewalker; they still need the FW to finish its job first.
The improvement is on its way.

These are all my test results regarding the Filewalker. We are now on version 1.96.6, and the tests were done with older versions.
The test machines are dedicated to Storj only: a Synology DS218+ with 1GB RAM and Synology DS220+ units with 10GB and 18GB RAM, ext4, noatime, no RAID, no cache, no SSD; all the storagenode files and the OS are on the nodes’ drives. Log level is set to “fatal”.
The 1GB system has 2x IronWolf 8TB; the rest have Exos 16TB (big nodes) and 22TB (new nodes of 700GB).
Write cache is enabled. All drives are formatted as 512e, except the new 22TB ones; I learned too late about fast-formatting to 4Kn. The nodes were not full at test time, so they had ingress and egress.

FW run time increases somewhat exponentially with occupied space, not linearly; so if you have, let’s say, 1h/5TB, you won’t get 3h/15TB, but way more.
Anything that reduces the I/O on the drive benefits the FW run.
CPU and internet access/speed are not important.
The biggest influence on FW run time comes from RAM and any sort of cache for metadata, at least on Linux, because it uses all available memory for buffers and cache.
Next is ingress. If you cut the ingress by reducing the allocated space below the occupied space, the FW run goes much faster; it reduces writes and increases reads.
I didn’t measure the effect of the log level, but of course the info level will impact the FW run negatively.
The smallest influence among the main factors comes from moving the DBs to another medium, like an SSD or USB stick. It’s not like the others listed above, but it helps, and of course you get rid of DB lock errors.
Apart from these factors, the lazy FW takes approx. 50% more time than the normal FW.
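For reference, the log level mentioned above is set via the log.level option in config.yaml (or the matching flag); a minimal sketch:

# config.yaml - only fatal messages are written, keeping log I/O on the data drive to a minimum
log.level: fatal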

RESULTS (in the order I did them, in a span of 1 year, on different nodes):
A. RAM test, lazy off:
18GB RAM - 9.52TB - 7.5h, 0.8h/TB - 1 node
10GB RAM - 8.85TB - 29h, 3.3h/TB - 1 node
1GB RAM - 4.65TB + 4.03TB - 58h, 6.7h/TB for both nodes

B. LAZY mode test - node full (14.5TB, so no ingress)
18GB RAM - 13.1TiB, lazy ON - 57 hours, 4.35h/TB - 2 nodes running
18GB RAM - 13.2TiB, lazy OFF - 39.5 hours, 3h/TB (3% online score lost ?) - 2 nodes running

C. Testing one node of 11.4TB, with a second new node running in parallel:
LAZY OFF, DB-es on HDD:
Total run time - 43h, 3.77h/TB (6h no ingress, internet down, IOPS Reads x3)

LAZY OFF, DB-es on HDD, no ingress (Reads IOPS x3 than normal):
Total run time - 29h 40min, 2.6h/TB

LAZY OFF, DB-es on USB 3 (Samsung Bar Plus 128GB):
Total run time - 43h, 3.77h/TB

During FW run:
USB Read peak speed 190KB/s, IOPS 80
USB Write peak speed 700KB/s, IOPS 52
Utilization 11% max

After FW run:
USB Read peak speed 150KB/s, IOPS 34
USB Write peak speed 1024KB/s, IOPS 74

1 Like

The USB speeds refer to the USB stick that holds only the databases, from both nodes.
The USB DB run took the same time as the HDD DB run, but note that during the HDD run the internet was down for 6h and there was no ingress; without that, the HDD DB run would have taken much longer. Maybe someone can do the math.

This is weird; there’s plenty of memory to cache metadata in this case. I would expect maybe up to 20 mins/TB with default ext4 settings. Do you mind posting dumpe2fs -h for this file system (UUIDs and time stamps are not needed, you can censor them)?

I’m not a Linux guy. Should I run this exactly as you posted it? With sudo?

Sorry, I don’t know how you would need to run it on a Synology box. It would indeed need to run as root (so with sudo), and you need to add your block device path as an additional parameter to the command (so, something like dumpe2fs -h /dev/your-device). The output will then resemble something like this:

dumpe2fs 1.47.0 (5-Feb-2023)
Filesystem volume name:   root
Last mounted on:          /
[… around 60 lines total …]
Journal start:            196593
Journal checksum type:    crc32c
Journal checksum:         0xe588f439