Disk usage discrepancy?

Each satellite is independent of the others, so the filewalkers working for them run independently too.

It’s possible, yes. The sizes of the pieces vary; your node stores only 1 piece out of 80 for a segment of data (64 MiB or less). If the segment is smaller than its metadata, it’s stored as an inline segment on the satellite itself, see Understanding Hierarchical Data Structure and Advanced Terminology - Storj Docs

It has data signed by your node, so it’s cryptographically proven.
However, we could have a bug in the generation of the bloom filter - it could cover less than the desired amount of garbage (e.g. less than 90%), or skip some pieces that are too small, or skip some nodes, etc.
I can see only these: Issues · storj/storj · GitHub

Is it possible that these 2 settings are creating overhead and this discrepancy?

--filestore.write-buffer-size 4MiB
--pieces.write-prealloc-size 4MiB

On nodes with these settings, the difference is 6%. On nodes with default settings, the difference is 2%.
The occupied space on disk reported by the OS matches the value on the dashboard, so there are no wrong values in the DB.

These parameters could affect memory usage, and if your disk subsystem is slow, they could also affect the disk’s response time.
You may simply remove these custom parameters and check.
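If the parameters were added to config.yaml rather than passed as run flags, the minimal fix is to comment them out (or delete the lines) and restart. A sketch - the defaults mentioned below are from memory and may differ per version, so confirm them with the help command quoted later in this thread:

    # config.yaml
    # filestore.write-buffer-size: 4MiB   <- remove or comment out to return to the default (around 128 KiB in recent versions, I believe)
    # pieces.write-prealloc-size: 4MiB    <- remove or comment out to return to the default (around 4 MiB in recent versions, I believe)
    storagenode setup --help | grep -E "write-buffer-size|write-prealloc-size"   # confirm the defaults for your version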

How many filewalker-type processes are there?

Of those, do they all run once per satellite? If not, which do and which don’t?

How frequently do they run?

How fast are they expected to complete before the disk is considered slow?

At least three:

  • gc-filewalker
  • lazyfilewalker
  • retain

Also there are scan chores (they are technically not filewalkers, but do scans too):

  • collector
  • piece:trash

Each of them performs only one task, but the results are used by the next filewalker: gc-filewalker collects the garbage (using the bloom filter received from the satellite), retain moves the garbage to the trash, collector removes the expired data, lazyfilewalker caches and sends information about used space to the satellites, and piece:trash deletes data older than 7 days.
You need to check that each of them has been started for each trusted satellite and then completed successfully, without errors.
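A minimal way to check that, assuming a Docker node named "storagenode" and default info-level logging (the exact log wording varies between versions, so these grep patterns are just a starting point):

    docker logs storagenode 2>&1 \
      | grep -E "gc-filewalker|lazyfilewalker|retain|collector|piece:trash" \
      | grep -iE "start|finish|complete|fail|error"

Each trusted satellite should show a start and a successful completion for the relevant process, with no errors in between.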

Each is configured differently. gc-filewalker runs at least once a week (it depends on the satellite - each one sends a bloom filter on its own cadence), lazyfilewalker runs on each start and then renews the cache every hour by default, and the collector runs every hour by default, see

storagenode setup --help | grep interval
      --collector.interval duration                              how frequently expired pieces are collected (default 1h0m0s)
      --storage2.cache-sync-interval duration                    how often the space used cache is synced to persistent storage (default 1h0m0s)

Some depend on the satellite (gc-filewalker), some depend on the last run (retain runs weekly), and some are hardcoded (piece:trash runs every 24h).

They should finish before the next restart. A restart could happen if your node gets an update (roughly every 2 weeks) or if it crashes because of a FATAL error.


I think that at least a significant part of the “missing” space (not counted as storage, but in fact occupied by files on disk) is due to the fact that the files of the 2 satellites decommissioned in the summer, US2 and Europe-North, still have not been deleted.

These satellites have not been operating for a long time now, but on all my nodes I can see that the folders in the “blobs” directory that once belonged to them are still present and still contain a large number of files and data. In my case, it’s about a hundred thousand files and about 0.5 TB. The folders are
6r2fgwqz3manwt4aogq343bfkh2n5vvg4ohqqgggrrunaaaaaaaa
and
arej6usf33ki2kukzd5v6xgry2tdr56g45pp3aao6llsaaaaaaaa

They are also present in the trash folder, although they contain no files there - just about 2 thousand empty subfolders (1024 each) that the trash-cleaning process, the garbage collector, and the lazy filewalker walk through every time, doing meaningless, unnecessary work.
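For reference, a quick way to see how much space those two leftover folders still occupy (a sketch; <storage-dir> is a placeholder for your actual data location):

    du -sh <storage-dir>/blobs/6r2fgwqz3manwt4aogq343bfkh2n5vvg4ohqqgggrrunaaaaaaaa
    du -sh <storage-dir>/blobs/arej6usf33ki2kukzd5v6xgry2tdr56g45pp3aao6llsaaaaaaaa
    # count the leftover empty subfolders still being walked in trash
    find <storage-dir>/trash/6r2fgwqz3manwt4aogq343bfkh2n5vvg4ohqqgggrrunaaaaaaaa -type d | wc -l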

When are you going to ACTUALLY delete them? The satellites will never give a command to remove them, because they are decommissioned. The garbage collector, as I understand it, will not touch these files either for the same reason. On one of the nodes, I deleted these folders manually and everything seems to be working well. But depending on manual user intervention is clearly the WRONG way to handle such garbage left over due to decisions made by the company.


P.S.
I don’t know, maybe this is already being discussed somewhere else (then I would be grateful for the link, and this part of the message could be moved there), but I would like to point out that the Storj garbage collector and the “lazy” filewalker are ridiculously slow.

On a large node (10+ million files) on Windows, one filewalker cycle through the node’s data folders takes 20-50 hours - just to count the occupied space. Complete stupidity!

At the same time, using Windows tools (the folder Properties dialog, for example), the same process takes only a few tens of minutes for a “cold” run and a few minutes for a “hot” one (when most of the metadata is already in the cache).
I got approximately the same speed (less than an hour for a “cold” start immediately after a reboot, and less than 10 minutes for a “warm” start) from a simple script I wrote that reads all the folders and metadata of the node’s files (names, sizes, creation and modification dates/times). Yet Storj, written by a group of (presumably) professional programmers, somehow does the same thing almost 100 times slower, with the disk constantly loaded at nearly 100%.

I even looked at the performance monitoring tools - why IS IT SO SLOW? Judging by the nature of the operations, the Storj filewalker for some reason reads all the metadata and folders strictly in 4 KB pieces (one cluster per read operation) and at the same time ignores (prohibits the use of) any system caching. Even if the metadata is already in RAM (the file system/OS cache), it still always reads directly from disk, and always only 4 KB per operation.
As a result, reading all the metadata produces millions of direct reads from disk, “bypassing” all caches. With a typical mechanical HDD speed of 100-200 operations per second (plus the load from normal node operation), this translates into dozens of hours of continuous work per filewalker pass.
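A back-of-the-envelope check of that estimate (assuming roughly one uncached 4 KB read per file and ~150 random reads per second on a mechanical HDD):

    # 10 million files at ~150 IOPS, ignoring the node's regular traffic on the same disk
    echo "scale=1; 10000000 / 150 / 3600" | bc    # ~18.5 hours per pass

which lands right in the reported 20-50 hour range once normal node load is added on top.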


When you perform this action:

You may try to switch it to false:

pieces.enable-lazy-filewalker: false

Save the config and restart the node.
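For Docker setups the same option can be passed as a run flag instead of editing config.yaml (a sketch; the “...” stands for your existing options):

    docker run -d ... storjlabs/storagenode:latest --pieces.enable-lazy-filewalker=false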


Wow, I’m impressed you actually took the time to look into that. Please check, as Alex said, whether disabling the lazy filewalker (pieces.enable-lazy-filewalker: false) makes any difference for the better. I’m super curious.

@Mad_Max
Nice, man; you are the first one to dig deep into this filewalker. You should post your findings in the “Tuning the file walker” thread, or a moderator could move this part of the conversation there.

I’m curious whether this applies to Linux/ext4 too, because it’s painfully slow there as well, though very dependent on RAM. On a 1 GB system it takes something like a day per TB; on systems with more RAM it takes about an hour per TB, but that’s still very slow compared to your findings.
We always accepted the official explanation - there are a lot of small files and going through them all takes time and resources - but no one dug deep to see why. Everybody just accepted it and tried their best to improve their systems by tuning the fs, adding more RAM, or adding caches of different types. Or… the official not-recommended way: just turn the FW off (see the sketch below). But that comes with potential problems, aka this thread “Disk usage discrepancy” and many others.
Storj devs should look into this ASAP, because bigger and bigger HDDs are coming and the ingress is increasing each month.
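For completeness, the “turn FW off” option mentioned above is, as far as I know, the startup piece scan switch - a sketch, assuming a reasonably recent node version (verify the exact key with storagenode setup --help):

    # config.yaml - skips the used-space scan on startup; the dashboard/DB numbers
    # will drift over time, which is exactly the discrepancy this thread is about
    storage2.piece-scan-on-startup: false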


Sidenote: PrimoCache is not bypassed, at least for part (~50%) of the read data stream.


Patches are welcome.

We know that on Linux it goes “fast” if the metadata fits in RAM, and “slow” otherwise, with the threshold depending on the settings of the file system used. I managed to optimize my nodes so that the filewalker takes around 8 minutes per terabyte of pieces in the “fast” scenario, and I found that good enough for myself. This has been discussed on the forum many, many times, with so far nobody taking up the challenge of writing reliable code to improve the situation. Maybe Mad_Max will take up this challenge?
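One rough way to check whether a Linux/ext4 node is in that “fast” scenario is to look at how much inode/dentry metadata the kernel is keeping cached, for example (needs root; the slab name shown is specific to ext4):

    sudo grep -E "^ext4_inode_cache|^dentry" /proc/slabinfo
    sudo slabtop -o | head -20    # same information, sorted and easier to read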

We already have a team of developers that writes the code and knows the ins and outs of the storagenode code: the Storj team. They should focus on node optimisations, because the nodes are the core of this entire business - well… were :slight_smile: until the datacenter tier came in, but still…


I looked into the problem myself. It would be somewhat easy if you could assume, let’s say, just a modern Linux with ext4, and not worry about reliability so much. But making it work on many operating systems (including those outdated kernels on NAS units!) and even more file systems, all while making reliability a priority, is actually a bit of a challenge. Right now a node can be recovered from some pretty bad hardware failures; we’ve seen it on the forum many times.

It annoys me a bit that the node code could indeed be made to perform probably 5× to 10× fewer I/O operations for downloads and uploads, and that file walkers could be replaced with single-file scans. But I’m willing to trade that annoyance for something that is already known to work quite reliably on 23k nodes, because I know that doing these optimizations as well as the current implementation works would take hundreds of man-hours of a skilled software engineer’s time, which nobody will pay for now. :person_shrugging:

Plus, it would have to be a person willing to work with Windows, because of course the worst-performing file system is the one used by the most vocal operators on the forum - the ones who complain the most when something doesn’t work :confused:

If the OS itself can read these files quickly with its subsystems, could those subsystems be used to replace the FW? I mean, the code for reading files and metadata already exists in each OS. Why reinvent the wheel?

I also wonder what happens when the node is full and the satellites have to stop sending data. If the sats have the “wrong” values of free disk space - “Average disk space used this month” is lower than “Total disk space used” by about 700 GB on all my nodes - will the sats keep sending data beyond the allocated space? In that case, we should keep a minimum of 1 TB unallocated for the node, just to be safe. I’ve seen people ignoring the 10% recommendation and going lower. Maybe watching your nodes weekly can give you a usable value of X GB or X TB of needed free space, because the 10% really starts to be too much for HDDs above 10 TB.

Exactly.

I see that currently on some nodes the filewalker refuses to finish successfully. This means that the trash values are totally wrong for those nodes, since trash gets cleared out frequently; however, the new numbers don’t get transmitted to the satellite, which means the node is considered full while it actually has plenty of space. I have nodes that report 1 TB of trash while a du shows only 50 GB.
The current filewalker system is complete garbage.
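For anyone who wants to reproduce that comparison, a sketch (with <storage-dir> again standing in for your data path): compare the dashboard’s trash figure against

    du -sh <storage-dir>/trash

and, per satellite, the subfolders under trash/.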


That topic prompted me to check what is taking up space in all those folders:

Files like that, dated 2020 (yes, it’s an old node), are in every blobs/XXX/YY folder.
Am I to assume that someone has kept this data for 3 years (a lot of data), or that it is junk that should not have been there for a long time already?

Filesystems have their own pros and cons, but from that point of view the node software just writes/reads files to disk. Now I’m reading in this thread that it does so in the worst possible way, not even using the cache?

Avoiding optimization to keep compatibility with a few percent (or less) of the hardware?

Not that I would ever run something that is supposed to run reliably on Windows,
but you write that instead of focusing on optimizing performance on the most popular consumer OS you spent time making the program run on some archaic NAS? :smiley:

In my opinion, terrible design. Whoever you paid for those “hundreds of man-hours of a skilled software engineer”, it was a waste :confused:

You have to keep in mind where they come from. In the early days, with small nodes, it did not matter whether the filewalker implementation was good or bad; even the worst possible implementation would have done the job.
But today we are facing nodes that are 8, 9, 12 or even more TB in size.
And suddenly (well, not really suddenly) we see that this implementation is complete garbage. They tried with the lazy filewalker, but it is still horrible.
When we reach the point where nodes get so big that they cannot complete the filewalker between regular node updates, then it will get interesting.
