We use an in-memory cache for writes to the database; by default it syncs every hour.
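(For illustration only: a minimal Go sketch of such a write-behind cache with a periodic flush. All names here are invented; the actual storagenode implementation may look quite different.)

```go
package writecache

import (
	"sync"
	"time"
)

// Cache buffers database writes in memory and flushes them periodically.
type Cache struct {
	mu      sync.Mutex
	pending []string // hypothetical serialized records waiting to be written
}

// Add queues a record; it becomes durable only at the next flush.
func (c *Cache) Add(record string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.pending = append(c.pending, record)
}

// Run flushes the buffered records once per interval (e.g. hourly),
// ideally as a single database transaction per batch.
func (c *Cache) Run(interval time.Duration, flush func([]string)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		c.mu.Lock()
		batch := c.pending
		c.pending = nil
		c.mu.Unlock()
		if len(batch) > 0 {
			flush(batch)
		}
	}
}
```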
It's not about an in-memory cache by itself.
It's about the fact that their solution to the sqlite lock contention issue
produced the file-walker random IO issue.
It's basically impossible these days to run a node on a single HDD,
whether dedicated or shared with a low-IO process.
The solution of adding a file-system metadata cache in memory or on separate dedicated flash storage seems more like a workaround than anything else.
What I could extract from the forum is that a node requires around 1GB of fs-cache per 1TB of data to overcome the file-walker issue. Low-end hardware like a Raspberry Pi 3/4/5 already does not meet this requirement, and running the node as a background task on a server's spare/backup drives seems not feasible.
Then there is the issue of directory fragmentation (ext4 and maybe others).
This could be solved by restructuring the storage directory so that files are only added to the current-week directory, while files from previous weeks can only be deleted, never added.
But this would require a file-locator component based on a fast directory (a database).
If required, even the current hash-map directory layout could be used in parallel.
I know that simpler is better, but answering the question of the current inventory
with a full scan in random order over the drive makes little sense.
But that’s what the badger cache is all about, I thought? It serves the same purpose as a read cache like an LVM hotspot cache or L2ARC.
I mean: aren’t you solving a problem for which a solution is already in trial mode?
I even wrote this to him: the badger cache would resolve the file-walker random IO issue.
But not the directory fragmentation, or the stale/ghost data from partially uploaded chunks.
The badger cache would need to be used as a file-locator, but then it would not be a “cache” anymore. It would then be more of a “honey badger”.
The main idea is to drop new chunks only into the latest directory, for example: storage/2024-32/0/chunkId
This would require a lookup in a fast map/index via a file-locator.
Benefits of a file-locator in the “honey badger” (a rough sketch follows the list):
- only the latest directory would require a check on restart or after a node crash.
- at the start of a new week, the previous week's directory would become read- and delete-only, and could be garbage collected or optimized differently in the future
- multiple drives could be used per node, to extend a node or to slowly copy “weeks” over
- a flash+HDD mode: flash for new/hot data, HDDs for the archive
- most likely others …
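To make the idea concrete, here is a rough Go sketch of a week-bucketed layout plus a file-locator interface. Everything here (package name, function names, the one-character prefix) is invented for illustration; only the storage/2024-32/0/chunkId layout comes from the post above.

```go
package weekstore

import (
	"fmt"
	"path/filepath"
	"time"
)

// weekDir returns the week bucket for new pieces, e.g. "2024-32".
func weekDir(t time.Time) string {
	year, week := t.ISOWeek()
	return fmt.Sprintf("%d-%02d", year, week)
}

// Locator maps a piece ID to the week directory it was stored in. It could be
// backed by badger or any other fast key-value index.
type Locator interface {
	Put(pieceID, week string) error
	Get(pieceID string) (week string, err error)
}

// piecePath builds the on-disk path for a new piece:
// <root>/<year-week>/<prefix>/<pieceID>, mirroring storage/2024-32/0/chunkId.
func piecePath(root, pieceID string, now time.Time) string {
	return filepath.Join(root, weekDir(now), pieceID[:1], pieceID)
}

// storePiece records the piece location in the locator and returns the path
// where the piece data should be written.
func storePiece(loc Locator, root, pieceID string) (string, error) {
	now := time.Now()
	if err := loc.Put(pieceID, weekDir(now)); err != nil {
		return "", err
	}
	return piecePath(root, pieceID, now), nil
}
```

The point of the week prefix is that a restart check only has to walk the current week's directory; everything older is immutable apart from deletes.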
The filewalker updates the database only when it has finished the scan, so it usually should not lock the database while working.
We implemented a different cache with badger; it is used to cache metadata and speed up all filewalkers.
The badger cache should help low-power devices survive. However, it requires disabling the lazy mode (because badger is even more sensitive to multiple accesses from different processes).
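For readers unfamiliar with badger: below is a minimal sketch of what a badger-backed metadata cache can look like. The badger API calls are real (github.com/dgraph-io/badger/v4), but the surrounding package and method names are invented; this is not the actual storagenode code.

```go
package metacache

import (
	"encoding/binary"

	badger "github.com/dgraph-io/badger/v4"
)

// Cache stores per-piece metadata (here just the size) keyed by piece ID,
// so filewalkers can avoid a stat() per file on the slow HDD.
type Cache struct {
	db *badger.DB
}

func Open(dir string) (*Cache, error) {
	// Badger requires exclusive access to its directory, which is one reason
	// the lazy filewalker (a separate process) has to be disabled.
	db, err := badger.Open(badger.DefaultOptions(dir))
	if err != nil {
		return nil, err
	}
	return &Cache{db: db}, nil
}

func (c *Cache) Close() error { return c.db.Close() }

func (c *Cache) PutSize(pieceID []byte, size uint64) error {
	buf := make([]byte, 8)
	binary.BigEndian.PutUint64(buf, size)
	return c.db.Update(func(txn *badger.Txn) error {
		return txn.Set(pieceID, buf)
	})
}

func (c *Cache) GetSize(pieceID []byte) (uint64, error) {
	var size uint64
	err := c.db.View(func(txn *badger.Txn) error {
		item, err := txn.Get(pieceID)
		if err != nil {
			return err
		}
		return item.Value(func(val []byte) error {
			size = binary.BigEndian.Uint64(val)
			return nil
		})
	})
	return size, err
}
```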
and
I don't understand this problem, or rather: a journal (“event-sourcing”) would solve it.
In dotnet I could build this in a handful of days, but I am a noob at golang.
Maybe I will implement a tech-demo fork.
For now I will go with the workaround on a new RPi 5: zfs with a special metadata device at a 5GB/1TB ratio, on a partition of an NVMe.
Maybe you can just try writing high-level pseudocode? Might be enough to show the idea.
There is a limitation in the badger cache implementation, which requires exclusive access. This is also why it's an experimental feature.
I forgot to ask: did the disabled sqlite implementation create/open the db file with the WAL
feature enabled?
By default, sqlite creates/opens a db file in legacy, 2001-style mode with WAL
disabled. Writes will modify the file data in place and will block reads until the transaction finishes.
With WAL
enabled, regular writes will not block reads on the db file.
https://sqlite.org/wal.html
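For anyone who wants to check or switch the mode themselves, here is a small Go sketch, assuming the mattn/go-sqlite3 driver and a local db file; the node's own code may configure this differently:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3" // assumption: any sqlite driver would do
)

func main() {
	db, err := sql.Open("sqlite3", "bandwidth.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Switch the database to write-ahead logging so writers no longer block
	// readers. The pragma returns the resulting journal mode ("wal" on success).
	var mode string
	if err := db.QueryRow("PRAGMA journal_mode=WAL;").Scan(&mode); err != nil {
		log.Fatal(err)
	}
	fmt.Println("journal mode:", mode)
}
```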
This would cover more than half of my approach described above, and most likely this was the issue with the storj sqlite implementation.
The other optimizations/features that a journal offers would only be necessary in high-throughput scenarios (>50k op/s), where one file upload would generate around 2 + chunks
operations. I don't really think that even the big nodes see that kind of load, or do they?
For completeness, here are the features a journal would add beyond sqlite-WAL (a rough interface sketch follows the list):
- pass-through: events do not even get committed into the journal and are applied/processed instantly. This feature is nearly the same as a file upload keeping its state in memory and writing to the sqlite db only once at the end.
- differentiating pub+sync and pub+flush: this relates to the pub/sub and event-sourcing topic. The producer, in our case the file-upload process, knows best when it needs critical events/data to be persisted (flush) and processed at a later stage (can be minutes later), or when it needs to know that all published events have been fully processed (sync). As an example for sync, a file upload needs to respond to the uploader and guarantee that a follow-up read of the file by the uploader would succeed. As an example for flush, a delete-files operation would publish multiple events followed by a single flush.
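Here is a hedged sketch of what such a journal interface could look like in Go. All names are invented for illustration; it only encodes the pass-through / flush / sync semantics described above.

```go
package journal

// Event is a single state change, e.g. "chunk X uploaded" or "file Y deleted".
type Event struct {
	Kind    string
	Payload []byte
}

// Journal sketches the publish semantics discussed above.
type Journal interface {
	// Publish appends an event; durability is not yet guaranteed.
	Publish(e Event) error

	// Flush guarantees that everything published so far is persisted in the
	// journal; processing may still happen minutes later.
	Flush() error

	// Sync guarantees that everything published so far has been fully
	// processed, so a follow-up read by the client will succeed.
	Sync() error

	// Apply is the pass-through path: the event is processed immediately and
	// never committed to the journal (like keeping upload state in memory and
	// writing to sqlite only once at the end).
	Apply(e Event) error
}

// Example usage for the two cases described above:
//
//   upload:       Publish(chunkEvents...); Sync()   // uploader can read back immediately
//   delete-files: Publish(deleteEvents...); Flush() // durable now, processed later
```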
Why do you think so?
$ ls -l /mnt/x/storagenode2/storage/*.db*
-rw-r--r-- 1 root root 75304960 Oct 12 09:59 /mnt/x/storagenode2/storage/bandwidth.db
-rw-r--r-- 1 root root 32768 Oct 12 10:29 /mnt/x/storagenode2/storage/bandwidth.db-shm
-rw-r--r-- 1 root root 4152 Oct 12 10:29 /mnt/x/storagenode2/storage/bandwidth.db-wal
I can see the wal files.
You may also check the code.
The sqlite-based file catalog got removed from the source and was replaced by the file-walker
approach, with the reasoning that the sqlite db was bottlenecking the file uploads.
It is not removed, please check the code.
SQLite databases are still used.
However, we implemented a badger cache:
And the ultimate feature of
Please, use them!
Here’s me complaining (yet again!) without any single reason, just out of the blue. 4 months later and we (=baremetal) still don’t have a way to configure where badger is stored.
Yeees. But if you are so concerned, and you run the node without docker, you can still reroute it with symlinks.
I can help; please post a description of your environment.
Not that concerned, thank you.

-i 65536
Hi @Toyoo, do you think the -i 65536 argument on mke2fs is still a safe value today? I ask because you wrote in another thread (Recommended GB of RAM / TB stored? - #7 by Toyoo) that the average piece size has gone down a lot.
Looking at the numbers on my nodes I do think it should be good enough for the foreseeable future, but I would probably step down any new file systems to ~45000.
With hashstore on the horizon this number probably matters less now though.
The average piece size for all of my nodes is over 200K, so no problem.
Thanks @Toyoo and @alpharabbit
Merry Christmas everybody
With the startup piece scan now reporting the total pieces size and the number of pieces, it's easy to calculate the average piece size.
For my 6TB node, the average piece size is: 203845 bytes.
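For reference, the arithmetic behind the -i (bytes-per-inode) question, as a tiny Go sketch using the numbers quoted in this thread (65536 bytes per inode, ~203845 bytes average piece size):

```go
package main

import "fmt"

func main() {
	const bytesPerInode = 65536.0 // mke2fs -i 65536: one inode per 64 KiB of capacity
	avgPieceSize := 203845.0      // average piece size in bytes, as reported above

	// Inodes only run out if the average file becomes smaller than the
	// bytes-per-inode ratio. A headroom factor above 1 means the ratio is safe.
	headroom := avgPieceSize / bytesPerInode
	fmt.Printf("headroom factor: %.2f\n", headroom) // ~3.11 for these numbers

	// Example: a 6 TB node at this average piece size holds roughly this many pieces...
	capacity := 6e12
	fmt.Printf("approx pieces: %.0f\n", capacity/avgPieceSize) // ~29 million
	// ...while -i 65536 provides about 3x more inodes than that.
	fmt.Printf("approx inodes: %.0f\n", capacity/bytesPerInode) // ~92 million
}
```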