Design draft: a low I/O piece storage

Hi, I’ve finally written down some of my thoughts on designing a low-I/O piece storage. It might be a little bit overengineered, but I wanted to start from the tightest solution and put it up for discussion, so that the potential trade-offs are clearly visible. I cannot put it into Gerrit :cry:, so I just made a pull request on GitHub here: docs/blueprints: low i/o piece storage by liori · Pull Request #6670 · storj/storj · GitHub

(I see that in the meantime blueprints were moved to a separate repo. Oh well…)

Tagging @thepaul.

10 Likes

Wow, really nice work!
I highly appreciate your efforts to make nodes more efficient and reduce the wear on hardware.

One question came to my mind while I was reading your proposal: how would you do the transition to the new approach? Since the way nodes work right now is totally different from the new approach, I guess it will not be possible to convert existing nodes to the new format?

Yeah, this is a bit of a problem. Probably roughly the same way pieces were migrated from v0 to v1, that is, maintain both at the same time, with new pieces going to the new design, and old ones retrieved from the old one. Maybe amended with some sort of a slow background copy happening…

1 Like

There are some parts that remind me of deferred writes, like PrimoCache does.
How is data loss prevented in case of a power blackout? (Sorry, too lazy to read it all, I’m no programmer.)
A lot of work though. I appreciate it.

That’s one part of the design where discussion is needed: what kind of data loss is acceptable as a trade-off for less I/O. Right now the document pretty much avoids any synchronized writes, so it depends on the OS/file system implementation when writes will hit a hard disk. I believe it would still be way within the acceptable 2% level of losses in case of problems, with the only operation that is more risky being the compaction procedure.
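To make it concrete, the hot path could look roughly like this (a minimal sketch, not the actual blueprint code; the names are made up):

```go
// Minimal sketch of the trade-off: the journal append never calls Sync(), so
// the record only reaches the OS page cache, and the kernel/filesystem decides
// when it actually hits the disk. A crash may lose the unsynced tail of the
// journal, but steady-state writes stay cheap and sequential.
package piecestore

import "os"

func appendToJournal(journal *os.File, record []byte) error {
	_, err := journal.Write(record) // buffered by the kernel; no fsync on the hot path
	return err
}
```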

IMHO? I can tell you one thing: it really is hard to read in that GitHub repo. One just clicks the link and sees small letters and a bad font. It would be better if you posted the full text here, for readability and easier discussion…

This is a PR, so you can comment there. You can also view it rendered: https://github.com/liori/storj/blob/67913aa65aadcd868dcc1e9ef5cbe029bfae295e/docs/blueprints/low-io-piece-storage.md

3 Likes

I’m trying to understand all of it (non-programmer), but nice work! I’m really eager to see some test results in the near future.

Easy. None. Maybe an option to turn sync writes on/off, for people with or without a UPS?

Aside from that, with stable power, a deferred-write window of 5 s or 6 MB would be OK.

1 Like

Does the use of pack files trade reduced filesystem-level overhead… for data-level fragmentation (within the pack files)? It sounds like data that customers may try to download may be sprayed like a mist through several pack files (intermingled with data from other customers): requiring near-random-IO to retrieve?

Today there may be filesystem overhead to jump to a customer’s discrete files, but once found, each one is a speedy linear read.

Interesting idea, and fun read!

For some reason each file used in the design reminds me of SQLite: it has a journal too and a temporary open file where all operations happen, and then they are merged into the DB file on clean shutdown.
And we all remember how fragile these files are on unclean shutdown (which happens very often), or with concurrent access (which in the design is the main way they are operated).

1 Like

Perhaps the most efficient storage was the LevelDB-based one in Storj v2 (Introducing KFS: a local file store inspired by Kademlia), but it had a hard limit on the max allocated size per node, and since it was a custom solution, the repair tools for LevelDB didn’t work well. It also used more memory to operate a node (exactly as you suggest), and we had many complaints about high memory usage.

1 Like

Oh, I remember that. It annoyed me enough to modify the node so it saved the data as files, even though I had not done anything with Node.js before. My modded node worked, though, so there’s that.

This proposal looks interesting, but with

If I got this right, a small node could lose 2% of data due to a single crash? Ouch. I also remember the compaction procedure, or something similar, from the v2 days. It would load the drives to 100% and never stop (though I later found out that the drives were SMR, but still, my file-based storage worked much better).

Now, an append-only file for storing the files and the file index looks OK; in a crash, all the old data should survive, hopefully. The compaction procedure looks dangerous, especially if the system crashes at exactly the wrong time. Also, I have no idea how all of this would work with multiple threads.

I mean, if there are two files being uploaded at the same time, how does the node decide the order in which to write them to the pack file? You can write multiple files to a filesystem at the same time, no problem, but trying to append two sets of data to a single file at the same time could result in problems.

This is great. It’s a really well put-together blueprint. We’ve discussed using journaling before but no one has fleshed the idea out with this much detail. I didn’t even know FALLOC_FL_COLLAPSE_RANGE existed; that certainly makes this approach more interesting, at least for the Linux side.

I expect I (or someone) will be looking at this more in the weeks to come. It might need to compete with a proof-of-concept experiment using BadgerDB.

No, I believe xe meant that in a reasonable situation data loss would be much less than 2% (2% being the level of data loss where node reputation begins to suffer sharp penalties).

This is the reason for keeping each uploaded piece in memory until it’s committed, at which point it gets written to the journal. Only a single thread is needed to write to the journal, since the full contents of each input will be immediately available.
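In other words, something along these lines (just a sketch of the idea; the struct and channel names here are made up):

```go
// Each piece is buffered fully in memory during upload; on commit it is handed
// to the single goroutine that appends to the journal, so no locking on the
// file itself is needed.
package piecestore

import "os"

type committedPiece struct {
	id   []byte // piece ID; header/encoding details omitted
	data []byte // full piece contents, already in memory at commit time
}

func journalWriter(journal *os.File, committed <-chan committedPiece) error {
	for piece := range committed {
		// A real implementation would also write a header with the piece ID and
		// length so the piece can be located again later.
		if _, err := journal.Write(piece.data); err != nil {
			return err
		}
	}
	return nil
}
```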

4 Likes

A single write thread would limit the throughput, because it would have to wait for the drive to respond and it would not have a queue depth higher than 1.

At least this is how uTorrent 2.2.1 operates (though it also reads using a single thread), and without SSD caching the speed is rather limited. I remember getting the same speed from a 4-drive RAID0 as from a single drive.

2 Likes

Each piece is stored as consecutive bytes within a single pack file. No piece is spread across multiple pack files.

This is a key-value database. A single-purpose one, but still a database. Databases have some common design patterns, so some degree of similar solutions should be expected.

File systems are also a form of databases, yet somehow we trust them with the current design.

LevelDB seems to provide transactions and other features not necessary for storage nodes. I don’t really feel like studying its design to evaluate it though, sorry, so I cannot compare the proposed design to LevelDB.

Yep, that’s the only part that has a significant risk. I suspect that it should still be possible to write a recovery tool that would scan a pack file looking for piece headers (they do have a pretty unique structure: they have to be a valid protobuf data structure with values that make sense for storage nodes) and heuristically recover pieces. And adding an “under compaction” bit to pack files would allow easy detection of potentially damaged ones.
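Roughly, such a tool could work like this (a hypothetical sketch; tryParsePieceHeader stands in for the actual protobuf header validation):

```go
// Hypothetical recovery scan: walk a possibly damaged pack file and record the
// offsets where something that looks like a valid piece header starts.
package recovery

func scanPackFile(data []byte) []int {
	var offsets []int
	for off := 0; off < len(data); off++ {
		if pieceLen, ok := tryParsePieceHeader(data[off:]); ok && pieceLen > 0 {
			offsets = append(offsets, off)
			off += pieceLen - 1 // jump past the recovered piece
		}
	}
	return offsets
}

// tryParsePieceHeader is a placeholder: it would try to decode a piece header
// at the start of buf and, if the protobuf fields make sense for a storage
// node, return the total length of the piece. Stubbed out here.
func tryParsePieceHeader(buf []byte) (pieceLen int, ok bool) {
	return 0, false
}
```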

Thank you!

Yep! Given that any reasonable node will store thousands of pack files, each at most ~256 MiB, loss of a single one in a rare crash event should be acceptable.

Yep! An alternative approach would allow multiple active pack files, but I don’t think this is necessary.

Given that the upload path does not have any synchronised writes, I expect the throughput of a sequence of uploads to be close to sequential write performance of HDDs, which is often >100 MB/s. Besides, this is a single write thread per satellite. Again, an alternative with multiple active pack files is also possible.

3 Likes

And one more option is to use pwrite and write multiple pieces to a single file at a time… again, I don’t think it’s worth it, as then lost races will fill up the pack files faster and result in more compaction.
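For completeness, that variant would look roughly like this (illustrative only, not part of the blueprint):

```go
// Illustrative only: concurrent uploads reserve disjoint ranges of a single
// pack file and write them with WriteAt (pwrite underneath), so no dedicated
// writer goroutine is needed. Pieces from lost races still leave dead bytes
// behind, which is why I don't think it's worth it.
package piecestore

import (
	"os"
	"sync/atomic"
)

type packWriter struct {
	f    *os.File
	next atomic.Int64 // next free offset in the pack file
}

func (w *packWriter) writePiece(data []byte) (offset int64, err error) {
	offset = w.next.Add(int64(len(data))) - int64(len(data)) // reserve the range atomically
	_, err = w.f.WriteAt(data, offset)                       // positional write, safe from multiple goroutines
	return offset, err
}
```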

File systems were extensively tested over a long time and most bugs hopefully fixed. This proposition does not replace a file system - it works on top of an existing file system.

It’s also additional IO, kind of like what we had with Storj v2. I do not know how to fix this, though. Without it, the files would gradually shrink as data is deleted, and at some point there may be one piece per file left, which is just a complicated version of the current method.
OTOH, copying a bunch of files increases IO and invalidates part of the cache, so it will probably make the node run a bit slower just after the compaction procedure.

The one advantage I see (for me) is that rsync would be faster if I ever wanted to move my node somewhere else.
As my node runs on ext4 on top of zfs (zvol), I doubt the preallocation would do anything.

Another thought - as I understand it, a pack file initially gets allocated, filled up to 256MB, and then another file gets allocated. In theory this should minimize fragmentation. Let’s say the compaction procedure manages to shrink 100 pack files. Those files will later be reused for writing, resulting in fragmentation.
Wouldn’t it be better if the compaction procedure combined multiple partially-full files into fewer full files and one partially-full one? Say, taking 100 files and compacting them into 50 full files and one partially full? OTOH, that would probably mean even more IO and risk, though hopefully fsync would not lie.

Another problem (not for me though) is the memory usage. Right now, my node with 25.59TB of data uses about 790MB of RAM; all the other memory is buffers/cache (the node VM has 15GB, the host has 192GB). IIRC there are people running nodes on Raspberry Pis with 2GB or so of memory. If it is assumed that the piece index file will be cached along with the filesystem metadata, it may require too much memory and not work correctly.

1 Like

Did Storj v2 use FALLOC_FL_COLLAPSE_RANGE?

FALLOC_FL_COLLAPSE_RANGE does not copy them.

No, because then you cannot use FALLOC_FL_COLLAPSE_RANGE.
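For reference, on Linux the collapse would be something like this (a sketch using golang.org/x/sys/unix, not the blueprint’s actual code):

```go
// Cut a dead piece's byte range out of a pack file in place: everything after
// off+length is shifted down and the file shrinks, without rewriting the rest
// of the file. Linux-only (e.g. ext4/XFS), and off/length must be multiples of
// the filesystem block size.
package piecestore

import (
	"os"

	"golang.org/x/sys/unix"
)

func collapseRange(path string, off, length int64) error {
	f, err := os.OpenFile(path, os.O_RDWR, 0)
	if err != nil {
		return err
	}
	defer f.Close()

	return unix.Fallocate(int(f.Fd()), unix.FALLOC_FL_COLLAPSE_RANGE, off, length)
}
```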

And with this proposal you should be able to operate this node with a VM of 4GB, maybe 5GB, because you will not need 14 GB of caches.

Oh, so there’s no backup if the operation fails or the system crashes? I somehow skipped over that. On one hand, it will definitely be faster with less IO; on the other hand, stuff sometimes crashes. I guess using this on zfs directly would be OK, since zfs would still copy stuff, probably, but my setup of ext4 on a zvol would not do that.

No reason for me to do that. What I meant was that this may result in worse performance for those who do not have a lot of RAM (2GB or whatever the minimum currently is). My host has 192GB, and most of that is used for cache (not just Storj).

With the new system, 4GB or a bit more will be taken up by the new metadata (file index etc.; the proposal says 4GB for 20TB), and there will still be a need to cache the filesystem metadata, because there will still be lots of files with data. Fewer than now, by a significant margin, but still.

The proposal looks cool though, maybe it will work better than what is now.

Hmm… I got an idea - it’s probably something you have considered, or maybe not (maybe it’s so stupid that you did not even think of it :))

Compaction is risky, at least now that I know there’s no backup while doing it. It also does not copy the data, as I understand now, so it won’t defragment or fragment a file. However, we want to reuse the files that have free space so we don’t end up with lots of small files after many pieces are deleted.
Since a file with a hole punched in it (when deleting a single piece) takes up only as much space as the real data inside it, why not just leave the holes and append new data to the end? I mean writing past the 256MB offset, as long as the real (allocated) file size does not exceed 256MB or whatever. Don’t run compaction at all, or only do it if the apparent file size goes over 1GB or whatever.
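Something like this is what I have in mind (just a sketch; the helper names are mine):

```go
// Punching a hole frees the blocks backing a deleted piece while keeping the
// apparent file size, so new pieces can keep being appended at the end. What
// actually counts against disk space is the allocated size (st.Blocks), not
// the apparent size from Stat().Size().
package piecestore

import (
	"os"
	"syscall"

	"golang.org/x/sys/unix"
)

func punchHole(f *os.File, off, length int64) error {
	return unix.Fallocate(int(f.Fd()),
		unix.FALLOC_FL_PUNCH_HOLE|unix.FALLOC_FL_KEEP_SIZE, off, length)
}

// allocatedBytes reports how much space a pack file really uses on disk.
func allocatedBytes(path string) (int64, error) {
	var st syscall.Stat_t
	if err := syscall.Stat(path, &st); err != nil {
		return 0, err
	}
	return st.Blocks * 512, nil // st.Blocks is counted in 512-byte units
}
```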