Design draft: a low I/O piece storage

This may not be the actual case with NTFS…
Also, no matter what storage you use or how you use it, fragmentation will happen after running storagenode for some time.
A bundled defragmentation tool would solve some problems. Edit: never mind, probably not needed for a long time with 256 MB files.
(If you consider a manually started defragmentation run that takes weeks to be a problem.)
Good luck with the various file systems.
Also, defragmentation hints should be in the docs, if they aren’t there already.

Nope. We are trading off this risk (with a potential way to heuristically recover contents) for speed. A good thing is that FALLOC_FL_COLLAPSE_RANGE is very fast, so the risky time window is short.
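
For illustration, a minimal Go sketch (Linux-only, using golang.org/x/sys/unix) of the two fallocate operations being discussed: punching a hole over a dead piece, and collapsing a block-aligned range during compaction. The file name and offsets are made up, this is not code from the proposal.

package main

import (
    "os"

    "golang.org/x/sys/unix"
)

// punchHole deallocates a dead piece’s byte range without changing file size.
func punchHole(f *os.File, off, length int64) error {
    return unix.Fallocate(int(f.Fd()),
        unix.FALLOC_FL_PUNCH_HOLE|unix.FALLOC_FL_KEEP_SIZE, off, length)
}

// collapseRange removes the range entirely, shifting later data down;
// off and length must be multiples of the file system block size.
func collapseRange(f *os.File, off, length int64) error {
    return unix.Fallocate(int(f.Fd()), unix.FALLOC_FL_COLLAPSE_RANGE, off, length)
}

func main() {
    f, err := os.OpenFile("pack-000001.dat", os.O_RDWR, 0)
    if err != nil {
        panic(err)
    }
    defer f.Close()

    // Drop a hypothetical dead piece occupying 1 MiB at offset 4 MiB...
    if err := punchHole(f, 4<<20, 1<<20); err != nil {
        panic(err)
    }
    // ...and later, during compaction, collapse that block-aligned range.
    if err := collapseRange(f, 4<<20, 1<<20); err != nil {
        panic(err)
    }
}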

I was thinking about this in the context of btrfs, but yeah, copy-on-write file systems would be indeed at big advantage here with an implementation based on ioctl_ficlonerange. That would indeed be amazing!
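
To sketch what the copy-on-write variant could look like: instead of collapsing in place, compaction could reflink the surviving ranges into a fresh pack file with FICLONERANGE (btrfs, XFS). The helper, file names and offsets below are my own illustration, not part of the proposal.

package main

import (
    "os"

    "golang.org/x/sys/unix"
)

// cloneRange reflinks length bytes from src at srcOff into dst at dstOff,
// sharing data blocks instead of copying them (block-aligned ranges only).
func cloneRange(dst, src *os.File, dstOff, srcOff, length uint64) error {
    return unix.IoctlFileCloneRange(int(dst.Fd()), &unix.FileCloneRange{
        Src_fd:      int64(src.Fd()),
        Src_offset:  srcOff,
        Src_length:  length,
        Dest_offset: dstOff,
    })
}

func main() {
    src, err := os.Open("pack-000001.dat")
    if err != nil {
        panic(err)
    }
    defer src.Close()
    dst, err := os.OpenFile("pack-000002.dat", os.O_RDWR|os.O_CREATE, 0o644)
    if err != nil {
        panic(err)
    }
    defer dst.Close()

    // Clone two hypothetical surviving pieces back-to-back into the new file.
    if err := cloneRange(dst, src, 0, 4<<20, 1<<20); err != nil {
        panic(err)
    }
    if err := cloneRange(dst, src, 1<<20, 64<<20, 2<<20); err != nil {
        panic(err)
    }
}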

Metadata for the file system would become insignificant. For 20 TB worth of pieces with the current average segment size this means 56M files. For the current design with ext4 this means 56M inodes+direntries worth 18 GB (at 316 bytes per file, assuming perfect placement on disk), so impossible to cache on a 2GB raspberry. For the proposed design this means 75k to 150k packfiles, so inodes+direntries take <50 MB. 150k is quite a bit less than 56M, almost three orders of magnitude, and quite cacheable.

The piece index file will indeed not fit in memory, so it will have to be read from disk for piece access. But it’s going to be a single read, because it’s a hash table. A single two-letter directory with the current design for 20 TB worth of pieces will contain an average of 55k files. At 60 bytes per direntry, the directory’s htree alone will take 3.5 MB just for leaf nodes (and likely more unless extremely well balanced), and thus potentially require multiple seeks to find the piece file (even if some of the tree’s levels are cached).
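
For anyone who wants to double-check the back-of-envelope numbers above, here is the arithmetic spelled out in Go; the constants simply mirror the assumptions in the text, nothing here is measured.

package main

import "fmt"

func main() {
    const (
        pieces      = 56_000_000    // ~20 TB at the current average piece size
        packFiles   = 150_000       // upper end of the proposed design
        perFileMeta = 316           // bytes of inode+direntry per file (ext4, ideal placement)
        perDirent   = 60            // bytes per htree leaf entry
        filesPerDir = pieces / 1024 // two-letter directories
    )
    fmt.Printf("metadata, one file per piece: %.1f GB\n", float64(pieces*perFileMeta)/1e9)
    fmt.Printf("metadata, pack files:         %.0f MB\n", float64(packFiles*perFileMeta)/1e6)
    fmt.Printf("htree leaves per directory:   %.1f MB\n", float64(filesPerDir*perDirent)/1e6)
}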

One unknown factor is extent trees for pack files. I assume they will be less complex than htrees: a single extent tree leaf may cover multiple pieces, and extents do not need to encode file names, so they are smaller (12 bytes as opposed to 60, five times less). So extent trees should still be smaller than htrees. They should be cacheable if a single extent covers at least 3 pieces on average, but I don’t know whether this will be the case. Curiously, defragmentation would actually help here.

So even for raspberries with 2GB the situation should at least not worsen, maybe even improve a bit.

You would need more than 19 bits for the offset then, making the piece index data structure bigger, but yeah, adding a few bits to put the address space limit at 1GB to reduce frequency of compaction sounds like a nice trade-off, especially for file systems that can’t do FALLOC_FL_COLLAPSE_RANGE and would need a full copy. Yeah, I like this idea!
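
A quick check of those field widths in Go, assuming the 512-byte offset granularity from the proposal:

package main

import (
    "fmt"
    "math/bits"
)

func main() {
    // Offsets stored in 512-byte units:
    fmt.Println(bits.Len(256<<20/512 - 1)) // 256 MiB address space -> 19 bits
    fmt.Println(bits.Len(1<<30/512 - 1))   // 1 GiB address space   -> 21 bits
}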

Some limit must be there to keep the piece index compact, though, even if that limit were some very large number. And I think it should be small enough that if someone ever has to work on those files (for recovery purposes, say) with software that is not aware of holes, it would still be possible.

Fragmentation is a curious thing here. Pack files will technically become highly fragmented due to hole punching and FALLOC_FL_COLLAPSE_RANGE, but this fragmentation will mostly occur at the boundaries between pieces, not within pieces. Pieces are written sequentially, so it is easy for the file system to place the data of many pieces written one after another in a contiguous set of data blocks. But if a single piece in the middle of a file is later collapsed, the file system will report increased fragmentation for the pack file. What matters here, though, is only the fragmentation within the part of the pack file that represents a single piece, as we only ever read single pieces. And the data blocks allocated to a given piece won’t spontaneously change, so a single piece will stay unfragmented.

Running defragmentation tools on pack files will mostly be snake oil, worth the effort maybe only in the rare cases where the size of the extent tree described above actually matters, i.e. in very low RAM scenarios.

Finally found the time to read about LevelDB. Yeah, they acknowledge that write amplification might be significant, apparently even over 50×. The proposed design, even with a full file rewrite instead of FALLOC_FL_COLLAPSE_RANGE, would keep write amplification limited to 1×, and with @Pentium100’s suggestion it could go even lower.

I think these holes of zeros are not really a problem.

Wow, what a nice writeup!
Seems like that would really make low power nodes more realistic.

Did I get it right, that this only applies to ext4 and there would need to be a different approach for CoW filesystems?

No, it applies to all file systems capable of hole punching.

I just skimmed twice (in a positive sense) through your paper. Am I correct that you are suggesting that storagenode should rely mostly on a “new metadata database” instead of on the underlying filesystem? A kind of architecture similar to Lustre FS metadata servers (MDS) and metadata targets (MDT), but on a micro scale?

No. It could be a combination of both. We tried the DB approach - it failed, because SNOs are able to destroy any DB… The FS alone is not enough either (too slow on weird configs like BTRFS/ZFS single without any cache, or NTFS under Linux, or when using SMR)… So, I guess this would end up with a combination - DB and FS.

@Alexey, with full respect, I guess it’s best to address the question of what he had in mind directly to him. Again, I only skimmed through the paper. My current understanding is that @Toyoo is proposing mostly two things. The first is reducing the redundancy and the dependency on the underlying filesystem semantics which, as I currently understand, are heavily utilized by the storagenode (especially by the filewalker). And the second is a custom way to store metadata. I am running my nodes on Lustre (I am not administering it) and I do see some analogies between the proposal and Lustre, as well as, to be honest, to some extent also Ceph. :- )

Actually - no. These functions are not available on all OSes, only on some, and with restrictions (which would have to be considered for every single binary…)

… solved by many other DBs, including LevelDB…

yeah. we moved some customers from it to Storj…

… and them - too.

I am not familiar with Lustre, sorry. The proposal is to move some responsibilities of managing piece storage to the storage node code, specifically to a purpose-designed database of sorts.

It is natural for off-the-shelf solutions (LevelDB, SQLite) to be the first choice: it’s the easiest way to build a prototype, the easiest to maintain, and hey, maybe they are good enough for production. But Storj Inc. had to move to a more tailored solution because LevelDB turned out not to be good enough for production.

The file system-based approach is already a move towards bespoke solutions: the current design takes a lower-level off-the-shelf solution (the file system) that does not provide all the primitive operations necessary for operating a piece store, and builds a custom piece storage layer on top of it. For example, the fact that a storage node doesn’t simply keep a single directory per satellite with millions of files, but instead uses a two-level directory structure, or the existence of the trash and temp directories, are elements of that layer. This solution turned out to be good enough for setups optimized specifically for Storj, or setups with a lot of spare resources.
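
To make the “custom layer on top of the file system” point concrete, here is a simplified Go illustration of such a two-level layout. The directory names, encoding and extension are placeholders, the real storagenode layout differs in its specifics; this only shows the idea.

package main

import (
    "fmt"
    "path/filepath"
)

// piecePath illustrates the two-level layout: a two-character prefix directory
// keeps any single directory from holding millions of files.
func piecePath(root, satelliteID, pieceID string) string {
    return filepath.Join(root, "blobs", satelliteID, pieceID[:2], pieceID[2:]+".sj1")
}

func main() {
    // Hypothetical identifiers, for illustration only.
    fmt.Println(piecePath("/srv/node", "satellite1", "aqpiece1234567890"))
    // -> /srv/node/blobs/satellite1/aq/piece1234567890.sj1
}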

This proposal moves it a bit further: we are still using a file system, but not to manage single pieces. The file system still manages very low-level things like allocating disk sectors to data, or making sequential writes/reads fast, but not much more.

The end game would be using and managing raw block storage (like Ceph’s BlueStore) :stuck_out_tongue:, but this would make it impossible to host nodes on many consumer-oriented setups, so I find it rather unlikely.

That’s why, after familiarizing myself (again, I admit, very briefly) with your research paper, I mentioned those two filesystems (Lustre and Ceph), as some analogies seem rather clearly visible. :slight_smile:

Eh, I wrote some research papers in the past, this is not it. I’d have to spend at least 10× as much time on it to get it to the level of a draft of a research paper.

Looks to me like more than a draft. I have to think about it for a moment. Maybe I’ll come up with some additional questions or comments, you never know.

By reimplementing a subset of file system features in a way that is optimized for storage node use, it should be possible to reduce the amount of I/O operations performed during routine tasks like uploads and downloads 5- to 10-fold

I think your design would greatly reduce the need for seeks while creating new pieces on storage nodes since the goal is to be doing sequential writes to a preallocated file. But when retrieving pieces from a storage node, all I/O is (I think) going to be essentially random. You may be able to eliminate a seek because the directory and inode data is stored together, but IMO the read performance will not be that much different than it is now. I also might not know what I’m talking about. :slight_smile:

Inodes store metadata information, such as user permissions, date of modification, file names, etc.

Inodes don’t store file names; that’s in the directory. Inodes can store extended attributes.

For example with the current average segment size of 10.2 MB:

We will limit the size of a pack file by putting a restriction that no piece header can start after or on the offset of 256 MiB. With the current average segment size this means approximately 726 pieces per pack file.

I found it confusing to reference average segment sizes in the document instead of average piece sizes. I’m not that familiar with the technical details, but my understanding is that a 640MB user file is split into 10 64MB segments, then each segment is erasure-coded into many more smaller pieces. To me, the average segment size seems irrelevant; it’s only the average piece size that matters at the storage node level.

I don’t see much value in padding pieces to 512 bytes, other than it lets you save 9 bits in the piece index offset field. For that small saving, you are losing an average of 256 bytes per piece. There is a later comment about rounding up to file system logical block size for hole collapsing, but I think this is the file system allocation block size, which is usually 4K, not 512.

Append offset is the offset of the first byte after the piece’s data with the highest offset (logically last).

Is that just the pack file’s size as reported by the filesystem?

An active pack file is the file currently selected for new uploads. There is only one active pack file at a time

I’d use The instead of An, since there can be only 1.

Each time a pack file is activated, the area starting from the append offset and ending at 256 MiB is preallocated to reduce initial fragmentation and potentially speed up future writes on some setups.

It would seem to be a useful goal to preallocate with the hope of reducing fragmentation. However, doing hole punching and collapsing are counter to that goal and will increase fragmentation. Eventually it will be impossible (IMO) for the filesystem to give you a contiguous 256MB unless you are also running FS defragmentation.

The downside of not using hole punching and collapsing is that space is not made available when pieces are deleted: space is only made available when a pack file is compacted into a new pack file. HashBackup (I’m the author) has to do a similar thing, with the added wrinkle that the file to be compacted might be remote, so it has to be downloaded first, packed, then uploaded. Packing will also combine multiple pack files (called “arc” files in HashBackup). To make packing efficient, there are lots of pack config options:

pack-age-days 30
pack-bytes-free 1MB
pack-combine-min 1MB
pack-download-limit 950MB
pack-percent-free 50

As a real-world data point, my build server backup has 10M blocks (pieces) stored in 668 arc (pack) files with an average block (piece) size of 4.6K, and the storage efficiency is 86%. The average block size on this server is very small because it is mostly VM images, and those have to use a 4K backup block size for dedup to work best.

As we will reuse pack files after compaction, it should be possible to keep a significant majority of pack files at least half full. This means that for 20 TB worth of pieces, we should end up with around 75k to 150k pack files.

I’d strongly suggest you not reuse pack IDs. I have had a few very nasty bugs related to this. If you don’t hole punch/collapse and always use new pack IDs, you can access pieces that are being compacted: they will still be in the old pack file until you delete it after finishing the pack and committing the piece index.

HashBackup’s solution to this very similar problem is to use an SQLite database for the metadata and flat files for the data, recording the block offsets in the SQLite db. I know people here don’t care for SQLite, but HashBackup has been using it with thousands of backups for 13 years without problems. In the few times there have been database issues, it has been tracked down to customers copying databases with file system tools like cp and not taking the corresponding journal file into account.

Good luck!

A memory-starved node saves at least two seeks: direntry and inode. Those two seeks total are probably around 10ms on average, 20ms pessimistic case.

Yeah, in this section I probably should have written “Inodes/direntries”.

Segment size is reported in the official statistics, so I referred to that. But yeah, whether I quote segment size or piece size doesn’t change the substance here, so I could change it.

Yep. I wanted to make the piece index file really as small as technically possible. Same trick with the granularity of timestamps, the maximum ID for pack files, the maximum offset of a piece, or the maximum length of a piece. Seeing what is possible in the extreme case, we can now think about reasonable trade-offs. We could make the timestamp fields 5 bytes to keep proper unix timestamps, and the pack file IDs, offsets and lengths of pieces 4 bytes each, but that’s an additional 12 bytes per piece, increasing the amount of RAM needed to cache the piece index file by 25%.
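
To make the size trade-off concrete, here is one hypothetical bit-packing of a piece index entry in Go. The field widths (24-bit pack file ID, 19-bit offset and 13-bit length, both in 512-byte units, 8 spare bits) are my own illustration of the technique, not the proposal’s exact layout.

package main

import "fmt"

type packedEntry uint64

// pack squeezes pack file ID, offset and length (in 512-byte units) into 64 bits.
func pack(packID, off512, len512 uint32) packedEntry {
    return packedEntry(uint64(packID&0xFFFFFF)<<40 |
        uint64(off512&0x7FFFF)<<21 |
        uint64(len512&0x1FFF)<<8)
}

// unpack reverses pack, returning the pack file ID, offset and length fields.
func (e packedEntry) unpack() (packID, off, length uint32) {
    return uint32(e >> 40 & 0xFFFFFF),
        uint32(e >> 21 & 0x7FFFF),
        uint32(e >> 8 & 0x1FFF)
}

func main() {
    // A piece at byte offset 100 MiB, 352 KiB long, in pack file 42.
    e := pack(42, (100<<20)/512, (352<<10)/512)
    id, off, l := e.unpack()
    fmt.Println(id, off*512, l*512) // 42 104857600 360448
}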

Though, with file collapsing/hole punching we’d still be losing some disk space on uneven cuts.

And a side effect of nice round offsets is easier heuristic recovery in case we lose both the journal file and the piece index file, something I hadn’t even thought of while writing the proposal, and only mentioned here in the thread.

I assumed preallocation of disk space, so no. Though, even without preallocation there’s a possible case where due to an unclean shutdown there’s half of a piece at the end of the pack file…

:+1:

Fully contiguous, sure, that might be a problem at some point. But even if the free space comes in chunks of a few megabytes, that should be enough to avoid paying the seek penalty too often. In the end, it all depends on just how much free space the node operator keeps. ZFS suggests:

  • Keep pool space under 80% utilization to maintain pool performance. Currently, pool performance can degrade when a pool is very full and file systems are updated frequently, such as on a busy mail server. Full pools might cause a performance penalty, but no other issues. If the primary workload is immutable files (write once, never remove), then you can keep a pool in the 95-96% utilization range. Keep in mind that even with mostly static content in the 95-96% range, write, read, and resilvering performance might suffer.

Storage nodes are recommended to keep 10% of free space, which is a similar recommendation. We’d probably need to run some simulations to estimate the effect here — something I didn’t do because it’s not a research paper (@s-t-o-r-j-user :wink: )

There’s also an option of not using file collapsing, and doing full rewrites. With @Pentium100’s suggestion this becomes quite enticing as an acceptable trade-off, as this approach would also reduce file system fragmentation.

Pack files are reused in the physical sense, but their IDs change during compaction.

Thank you a lot for the review, and for the data on packing!

But on a memory-starved system, even if the piece index file is mmap’d you may end up doing a seek to get the piece info. Then you’re only saving 1 seek instead of 2. And it’s not like every directory/inode access is going to be a seek because the FS will be caching these too.

I have personally not had good luck with mmap, on any OS. The OS doesn’t treat it like a high-priority cache and considers it just a file, so if another application has a large malloc’d data structure, kernels tend to want to keep that in memory and throw out mmap’d blocks.

Detailed Storj stats has data about pieces via the data.json link. For us1, storage_remote_bytes * 2.7 (expansion factor) / storage_total_pieces is 590663 (when I checked), giving the average piece size. storage_total_pieces is 78x larger than storage_remote_segments, so the difference seems like a big deal to me.
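
For reference, the arithmetic above as a small Go program run against a locally saved data.json. It assumes the values are exposed under exactly the field names quoted, which may not match the real file’s structure.

package main

import (
    "encoding/json"
    "fmt"
    "os"
)

type stats struct {
    StorageRemoteBytes    float64 `json:"storage_remote_bytes"`
    StorageTotalPieces    float64 `json:"storage_total_pieces"`
    StorageRemoteSegments float64 `json:"storage_remote_segments"`
}

func main() {
    raw, err := os.ReadFile("data.json")
    if err != nil {
        panic(err)
    }
    var s stats
    if err := json.Unmarshal(raw, &s); err != nil {
        panic(err)
    }
    // Expansion factor of 2.7, as used in the post above.
    fmt.Printf("average piece size: %.0f bytes\n", s.StorageRemoteBytes*2.7/s.StorageTotalPieces)
    fmt.Printf("pieces per segment: %.0f\n", s.StorageTotalPieces/s.StorageRemoteSegments)
}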

I guess I don’t get that: you’ve done a collapse on all the deleted pieces, so all you have left is a few undeleted pieces that are guaranteed to be fragmented in the filesystem. What’s the advantage of reusing this old, fragmented pack file vs deleting it and starting with a new one? The preallocation on the old one is already messed up, right?

Yes, but filewalks are greatly improved if they do not need to scan all pieces and can scan packfiles instead. May I suggest not filling the drive buffer all at once? Maybe 196 MiB or 128 MiB packfiles, to leave some resources for reads and walks?

That depends on just how memory-starved you are, which is why the proposal considers systems with 150 MB to 1 GB of RAM per 1 TB of pieces stored. My notes refer to this specific range. In this range the piece index file will likely stay in cache.

If you have in mind systems below 150 MB per 1 TB of pieces stored, then the proposal still improves the situation, though indeed the improvement is smaller.

I do have some good experience with mmap. There are probably a lot of factors here, but I did see cases where mmapping made things work 10-20% faster. Besides, mmap here is mostly about ease of implementation as opposed to speed: explicitly managing reads and writes in the presence of concurrent access would be more difficult.
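
For completeness, a minimal Go sketch of what mmapping the piece index read-only could look like on Linux via golang.org/x/sys/unix. The file name is illustrative, and concurrency, growth and writes are ignored.

package main

import (
    "fmt"
    "os"

    "golang.org/x/sys/unix"
)

func main() {
    f, err := os.Open("piece-index.dat")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    fi, err := f.Stat()
    if err != nil {
        panic(err)
    }
    data, err := unix.Mmap(int(f.Fd()), 0, int(fi.Size()),
        unix.PROT_READ, unix.MAP_SHARED)
    if err != nil {
        panic(err)
    }
    defer unix.Munmap(data)

    // A lookup is then just slicing into data at the hashed bucket offset;
    // pages are faulted in, cached or evicted by the kernel as it sees fit.
    fmt.Println("mapped", len(data), "bytes")
}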

If you never reuse the pack file, it will shrink and shrink until it keeps a very small number of pieces. If the same happens to many pack files, you’re back to a huge number of direntries and inodes.

Besides, pieces will not get more fragmented because of punching holes or file collapsing. The pack file as a whole—sure, but we never need to read more than one piece from a pack file at a time. So the only impact I expect is bigger extent trees, which does have an impact on speed, but my intuition here is that the impact will be small. Just intuition, not a measured observation, sorry—I’d have to prepare a full experiment to measure that, and I can’t really devote more of my hobby time to it.

A fragmented pack file doesn’t affect reads because they’re always random, but if you reuse a fragmented pack file, it will affect writes. And it seems like the more “cycles” on a pack file, the more fragmented it will get over time. This also applies to Pentium’s idea of writing past 256MB.

You could consider pre-allocating all of the pack files as part of disk initialization. Then presumably they will be contiguous and stay that way if you don’t do hole punching and collapsing. But that also might cause writes to become read-modify-writes when that isn’t needed. The only way I have read about to avoid RMW is to use direct I/O (O_DIRECT). However, that requires writes to be allocation-block sized and aligned, so you’d have to pad pieces to 4K instead of 512 bytes. Since the average piece size is around 500K, maybe it’s acceptable to lose an average of 2K per piece. You’d have to bundle the header and piece together and pad that to 4K; it wouldn’t make sense to pad each to 4K. The good thing about direct I/O is that it avoids the FS buffer cache, which would seem to be a good thing for both piece reads and writes, assuming piece I/O has no temporal locality, i.e., caching doesn’t help.
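
A small Go sketch of the alignment arithmetic and the direct I/O open this would imply on Linux. The sizes and file name are hypothetical, and a real implementation would also need a memory-aligned write buffer, which is not shown.

package main

import (
    "fmt"
    "os"

    "golang.org/x/sys/unix"
)

const blockSize = 4096

// roundUp pads n to the next multiple of the allocation block size.
func roundUp(n int64) int64 {
    return (n + blockSize - 1) &^ (blockSize - 1)
}

func main() {
    headerLen, pieceLen := int64(512), int64(362_000) // hypothetical sizes
    padded := roundUp(headerLen + pieceLen)
    fmt.Printf("bundle %d bytes -> write %d bytes (%d wasted)\n",
        headerLen+pieceLen, padded, padded-headerLen-pieceLen)

    // O_DIRECT bypasses the page cache; writes must be block-sized and aligned.
    f, err := os.OpenFile("pack-000001.dat", os.O_WRONLY|os.O_CREATE|unix.O_DIRECT, 0o644)
    if err != nil {
        panic(err)
    }
    defer f.Close()
}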

I still believe this should not matter. Let me anchor the discussion on an example—abstract, though I hope not too much. Apologies if I’m repeating myself too much. Besides, this is a good exercise for me in learning technical writing.

Let’s go over a life cycle of a single pack file, going by the version that is currently under review, without taking into account any of the changes suggested in this thread. When a new pack file is created, it is preallocated to 256 MB. One character will represent 16 MB. Let’s assume the node is still pretty fresh and there’s plenty of disk space free, so we manage to actually make a fully contiguous allocation:

Hard disk                 .......ABCDEFGHIJKLMNOP...........................
(physical sectors)

Pack file:                ABCDEFGHIJKLMNOP
(logical address space)

Pieces are written to the pack file one by one, so at some point the pack file becomes full. The node then switches writes to other pack files. At some point, the trash chore runs and we can punch holes:

Hard disk                 .......A.CD..GHIJKLMNO............................
(physical sectors)

Pack file:                A_CD__GHIJKLMNO_
(logical address space)

As the pack file is still bigger than 128 MB, it’s not compacted yet. Dots represent unallocated sectors, so the punched space is immediately available for new files. Several GC/trash cycles later, we have hole-punched enough of the file to trigger compaction. Before the compaction:

Hard disk                 .......A..D..G...KLMNO............................
(physical sectors)

Pack file:                A__D__G___KLMNO_
(logical address space)

We compact using FALLOC_FL_COLLAPSE_RANGE and get the following:

Hard disk                 .......A..D..G...KLMNO............................
(physical sectors)

Pack file:                ADGKLMNO
(logical address space)

And then we preallocate a new section of 128 MB at the end of the pack file:

Hard disk                 .......A..D..G...KLMNO........QRSTUVWX............
(physical sectors)

Pack file:                ADGKLMNOQRSTUVWX
(logical address space)

The pack file is fragmented: it now consists of 5 fragments. However, as the pieces stored in the old part of the file (“ADGKLMNO”) were written without fragmentation, and their sectors were not physically moved on the hard disk, each piece separately is still not fragmented. At the same time, new pieces written to the “QRSTUVWX” part of the pack file will not be internally fragmented either, as the disk space was preallocated, so the file system has upfront information to find a contiguous span of free space large enough to accommodate the whole new allocation. Therefore the problem of fragmentation does not affect us as long as there are still 128 MB-long sequences of free sectors.
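
A Go sketch of that compact-then-preallocate step on Linux, building on the earlier fallocate example; the offsets are illustrative, the collapsed range must be block-aligned, and a mode-0 fallocate extends the file and reserves (ideally contiguous) blocks for the new 128 MB tail.

package main

import (
    "os"

    "golang.org/x/sys/unix"
)

const preallocSize = 128 << 20 // 128 MiB tail, as in the walkthrough above

func main() {
    f, err := os.OpenFile("pack-000001.dat", os.O_RDWR, 0)
    if err != nil {
        panic(err)
    }
    defer f.Close()

    // Collapse one block-aligned run of punched holes (illustrative range).
    if err := unix.Fallocate(int(f.Fd()), unix.FALLOC_FL_COLLAPSE_RANGE,
        16<<20, 32<<20); err != nil {
        panic(err)
    }

    // Preallocate the new tail: mode 0 extends the file and reserves blocks,
    // letting the file system pick a contiguous extent up front if it can.
    fi, err := f.Stat()
    if err != nil {
        panic(err)
    }
    if err := unix.Fallocate(int(f.Fd()), 0, fi.Size(), preallocSize); err != nil {
        panic(err)
    }
}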

Now, after a while, this last condition stops being true, and no 128 MB-long sequences of free sectors exist anymore. My intuition here is that this is a case similar to ZFS filling up. If ZFS is fine with fragmentation after many cycles of writing and deleting files at 80% disk usage, it should be good enough for us as well. I actually believe this is a conservative estimate for ZFS, because they have to support a wide range of write/delete patterns, whereas Storj’s patterns are easier: for example, no random writes inside pieces. In any case, as long as new preallocations end up with chunks of at least 1 MB, most pieces will end up in one or two fragments on disk. I can’t prove this without proper simulations, but my belief is that at 90% disk usage we should still be getting a small number of decently large fragments for new preallocations. Hence the impact of slightly fragmented preallocations on writes will be small.

Does this argument convince you?