Design draft: a low I/O piece storage

I’m fine with fragmentation! But reading your paper, it seems to put quite a bit of focus on not fragmenting: pre-allocation, only extending files smaller than 128 MB, saying writes will be sequential (on the media), etc.

If over the long haul your design devolves into fragmented pack files, then I think you lose your expected 5-10x performance advantage over using the file system for pieces:

a) for reads, you still have to read the piece index and then read the piece from the pack file. This may save you one seek because your directory and inode data are combined, but IMO the OS is more likely to keep dedicated directory and inode buffers cached than your piece index buffers, so without simulations it’s not clear which will fare better.

Pieces do not become fragmented over time in your design, but the same is true if pieces are stored as individual files. (As an aside, I just realized that if pieces are today stored as individual files, there already is an average of 2K wasted per piece, so padding pieces to 4K for DIO with your design is equivalent.)

b) for writes, things start out nice and sequential (on media), but eventually it appears pack files will become fragmented anyway.

Here’s an outlandish idea: have you considered having one pre-allocated file with all the piece data and managing free space within it yourself? If you use O_DIRECT (direct I/O), you can read and write pieces directly, avoiding the buffer cache and read-modify-write. You’d have to maintain a free-space bitmap, where each bit represents 4K.

When deciding where to start writing pieces, find the biggest hole and start there. When it’s full, repeat. Since piece I/O doesn’t have any locality of reference, it doesn’t matter where you write pieces: it’s always going to take a random read to get them.

This would also let you do incremental compacting. You could move pieces one at a time or move large blocks of pieces. You’d have a lot of flexibility in the implementation, letting you optimize over a wide range of throughput/latency trade-offs.
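Purely to make the bitmap idea concrete, here is a minimal Go sketch of a free-space bitmap over 4 KiB blocks with a “biggest hole first” allocator. Everything here (names, sizes, the unpacked bool slice) is illustrative and my own; a real implementation would pack the bits and track holes incrementally instead of rescanning on every allocation.

```go
package main

import "fmt"

const blockSize = 4096 // bytes covered by one bit of the free-space map

// Bitmap tracks which fixed-size blocks of one big pre-allocated file are in use.
type Bitmap struct {
	used []bool // true = block occupied; a real version would pack 8 blocks per byte
}

// LargestHole returns the start block and length of the longest run of free blocks.
func (b *Bitmap) LargestHole() (start, length int) {
	curStart, curLen := -1, 0
	for i, u := range b.used {
		if u {
			curStart, curLen = -1, 0
			continue
		}
		if curStart < 0 {
			curStart = i
		}
		curLen++
		if curLen > length {
			start, length = curStart, curLen
		}
	}
	return start, length
}

// Allocate reserves n consecutive blocks at the start of the largest hole and
// returns the byte offset to write the piece at.
func (b *Bitmap) Allocate(n int) (int64, error) {
	start, length := b.LargestHole()
	if length < n {
		return 0, fmt.Errorf("no free run of %d blocks", n)
	}
	for i := start; i < start+n; i++ {
		b.used[i] = true
	}
	return int64(start) * blockSize, nil
}

func main() {
	bm := &Bitmap{used: make([]bool, 1024)} // 1024 blocks = 4 MiB of pre-allocated space
	off, err := bm.Allocate(3)              // room for a ~12 KiB piece, rounded up to 4 KiB blocks
	fmt.Println(off, err)                   // 0 <nil>
}
```

Freeing a piece is then just clearing its bits, and incremental compaction is copying a piece to a new offset, updating the index, and clearing the old bits.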

So, again, as long as the node has at least 150 MB of unused RAM per 1 TB of pieces, there is enough RAM to hold the piece index file and all necessary direntries and inodes of pack files in cache, so no need to seek for any of those.

This differs from the current situation where you need at least 1 GB of unused RAM per 1 TB of pieces (for a typical ext4 setup) to cache all necessary direntries and inodes.

And again, for the case where the node does not have enough RAM to cache the piece index file, it’s still saving one seek compared to the current approach.
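For anyone who wants to sanity-check the ballpark, the estimate boils down to (metadata bytes per piece) × (pieces per TB). The average piece size and per-piece byte costs below are placeholders of my own, not figures from the design doc; they are chosen only to land near the quoted numbers.

```go
package main

import "fmt"

// ramPerTB returns the bytes of cache needed per 1 TB of stored pieces, given
// an assumed average piece size and an assumed per-piece metadata cost.
func ramPerTB(avgPieceBytes, metadataBytesPerPiece float64) float64 {
	const tb = 1e12
	return tb / avgPieceBytes * metadataBytesPerPiece
}

func main() {
	// Placeholder: ~0.5 MB average piece and ~64 B of piece index per piece
	// -> ~128 MB per TB, in the same ballpark as the ~150 MB/TB quoted above.
	fmt.Printf("index: %.0f MB per TB\n", ramPerTB(0.5e6, 64)/1e6)

	// Placeholder: ~500 B of cached dentry+inode state per piece file
	// -> ~1 GB per TB, matching the ext4 figure above.
	fmt.Printf("ext4 metadata: %.1f GB per TB\n", ramPerTB(0.5e6, 500)/1e9)
}
```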

Indeed. I focused on fragmentation only because pack files will become fragmented with time, so I have to prove it won’t affect operation, i.e. that it will be no worse than the current, unfragmented piece files. Just that.

Yeah, too much work to make it efficient. Besides, file systems already do that well enough, which is why this proposal depends on the file system to manage disk space allocation. My belief is that writing piece storage on top of what is effectively block storage would not bring significant benefits over this proposal, while being much more complex.

I have a feeling you believe this proposal moves pieces within the HDD. No, it does not; it’s not necessary. Once a given piece is stored in a set of disk sectors, it is never moved from those sectors. Its logical address (piece ID, offset) changes due to compaction, but the data sectors are not moved.
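To make the “nothing is physically moved” point concrete: on Linux, a filesystem can change logical offsets inside a file purely by editing extent metadata, for example with FALLOC_FL_COLLAPSE_RANGE on ext4/XFS. Whether the proposal relies on this particular call is my assumption, not something stated above; the sketch only illustrates that such a mechanism exists.

```go
package main

import (
	"os"

	"golang.org/x/sys/unix"
)

// collapse removes `length` bytes starting at `offset` from the file by
// updating the filesystem's extent mapping; no data blocks are copied.
// Both offset and length must be multiples of the filesystem block size
// (typically 4096), and the call is supported on ext4 and XFS.
func collapse(f *os.File, offset, length int64) error {
	return unix.Fallocate(int(f.Fd()), unix.FALLOC_FL_COLLAPSE_RANGE, offset, length)
}

func main() {
	// "pack-000001" is a hypothetical pack file name used only for illustration.
	f, err := os.OpenFile("pack-000001", os.O_RDWR, 0)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Drop a dead, block-aligned region left by a deleted piece. Pieces after
	// it get smaller logical offsets, but their sectors stay where they are.
	if err := collapse(f, 0, 4096); err != nil {
		panic(err)
	}
}
```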

2 Likes

This article seems to be a nice discussion of fault tolerance in file systems; it might be relevant to the design.

On ext4, XFS or ZFS I would not hesitate to store files directly in the filesystem without using a database. Not only is it rare to lose power, at least if you run your applications on dedicated servers located in a data center, but those filesystems recover amazingly well without losing any data or missing any files.

1 Like

For NTFS, small files end up in the MFT, not in a partially used cluster. Just saying.

It is related indeed, but addresses a similar concern satellite-side, with different means and trade-offs as well.

The smallest piece file on storage nodes is 1 kB, which is already “large” for NTFS.

No, it’s for customers. The customer wants to upload 20M small segments (less than 64 MiB each, maybe even a few kB, like thumbnails), so it would be nice to have them packed and unpacked automatically; otherwise the customer will pay more for the segments than for the storage itself, see Understanding Storj Pricing Structure - Storj Docs

Right, bad news for drives >16 TB with NTFS; hope it’s worth the +4 TB, I bought a 20 TB one (8K cluster size is the standard there).

I went through it for the third time. Your writing appears to be extraordinary. It seemed that genuine interest was there, and then, suddenly, silence. How did it end up? Is there going to be a follow-up?

2 Likes

Seems @Toyoo is right and there are no objections here, including you.

This thing will simply require a lot of development time to implement, so any decision to go in this direction must be weighed against all other priorities, like giving Storj customers a better experience. Personally, I am happy that so far no substantial errors were found in the proposal, but given the complexity of the proposal, and the fact that I do not have the means to contribute code (though I’d love to have time to prepare at least a proof-of-concept implementation!), I have to accept that in the current circumstances that would be it.

2 Likes

It is a pity. Sometimes the I/O load the drives are put through is just hilarious, not to say brutal, and it probably causes a completely unnecessary, excessive carbon footprint. On the other hand, I should not be surprised by such brutality, taking into account that some developers here seem to have a ‘large-scale computing’ background. :- ) Nevertheless, I guess some paragraphs from your paper might still be implemented sooner or later. In such a case, I guess @Alexey should ping Mr. @bre and his SPBV (Special Purpose Bounty Vehicle). All in all, we are all saving for pro Storj setups. :- )

1 Like

On @bre’s behalf

6 Likes

Oh, :- ), thank you @nerdatwork. I am sorry, I was not aware of this fact. *“In such a case, I guess @Alexey should ping Ms. @bre and her SPBV (Special Purpose Bounty Vehicle).”

1 Like

lmao :joy: ty
20 characters

2 Likes

SPBV ?

Currently I’m working with Team Community and Team Bug Bounty.

Shameless plug: check out our new Bug Bounty Program page and let us know if you find vulnerabilities to report!

2 Likes

Yes, actually, SPBRV (Special Purpose Bounty Research Vehicle). I am sorry, unfortunately I do not work with any of those teams. I am mostly here out of my own interest and generally for fun. Nevertheless, I believe @Toyoo’s research paper is extraordinary, and the intellectual property expressed in this research paper, as I understand on behalf of your employer, perhaps deserves some voluntary expression of appreciation. I believe it’s safe to say you can’t deny that it is extraordinary, can you?

1 Like

Thank you for clarifying.
I will bring the proposal to the team, but I can’t make any promises.

1 Like

Thank you, absolutely. It’s understandable. I hope @Toyoo does not raise any objections. :- ) EDIT: P.S. I hope you can keep us all posted.

Of course. We are here to help the Community :slight_smile:

1 Like

Thanks for this design doc, @Toyoo. It’s a well-written design, I like it. And I finally had some time to collect my thoughts.

TL;DR: I like the approach, but I am not sure about the implementation.

So far I have heard about multiple attempts to improve the piece store. They usually save piece metadata in a database-like system and save the raw data in custom files.

(Technically it would be possible to store both metadata and data in a database, but it’s quite slow. I tried it, and measured it. :wink: )

Storing metadata and data in separate locations can be done in several different ways:

  1. Choose a database that already supports this and stores the data outside of the main indexed data.

Badger looks like something like this: it stores the data outside of the SST files, and most LSM-tree operations don’t need to touch the binary files themselves.

Jacob has a proof of concept of this approach (https://review.dev.storj.io/c/storj/storj/+/12066), and I also had an earlier experiment (GitHub - elek/storj-badger-storage).

I tried to run a real storagenode with this driver, but it turned out to be slow when I converted 1 TB of files to this hierarchy. I had a lot of GC calls. But it might be my fault; I didn’t really optimize Badger.

  2. Another approach is to use a ready-made database for the metadata. I had an earlier experiment with Postgres (GitHub - elek/storj-largefile-storage) and I have a local patch to switch to Pebble (see the sketch at the end of this post). I ran this with another storagenode successfully, but after a while one of the satellites disqualified me (might be related to a bug in my implementation).

I didn’t fully implement online repacking, just CLI commands that did the cleanup job (removing unused blobs from packed files).

Note: Apache Ozone has something similar with LevelDB + files.

  3. The third option is what the design suggests, where both metadata and data are stored in a custom format. Metadata is stored similarly to a database/LSM tree, with the help of some kind of journal.

While it has some advantages (it doesn’t require a full database, just the core concepts, and it’s more flexible), I have some concerns about it. I have seen a similar implementation in Apache Hadoop, and it was not without additional problems. Implementing all the pieces can be hard (I have seen very slow journal reading after a failure, which caused further problems).

So my preference is either 2 or 1. I am afraid of the complexity of implementing custom journaling.

One more thought: in approach 2, I started with fully separate files and combined/packed them only after a while. It’s a very useful approach, as it can support both hot files (which are deleted very soon after creation) and cold files (which are not touched for a long time).
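Since approach 2 is one of the two preferred here, a minimal sketch of that layout follows, under my own assumptions (the record encoding, the file name, and the choice of Pebble as the metadata store are illustrative, not the format of the linked experiments): piece bytes get appended to a pack file, and the key-value store only records where each piece lives.

```go
package main

import (
	"encoding/binary"
	"io"
	"os"

	"github.com/cockroachdb/pebble"
)

// pieceRecord is the per-piece metadata: where the piece bytes live.
type pieceRecord struct {
	PackID uint32 // which pack file
	Offset uint64 // byte offset inside the pack file
	Length uint32 // piece size in bytes
}

func (r pieceRecord) encode() []byte {
	buf := make([]byte, 16)
	binary.BigEndian.PutUint32(buf[0:4], r.PackID)
	binary.BigEndian.PutUint64(buf[4:12], r.Offset)
	binary.BigEndian.PutUint32(buf[12:16], r.Length)
	return buf
}

// storePiece appends the piece data to the pack file and records its location
// under the piece ID in the metadata store.
func storePiece(db *pebble.DB, pack *os.File, packID uint32, pieceID, data []byte) error {
	off, err := pack.Seek(0, io.SeekEnd)
	if err != nil {
		return err
	}
	if _, err := pack.Write(data); err != nil {
		return err
	}
	rec := pieceRecord{PackID: packID, Offset: uint64(off), Length: uint32(len(data))}
	return db.Set(pieceID, rec.encode(), pebble.Sync)
}

func main() {
	db, err := pebble.Open("piece-metadata", &pebble.Options{})
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Hypothetical pack file name, for illustration only.
	pack, err := os.OpenFile("pack-000001", os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		panic(err)
	}
	defer pack.Close()

	if err := storePiece(db, pack, 1, []byte("example-piece-id"), []byte("piece bytes")); err != nil {
		panic(err)
	}
}
```

Repacking then becomes a scan over the metadata keyspace that copies live pieces into a fresh pack file and rewrites their records, which fits the “pack only after a while” strategy described above.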

7 Likes