Design draft: a low I/O piece storage

Thanks for this design doc @Toyoo, it's a well-written design and I like it. I finally had some time to collect my thoughts.

TL;DR: I like the approach, but I am not sure about the implementation.

So far I have heard about multiple attempts to improve the piece store. They usually save the metadata of pieces in a database-like system and the raw data in custom files.

(Technically it would be possible to store both metadata and data in a database, but it's quite slow. I tried it, and measured it :wink: )

Storing metadata and data in separate locations can be done in multiple different ways:

  1. Choose a database which already supports this and stores the data outside of the main indexed data.

Badger works something like this. It stores the data outside of the SST files, and most LSM tree operations don't need to touch the binary files themselves.

Jacob has a proof of concept of this approach (https://review.dev.storj.io/c/storj/storj/+/12066) and I also had an earlier experiment (GitHub - elek/storj-badger-storage).

I tried to run a real storagenode with this driver, but it turned out to be slow when I converted 1 TB of files to this hierarchy. I had a lot of GC calls. But it might be my fault, I didn't really optimize badger.
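Just to illustrate what I mean (this is only a sketch, not the code from either PoC; I'm assuming the badger v4 Go API and made up the paths/keys):

```go
package main

import (
	"log"

	badger "github.com/dgraph-io/badger/v4"
)

func main() {
	// Values larger than ValueThreshold live in the value log, outside the
	// LSM tree; the tree only keeps the key and a pointer to the value.
	opts := badger.DefaultOptions("/tmp/piecestore").WithValueThreshold(1024)
	db, err := badger.Open(opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	pieceID := []byte("satellite-1/piece-abc")   // hypothetical key layout
	pieceData := make([]byte, 2<<20)             // a 2 MiB piece goes to the value log

	// Write: one transaction stores both the key and the (large) value.
	if err := db.Update(func(txn *badger.Txn) error {
		return txn.Set(pieceID, pieceData)
	}); err != nil {
		log.Fatal(err)
	}

	// Read: the LSM lookup returns a pointer, the data is read from the value log.
	if err := db.View(func(txn *badger.Txn) error {
		item, err := txn.Get(pieceID)
		if err != nil {
			return err
		}
		_, err = item.ValueCopy(nil)
		return err
	}); err != nil {
		log.Fatal(err)
	}

	// Space from deleted pieces is only reclaimed by value-log GC, which
	// rewrites vlog files; this is the GC load I ran into.
	_ = db.RunValueLogGC(0.5)
}
```

The nice part is that the separation of index and data comes for free; the price is that you have to tune and schedule the value-log GC yourself.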

  2. Another approach is to use a ready-made database for the metadata. I had an earlier experiment with using postgres (GitHub - elek/storj-largefile-storage) and I have a local patch to switch to pebble. I ran this with another storagenode successfully, but after a while one of the satellites disqualified me (which might be related to a bug in my implementation).

I didn't fully implement online repacking, only CLI commands which did the cleanup job (removing unused blobs from packed files).

Note: Apache Ozone does something similar with leveldb + files.
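The core of approach 2 is very small. A minimal sketch of what my pebble experiment does conceptually (names like writePiece and the 16-byte metadata record are mine, not from the patch; the pebble calls are the real API):

```go
package main

import (
	"encoding/binary"
	"io"
	"log"
	"os"

	"github.com/cockroachdb/pebble"
)

// writePiece appends the piece data to a shared pack file and stores the
// location (offset + length) in pebble under the piece ID.
func writePiece(db *pebble.DB, pack *os.File, pieceID, data []byte) error {
	offset, err := pack.Seek(0, io.SeekEnd) // append at the end of the pack file
	if err != nil {
		return err
	}
	if _, err := pack.Write(data); err != nil {
		return err
	}

	// Metadata record: 8-byte offset + 8-byte length. A real record would
	// also carry the pack file name, hash, expiration, etc.
	meta := make([]byte, 16)
	binary.BigEndian.PutUint64(meta[0:8], uint64(offset))
	binary.BigEndian.PutUint64(meta[8:16], uint64(len(data)))
	return db.Set(pieceID, meta, pebble.Sync)
}

func main() {
	db, err := pebble.Open("/tmp/piece-meta", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	pack, err := os.OpenFile("/tmp/pack-0001.blob", os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer pack.Close()

	if err := writePiece(db, pack, []byte("piece-abc"), make([]byte, 1<<20)); err != nil {
		log.Fatal(err)
	}
}
```

Deletes only touch the metadata store, which is why the packed files accumulate dead blobs over time and need the repacking/cleanup step mentioned above.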

  3. The third option is what the design suggests, where both metadata and data are stored in a custom format. Metadata is stored similarly to a database/LSM tree with the help of some kind of journal.

While it has some advantages (it doesn't require a full database, just the core concepts, and it's more flexible), I have some concerns about it. I have seen a similar implementation in Apache Hadoop, and it was not without additional problems. Implementing all the pieces can be hard (I have seen very slow journal reading after a problem, which caused other problems).
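To show what "just the core concepts" still means in practice, here is a deliberately minimal sketch of an append-only metadata journal with startup replay (my own illustration, not from the design; record layout and names are invented):

```go
package journal

import (
	"bufio"
	"encoding/binary"
	"errors"
	"io"
	"os"
)

// A minimal fixed-size journal record. A real implementation also needs
// checksums, rotation, compaction into SSTables and an fsync policy;
// that is exactly the part that is easy to get subtly wrong.
type record struct {
	op      byte // 0 = put, 1 = delete
	pieceID [32]byte
	offset  uint64
	length  uint64
}

// appendRecord serializes one record to the end of the journal.
func appendRecord(w io.Writer, r record) error {
	buf := make([]byte, 1+32+8+8)
	buf[0] = r.op
	copy(buf[1:33], r.pieceID[:])
	binary.BigEndian.PutUint64(buf[33:41], r.offset)
	binary.BigEndian.PutUint64(buf[41:49], r.length)
	_, err := w.Write(buf)
	return err
}

// replay rebuilds the in-memory index on startup; with a large journal
// (or after a crash left it in a bad state) this scan is what makes the
// node slow to start.
func replay(path string) (map[[32]byte]record, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	index := make(map[[32]byte]record)
	r := bufio.NewReader(f)
	buf := make([]byte, 1+32+8+8)
	for {
		if _, err := io.ReadFull(r, buf); err != nil {
			if errors.Is(err, io.EOF) {
				return index, nil
			}
			return nil, err // a truncated/corrupt tail needs explicit handling
		}
		var rec record
		rec.op = buf[0]
		copy(rec.pieceID[:], buf[1:33])
		rec.offset = binary.BigEndian.Uint64(buf[33:41])
		rec.length = binary.BigEndian.Uint64(buf[41:49])
		if rec.op == 1 {
			delete(index, rec.pieceID)
		} else {
			index[rec.pieceID] = rec
		}
	}
}
```

Even this toy version raises the questions that bit us in Hadoop: when to truncate, how to handle a half-written record, and how long replay takes on a big node.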

So my preference is either 2 or 1; I am afraid of the complexity of implementing custom journaling.

One more thought: in approach 2, I started with fully separate files and combined/packed them only after a while. It's a very useful approach, as it can support both hot pieces (which are deleted very soon after creation) and cold pieces (which are not touched for a long time).
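Roughly like this (again only a sketch under my own assumptions; directory layout, names and the 24h threshold are illustrative, and a real version would update the metadata store before removing the loose file):

```go
package main

import (
	"io"
	"log"
	"os"
	"path/filepath"
	"time"
)

// packColdPieces moves loose piece files older than minAge into a single
// pack file. Hot pieces never get packed, so deleting them is just an
// unlink; only the survivors pay the cost of packing.
func packColdPieces(looseDir, packPath string, minAge time.Duration) error {
	pack, err := os.OpenFile(packPath, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer pack.Close()

	entries, err := os.ReadDir(looseDir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		info, err := e.Info()
		if err != nil || info.IsDir() {
			continue
		}
		if time.Since(info.ModTime()) < minAge {
			continue // still hot: leave it as a standalone file
		}
		src, err := os.Open(filepath.Join(looseDir, e.Name()))
		if err != nil {
			return err
		}
		if _, err := io.Copy(pack, src); err != nil {
			src.Close()
			return err
		}
		src.Close()
		if err := os.Remove(filepath.Join(looseDir, e.Name())); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	if err := packColdPieces("/tmp/pieces/loose", "/tmp/pieces/pack-0001.blob", 24*time.Hour); err != nil {
		log.Fatal(err)
	}
}
```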
