Design draft: a low I/O piece storage

I’m fine with fragmentation! But reading your paper, it seems to put quite a bit of focus on not fragmenting: pre-allocation, only extending files smaller than 128 MB, saying writes will be sequential (on the media), etc.

If over the long haul your design devolves into fragmented pack files, then I think you lose your expected 5-10x performance advantage over using the file system for pieces:

a) for reads, you still have to read the piece index and then read the piece from the pack file. This may save you one seek because your directory and inode data are combined, but IMO the OS is more likely to keep dedicated directory and inode buffers cached than your piece index buffers, so without simulations it’s not clear which will fare better.

Pieces do not become fragmented over time in your design, but the same is true if pieces are stored as individual files. (As an aside, I just realized that if pieces are today stored as individual files, there already is an average of 2K wasted per piece, so padding pieces to 4K for DIO with your design is equivalent.)

b) for writes, things start out nice and sequential (on media), but eventually it appears pack files will become fragmented anyway.

Here’s an outlandish idea: have you considered having one pre-allocated file with all the piece data and managing free space within it yourself? If you use O_DIRECT (direct I/O), you can read and write pieces directly, avoiding the buffer cache and read-modify-write. You’d have to maintain a free-space bitmap, where each bit represents 4K.

When deciding where to start writing pieces, find the biggest hole and start there. When it’s full, repeat. Since piece I/O doesn’t have any locality of reference, it doesn’t matter where you write pieces: it’s always going to take a random read to get them.

This would also let you do incremental compacting. You could move pieces one at a time or move large blocks of pieces. You’d have a lot of flexibility in the implementation, letting you optimize over a wide range of throughput/latency trade-offs.
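Purely to make the bitmap idea concrete, here is a minimal Go sketch of a free-space bitmap over 4 KiB blocks with a “biggest hole first” allocator. Everything here (names, sizes, the unpacked bool slice) is illustrative and my own; a real implementation would pack the bits and track holes incrementally instead of rescanning on every allocation.

```go
package main

import "fmt"

const blockSize = 4096 // bytes covered by one bit of the free-space map

// Bitmap tracks which fixed-size blocks of one big pre-allocated file are in use.
type Bitmap struct {
	used []bool // true = block occupied; a real version would pack 8 blocks per byte
}

// LargestHole returns the start block and length of the longest run of free blocks.
func (b *Bitmap) LargestHole() (start, length int) {
	curStart, curLen := -1, 0
	for i, u := range b.used {
		if u {
			curStart, curLen = -1, 0
			continue
		}
		if curStart < 0 {
			curStart = i
		}
		curLen++
		if curLen > length {
			start, length = curStart, curLen
		}
	}
	return start, length
}

// Allocate reserves n consecutive blocks at the start of the largest hole and
// returns the byte offset to write the piece at.
func (b *Bitmap) Allocate(n int) (int64, error) {
	start, length := b.LargestHole()
	if length < n {
		return 0, fmt.Errorf("no free run of %d blocks", n)
	}
	for i := start; i < start+n; i++ {
		b.used[i] = true
	}
	return int64(start) * blockSize, nil
}

func main() {
	bm := &Bitmap{used: make([]bool, 1024)} // 1024 blocks = 4 MiB of pre-allocated space
	off, err := bm.Allocate(3)              // room for a ~12 KiB piece, rounded up to 4 KiB blocks
	fmt.Println(off, err)                   // 0 <nil>
}
```

Freeing a piece is then just clearing its bits, and incremental compaction is copying a piece to a new offset, updating the index, and clearing the old bits.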

So, again, as long as the node has at least 150 MB of unused RAM per 1 TB of pieces, there is enough RAM to hold the piece index file and all necessary direntries and inodes of pack files in cache, so no need to seek for any of those.

This differs from the current situation where you need at least 1 GB of unused RAM per 1 TB of pieces (for a typical ext4 setup) to cache all necessary direntries and inodes.

And again, for the case where the node does not have enough RAM to cache the piece index file, it’s still saving one seek compared to the current approach.
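For anyone who wants to sanity-check the ballpark, the estimate boils down to (metadata bytes per piece) × (pieces per TB). The average piece size and per-piece byte costs below are placeholders of my own, not figures from the design doc; they are chosen only to land near the quoted numbers.

```go
package main

import "fmt"

// ramPerTB returns the bytes of cache needed per 1 TB of stored pieces, given
// an assumed average piece size and an assumed per-piece metadata cost.
func ramPerTB(avgPieceBytes, metadataBytesPerPiece float64) float64 {
	const tb = 1e12
	return tb / avgPieceBytes * metadataBytesPerPiece
}

func main() {
	// Placeholder: ~0.5 MB average piece and ~64 B of piece index per piece
	// -> ~128 MB per TB, in the same ballpark as the ~150 MB/TB quoted above.
	fmt.Printf("index: %.0f MB per TB\n", ramPerTB(0.5e6, 64)/1e6)

	// Placeholder: ~500 B of cached dentry+inode state per piece file
	// -> ~1 GB per TB, matching the ext4 figure above.
	fmt.Printf("ext4 metadata: %.1f GB per TB\n", ramPerTB(0.5e6, 500)/1e9)
}
```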

Indeed. I focused on fragmentation only because pack files will become fragmented with time, so I have to prove it won’t affect operation, i.e. that it will be no worse than the current, unfragmented piece files. Just that.

Yeah, too much work to make it efficient. Besides, file systems already do that well enough, which is why this proposal depends on the file system to manage disk space allocation. My belief is that writing piece storage on top of what is effectively block storage would not bring significant benefits over this proposal, while being much more complex.

I have a feeling you believe this proposal moves pieces within the HDD. No, it does not; it’s not necessary. Once a given piece is stored in a set of disk sectors, it is never moved from those sectors. Its logical address (piece ID, offset) changes due to compaction, but the data sectors are not moved.
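To make the “nothing is physically moved” point concrete: on Linux, a filesystem can change logical offsets inside a file purely by editing extent metadata, for example with FALLOC_FL_COLLAPSE_RANGE on ext4/XFS. Whether the proposal relies on this particular call is my assumption, not something stated above; the sketch only illustrates that such a mechanism exists.

```go
package main

import (
	"os"

	"golang.org/x/sys/unix"
)

// collapse removes `length` bytes starting at `offset` from the file by
// updating the filesystem's extent mapping; no data blocks are copied.
// Both offset and length must be multiples of the filesystem block size
// (typically 4096), and the call is supported on ext4 and XFS.
func collapse(f *os.File, offset, length int64) error {
	return unix.Fallocate(int(f.Fd()), unix.FALLOC_FL_COLLAPSE_RANGE, offset, length)
}

func main() {
	// "pack-000001" is a hypothetical pack file name used only for illustration.
	f, err := os.OpenFile("pack-000001", os.O_RDWR, 0)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Drop a dead, block-aligned region left by a deleted piece. Pieces after
	// it get smaller logical offsets, but their sectors stay where they are.
	if err := collapse(f, 0, 4096); err != nil {
		panic(err)
	}
}
```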

2 Likes

This article seems to be a nice discussion of fault tolerance in file systems; it might be relevant to the design.

On ext4, XFS or ZFS I would not hesitate to store files directly in the filesystem without using a database. Not only is it rare to lose power, at least if you run your applications on dedicated servers located in a data center, but those filesystems recover amazingly well without losing any data or missing any files.

1 Like

For NTFS, small files end up in the MFT, not in a partially used cluster. Just saying.

It is related indeed, but addresses a similar concern satellite-side, with different means and trade-offs as well.

The smallest piece file on storage nodes is 1 kB, which is already “large” for NTFS.

No, it’s for customers. The customer wants to upload 20M small segments (less than 64 MiB each, maybe even a few kB, like thumbnails), so it would be nice to have them packed and unpacked automatically; otherwise the customer will pay more for the segments than for the storage itself, see Understanding Storj Pricing Structure - Storj Docs

Right, bad news for drives >16 TB with NTFS; hope it’s worth the +4 TB, I bought a 20 TB one (8K cluster size is the standard there).

I went through it for the third time. Your writing appears to be extraordinary. It seemed that genuine interest was there, and then, suddenly, silence. How did it end up? Is there going to be a follow-up?

2 Likes

Seems @Toyoo is right and there are no objections here, including you.

This thing will simply require a lot of development time to implement, so any decision to go in this direction must be weighed against all other priorities, like giving Storj customers a better experience. Personally, I am happy that so far no substantial errors were found in the proposal, but given the complexity of the proposal, and the fact that I do not have the means to contribute code (though I’d love to have time to prepare at least a proof-of-concept implementation!), I have to accept that in the current circumstances that would be it.

2 Likes

It is a pity. Sometimes the I/O load the drives are put through is just hilarious, not to say brutal, and it probably causes a completely unnecessary, excessive carbon footprint. On the other hand, I should not be surprised by such brutality, taking into account that some developers here seem to have a ‘large-scale computing’ background. :- ) Nevertheless, I guess some paragraphs from your paper might still be implemented sooner or later. In such a case, I guess @Alexey should ping Mr. @bre and his SPBV (Special Purpose Bounty Vehicle). All in all, we are all saving for pro Storj setups. :- )

1 Like

On @bre’s behalf

6 Likes

Oh, :- ), thank you @nerdatwork. I am sorry, I was not aware of this fact. *“In such a case, I guess @Alexey should ping Ms. @bre and her SPBV (Special Purpose Bounty Vehicle).”

1 Like

lmao :joy: ty
20 characters

2 Likes

SPBV ?

Currently I’m working with Team Community and Team Bug Bounty.

Shameless plug: check out our new Bug Bounty Program page and let us know if you find vulnerabilities to report!

2 Likes

Yes, actually, SPBRV (Special Purpose Bounty Research Vehicle). I am sorry, unfortunately I do not work with any of those teams. I am mostly here out of my own interest and generally for fun. Nevertheless, I believe @Toyoo’s research paper is extraordinary, and the intellectual property expressed in this research paper, as I understand on behalf of your employer, perhaps deserves some voluntary expression of appreciation. I believe it’s safe to say you can’t deny that it is extraordinary, can you?

1 Like

Thank you for clarifying.
I will bring the proposal to the team, but I can’t make any promises.

1 Like

Thank you, absolutely. It’s understandable. I hope @Toyoo does not raise any objections. :- ) EDIT: P.S. I hope you can keep us all posted.

Of course. We are here to help the Community :slight_smile:

1 Like

Thanks for this design doc, @Toyoo. It’s a well-written design, I like it. And I finally had some time to collect my thoughts.

TL;DR: I like the approach, but I am not sure about the implementation.

So far I have heard about multiple attempts to improve the piece store. They usually save piece metadata in a database-like system and save the raw data in custom files.

(Technically it would be possible to store both metadata and data in a database, but it’s quite slow. I tried it, and measured it. :wink: )

Storing metadata and data in separate locations can be done in several different ways:

  1. Choose a database that already supports this and stores the data outside of the main indexed data.

Badger looks like something like this: it stores the data outside of the SST files, and most LSM-tree operations don’t need to touch the binary files themselves.

Jacob has a proof of concept of this approach (https://review.dev.storj.io/c/storj/storj/+/12066), and I also had an earlier experiment (GitHub - elek/storj-badger-storage).

I tried to run a real storagenode with this driver, but it turned out to be slow when I converted 1 TB of files to this hierarchy. I had a lot of GC calls. But it might be my fault; I didn’t really optimize Badger.

  2. Another approach is to use a ready-made database for the metadata. I had an earlier experiment with Postgres (GitHub - elek/storj-largefile-storage) and I have a local patch to switch to Pebble (see the sketch at the end of this post). I ran this with another storagenode successfully, but after a while one of the satellites disqualified me (might be related to a bug in my implementation).

I didn’t fully implement online repacking, just CLI commands that did the cleanup job (removing unused blobs from packed files).

Note: Apache Ozone has something similar with LevelDB + files.

  3. The third option is what the design suggests, where both metadata and data are stored in a custom format. Metadata is stored similarly to a database/LSM tree, with the help of some kind of journal.

While it has some advantages (it doesn’t require a full database, just the core concepts, and it’s more flexible), I have some concerns about it. I have seen a similar implementation in Apache Hadoop, and it was not without additional problems. Implementing all the pieces can be hard (I have seen very slow journal reading after a failure, which caused further problems).

So my preference is either 2 or 1. I am afraid of the complexity of implementing custom journaling.

One more thought: in approach 2, I started with fully separate files and combined/packed them only after a while. It’s a very useful approach, as it can support both hot files (which are deleted very soon after creation) and cold files (which are not touched for a long time).
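Since approach 2 is one of the two preferred here, a minimal sketch of that layout follows, under my own assumptions (the record encoding, the file name, and the choice of Pebble as the metadata store are illustrative, not the format of the linked experiments): piece bytes get appended to a pack file, and the key-value store only records where each piece lives.

```go
package main

import (
	"encoding/binary"
	"io"
	"os"

	"github.com/cockroachdb/pebble"
)

// pieceRecord is the per-piece metadata: where the piece bytes live.
type pieceRecord struct {
	PackID uint32 // which pack file
	Offset uint64 // byte offset inside the pack file
	Length uint32 // piece size in bytes
}

func (r pieceRecord) encode() []byte {
	buf := make([]byte, 16)
	binary.BigEndian.PutUint32(buf[0:4], r.PackID)
	binary.BigEndian.PutUint64(buf[4:12], r.Offset)
	binary.BigEndian.PutUint32(buf[12:16], r.Length)
	return buf
}

// storePiece appends the piece data to the pack file and records its location
// under the piece ID in the metadata store.
func storePiece(db *pebble.DB, pack *os.File, packID uint32, pieceID, data []byte) error {
	off, err := pack.Seek(0, io.SeekEnd)
	if err != nil {
		return err
	}
	if _, err := pack.Write(data); err != nil {
		return err
	}
	rec := pieceRecord{PackID: packID, Offset: uint64(off), Length: uint32(len(data))}
	return db.Set(pieceID, rec.encode(), pebble.Sync)
}

func main() {
	db, err := pebble.Open("piece-metadata", &pebble.Options{})
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Hypothetical pack file name, for illustration only.
	pack, err := os.OpenFile("pack-000001", os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		panic(err)
	}
	defer pack.Close()

	if err := storePiece(db, pack, 1, []byte("example-piece-id"), []byte("piece bytes")); err != nil {
		panic(err)
	}
}
```

Repacking then becomes a scan over the metadata keyspace that copies live pieces into a fresh pack file and rewrites their records, which fits the “pack only after a while” strategy described above.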

7 Likes