Design draft: a low I/O piece storage

This is all interesting!

I believe it would be interesting to see what Badger’s maintainers would think about your experiences.

The network is resilient!

This is probably no longer relevant, given that deletions now happen only during GC, so effectively at least a week, and possibly even two weeks, later.

Nice try. :-)


Please do not refer to anyone on this forum as “Babe”. That is out of line.

5 Likes

(Off-topic) Not only this, but thanks for the message, cheers. :-)

Usually I don't answer these kinds of comments. I am here only for technical discussions and for helping with technical topics; I'm less interested in the other stuff. I don't even understand what it is all about (they = ?).

But I am adding this comment because I fear my answer was not clear enough.

This design draft is exactly the kind of technical discussion I like most. It's also very well written, and I learned from it (for example, the FALLOC_ usage was a completely new idea to me).
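For anyone else to whom this is new: as I understand the draft, the idea is to use fallocate(2) flags, presumably FALLOC_FL_PUNCH_HOLE, to free the blocks of dead pieces inside a big file without rewriting it. A minimal sketch in Go, assuming Linux and the golang.org/x/sys/unix package (the file name and offsets are made up):

```go
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// Open a scratch file; in the draft this would be a large pack file.
	f, err := os.OpenFile("pack.dat", os.O_RDWR, 0)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Punch a hole over a dead piece: the byte range still reads back
	// as zeros and the file size is unchanged (KEEP_SIZE), but the
	// underlying blocks are returned to the filesystem.
	// The offset and length here are made-up example values.
	const off, length = 4096, 8192
	err = unix.Fallocate(int(f.Fd()),
		unix.FALLOC_FL_PUNCH_HOLE|unix.FALLOC_FL_KEEP_SIZE,
		off, length)
	if err != nil {
		log.Fatal(err)
	}
}
```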

I am very grateful for the conversation, and I think the best way to respect it is to add other views, opinions, and technical details.

Cheers.

7 Likes

I think it's still relevant. I don't really care about small files from the last 1-4 weeks; that's probably 1-2 million files, and some of them will be deleted during that period.

If all the other (cold) files can be packed (10-30 million files), I am still fine.

1 Like

I usually do not engage in this kind of conversation on public forums, especially on forums belonging to so-called “open source companies”, mostly due to the dual nature of such undertakings, which is of course very fragile.

Anyway, that's great to hear. However, I maintain my opinion that it looked like you, I mean the other side of this community, did not like this idea. I got this feeling because some questions addressed directly to your side went unanswered even after a few days (after a prolonged time frame). On top of that, I would again like to point your attention to potentially valuable “intellectual property” that could be secured directly by your company, if handled correctly. I do not think there is a reason to be shy about that; nevertheless, I admit that some legal advice might be needed in this particular case, and again, that is nothing to be shy of nor afraid of. And in addition, may I ask you a question: do you think it is common practice to discuss a company's product development strategy publicly?

[…] design draft is exactly the kind of technical discussion I like most. It's also very well written, and I learned from it […]

I don't know what to say. It seems that @Toyoo is pretty advanced on these topics, including the Storj code base, and possibly has some academic background. However, I have to admit that I don't think this is a university; I mean, it's probably not a post-doc university course. Anyway, from my point of view, the question still remains open: do you have a special-purpose research investment vehicle or not? I am afraid that if you do not have such a vehicle, this remains mostly an academic discussion. Always a pleasure, I mean, always a pleasure to make new friends.

Cheers.

Nice reading, although improving I/O usage is quite simple and complex at the same time.
I am going to give my two cents about storage and how to improve the I/O of HDDs.

TL;DR: databases and journals have high I/O usage, and that is where we can get the most gains.

Anything that requires databases kills the I/O of a mechanical drive. You can delay writes, waiting for a low-I/O window to commit changes, at the expense of losing data in a power-failure situation.

We notice this in ext4 by simply adding noatime, so that the access time of a file is not updated on every read and no extra write hits the FS journal…
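For illustration, a hypothetical /etc/fstab entry combining noatime with delayed journal commits (ext4's commit= option), in the spirit of the delayed-commit idea above; the device, mount point, and interval are just examples:

```
# noatime: skip access-time updates on reads.
# commit=60: batch ext4 journal commits every 60 s instead of the
# default 5 s - fewer writes, at the risk of losing up to 60 s of
# data on a power failure.
/dev/sdb1  /mnt/storagenode  ext4  defaults,noatime,commit=60  0  2
```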

Block size also matters, but it can't be aimed at the average file size; it has to target, for example, the 80th to 95th percentile of file sizes, comparing the loss of space against real file usage. This depends on the drive size and probably on whether the drive is SMR or CMR; I do not know the impact of a large block size on an SMR drive.
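A back-of-the-envelope way to compare candidate block sizes, as a rough sketch in Go (the piece sizes below are made up; in practice you would feed in a node's real file-size distribution):

```go
package main

import "fmt"

// slack returns the total allocated-but-unused bytes when every file
// in sizes is stored with the given filesystem block size.
func slack(sizes []int64, block int64) int64 {
	var wasted int64
	for _, s := range sizes {
		if rem := s % block; rem != 0 {
			wasted += block - rem
		}
	}
	return wasted
}

func main() {
	// Hypothetical piece sizes in bytes; replace with real data.
	sizes := []int64{512, 1800, 4096, 12000, 181000}
	for _, block := range []int64{4096, 16384, 65536} {
		fmt.Printf("block %6d B: %7d B of slack\n", block, slack(sizes, block))
	}
}
```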

If you really want to improve I/O, the choices are rather simple: move high-I/O applications to other storage, use a cache, and/or delay and bundle commits. You can do one or all of them.
Everything else to do with filesystems is a micro-to-medium optimization; it helps, but only up to a point.

The main killer of Storj I/O performance, IMO, is the database or journal: the ledger has really high I/O, and the head has to move a lot across the platter. One way to fix this is to move the database to high-I/O storage like an SSD, or to permit the use of external database applications like MySQL/MariaDB or PostgreSQL that are clustered for resilience and hosted on high-I/O storage; this way you can have a single database engine for many nodes.
The issue is that the nodes are then no longer independent, isolated applications and have a single point of failure: the database application.
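The SSD option is, if I remember the option name correctly, already possible today with the storage node's storage2.database-dir setting; a sketch of the relevant config.yaml fragment, with an example path:

```yaml
# config.yaml: keep piece data on the HDD, but move the node's
# SQLite databases to an SSD-backed directory (example path).
storage2.database-dir: /mnt/ssd/storagenode-dbs
```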

One concern about caching too much stuff in memory: most of these apps run on machines that do not have ECC memory, and while single-event upsets are rare, I would rather not push my luck.

1 Like

I do not consider myself an expert on these things; nevertheless, based on my understanding, I can fully agree with you. It seems like you were writing about one of the options covered by @elek in his summary post. BTW, in another of my posts here I recently suggested simply testing XFS with data and metadata separated as a simple solution to the problem. Having said this, in this particular case I believe it is a slightly different perspective. It looks to me that the research paper being discussed here is innovative in the field of filesystem designs. Uff, I hope I got the conclusion right; I read it only three times. :-)

@s-t-o-r-j-user, please let me phrase the difficulty and the risks involved in this approach in a different way, in the hope of making them clearer.

This proposal is a minefield of corner cases coming from all the weird setups that Storj runs on now. And if you make a mistake, customers lose data, and Storj loses customers. It’s going to take months of work to make sure it works correctly, time better spent implementing customer-facing features, like object versioning, bucket logging (I know that the company I work for would consider these two features essential), or zkSync Era support for customers (which could put more liquidity into L2, and so be of benefit to node operators as well).

I did not write this proposal to ask for an implementation now. I had some free time to think about a potential improvement, so I made some notes, just that. I wouldn't be surprised if it turns out there are dozens of low-hanging fruit for storage node improvements that could be implemented in the time it will take to get code for this single draft ready. And when we run out of low-hanging fruit and Storj grows enough to fund bigger projects with no immediate prospects for bringing in more customers, then there will be time to revisit this proposal.

As for me, I'm only learning golang, having written less than 1 kLOC of code that ended up in production, and even that only at a very small scale. While I think I could build a prototype, people more experienced than me would have to clean up after it. Providing a design is probably the most you can expect from me now.

7 Likes

Dear Toyoo, I still believe there are characteristics of potential innovation in the field of filesystem design here, hence my suggestion regarding the legal aspects of intellectual property rights. Such an approach would make navigating the various market forces much easier for Storj Inc., which would benefit not only the end users but also the storage node service providers. At the end of the day, it's up to the management of the company, and up to you, how the situation is handled. :-)

Oh, don't worry about that, I'm going to benefit from this proposal anyway. I've learned a lot, I had fun working on it, and my IT experience is already richer. And whoever implements it in the future will either prove the feasibility of the design (good: the ideas were correct, so I can, e.g., show off on my CV that I'm good at design) or point out mistakes (good: I'm learning). But working out the details of intellectual property law in such a case has a big opportunity cost.

2 Likes

I am not worried. :-)

If the small files (1-4 kB, or 1-8 kB for bigger drives with NTFS, maybe a custom setting for RAIDs with bigger clusters) on a node could be “collected” and stored in one big file,

let's say, one per double-letter folder, to spread the risk,

I think it would be one step forward in reducing I/O.

(Maybe with some kind of parity/Reed-Solomon, since we save on overhead.)

But I'm no programmer, so maybe just ignore my thoughts.
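To make the idea concrete, a minimal sketch of reading one piece out of such a pack in Go, assuming a simple format where pieces are appended to one file per double-letter folder and located through an index; every name and structure here is hypothetical, not something from the design draft:

```go
package main

import (
	"fmt"
	"os"
)

// packEntry locates one piece inside a big pack file. In a real
// implementation the index would be persisted and checksummed;
// here it is just an in-memory map from piece ID to entry.
type packEntry struct {
	offset int64 // byte offset of the piece within the pack
	size   int64 // piece length in bytes
}

// readPiece serves one small piece out of a single big pack file
// (say, one pack per double-letter folder) with one positioned read,
// instead of an open/read/close on a tiny file of its own.
func readPiece(pack *os.File, e packEntry) ([]byte, error) {
	buf := make([]byte, e.size)
	if _, err := pack.ReadAt(buf, e.offset); err != nil {
		return nil, err
	}
	return buf, nil
}

func main() {
	pack, err := os.Open("aa.pack") // hypothetical pack for folder "aa"
	if err != nil {
		fmt.Println(err)
		return
	}
	defer pack.Close()

	index := map[string]packEntry{ // made-up index contents
		"piece-1": {offset: 0, size: 1024},
	}
	data, err := readPiece(pack, index["piece-1"])
	fmt.Println(len(data), err)
}
```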

The filesystem doesn't matter too much. More important is how we store the data.
A “rightly” configured FS may bring some improvements, but they are limited (also because not everyone is able to do a custom, in your case very custom, setup); we need something more generic, like @Toyoo suggests (I like their proposal, by the way, but I can also imagine all the complications along the way).

Personally, I am more for proven concepts like using LFS databases; I do think they meet many of the desirable goals. Of course, @Toyoo's proposal is much better, but I'm afraid of custom storage that is not proven by many implementations (sorry!!). However, I still believe that this approach can be very useful!

To be honest, it seems a bit inconsistent. Regardless, isn't it true that storage nodes store data on some kind of filesystem in any case? As for LFS databases, I understand you are referring to filesystems such as WAFL, ZFS, Btrfs, F2FS, and NILFS; however, I can't comment on this immediately. I understand that, in general, there are two main ways to proceed: you may separate metadata, or you may not, which I believe aligns with what was confirmed above.

Of course, I also appreciate @Toyoo's research paper very much, which should probably be obvious considering some of my other posts here. However, even though I was an early adopter of Atari and have read this paper three times, I would prefer not to delve too deeply into technical discussions, as these are very specific topics and I am not currently ready to commit enough time to them. My current understanding is that this is an example where metadata is not separated.

Mostly, I raised my voice to draw your attention to the potential aspect of intellectual property rights, as I believe that securing such an asset could significantly strengthen your company's prospects. As for the other case, where metadata is separated, I have briefly referred to such a design in my other posts, mostly in two other recent threads about filesystems for storage nodes, and I provided a simple example in the recent thread about XFS. However, similarly to @Toyoo's paper, I would prefer not to get into specific details. I really can't commit enough time.

Just wanted to add that, should the company decide to strengthen its intellectual property portfolio, you owe me a non-alcoholic drink and a friendly chat at a nice restaurant under an open sky. :-)

I live in Poland. All I can offer is vodka.

2 Likes

That's probably a topic not allowed on this forum. I am sorry, I hardly drink vodka at all. So, a non-alcoholic beer and/or a non-alcoholic Boulevardier would make… a good start. Be my guest.

No deal then, sorry. Vodka is a starter here.

1 Like