We move customers from Lustre and Ceph to Storj, because it’s more efficient.
For Storagenode - it’s a downgrade.
Well @Alexey, I doubt it; actually, at this point I would even say that I don’t agree with you at all. It might be cheaper, as you are pressing storagenode operators very hard, but more efficient it is not. Maybe you are thinking in terms of CDNs (Content Delivery Networks) and referring to such filesystems as Lustre and Ceph. This is the only explanation of your way of thinking, with regard to what I just read, that comes to my mind. Should you provide additional information, I might change my mind.
Ceph isn’t a filesystem as such, but a software-defined storage cluster which can scale both vertically and horizontally (though most of the time it is scaled horizontally) and which by default does either EC or per-object replication.
The energy costs to run it are also quite high, as the cluster itself consists of multiple components running on many servers. You also need a redundant network of at least Nx10 Gbps to have a truly redundant, zero-downtime cluster.
Maintenance is also a thing, especially if something unexpected happens, and unless you have a support contract, or you are lucky and find a solution on the Internet, you are on your own, which in many cases might even mean losing significant amounts of data.
So I can understand that people will try to migrate away, especially in regions with huge inflation and inflated energy costs.
Running Storj on Ceph is a no-go, unless someone else is paying for the hardware and energy. I was running Storj on Ceph, but as @Alexey says, it isn’t efficient at all, because you either have to do EC or keep a per-object replica of all the data in that specific pool. Once Storj changed the payouts to more sustainable values, I was no longer able to expand the storage with node earnings.
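To put some rough numbers on that overhead, here is a back-of-the-envelope sketch (the 3x replicated and 4+2 EC layouts are just common examples, not a recommendation):

```python
# Rough overhead comparison for keeping 1 TB of storagenode data on Ceph.
# The pool layouts below (3x replication, 4+2 erasure coding) are only
# illustrative examples, not a recommendation.

node_data_tb = 1.0

# 3x replicated pool: every object is stored three times.
replicated_raw = node_data_tb * 3

# 4+2 EC pool: 4 data chunks + 2 coding chunks -> 1.5x raw usage.
k, m = 4, 2
ec_raw = node_data_tb * (k + m) / k

print(f"3x replication:     {replicated_raw:.1f} TB raw for {node_data_tb} TB of pieces")
print(f"4+2 erasure coding: {ec_raw:.1f} TB raw for {node_data_tb} TB of pieces")
# Either way, the node is paid for 1 TB while the cluster burns 1.5-3 TB
# of raw disk, plus the energy of the extra OSD hosts.
```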
No, I mean they use Storj as a private network instead of Ceph.
Well, @CutieePie, just to be picky, what is your point, if I may ask? May I ask if you can read? Could you please re-read what I wrote above in relation to Ceph, as well as your own writing?
I still do hope that @IsThisOn is going to do some serious testing. How many times have I read here that NFS causes so many problems… to the point that it is an absolutely unusable filesystem for running storagenodes.
I was not referring to this exact requirement. I meant having the network redundant, with multiple switches and multiple separate physical connections to each server (like vPC), so that in no case, such as switch maintenance or an outage, will the cluster components lose connectivity to each other. But as you say, this depends on how the cluster is configured, how many components you have, etc. In a smaller cluster this might be a significant problem, causing data unavailability and then a rebalance; in a bigger one it might be no problem at all.
This also depends on the distribution you are running Ceph on. On some of them, cephadm is not really usable, I guess. It also means another learning curve for people familiar with their existing provisioning, or unfamiliar with the more usable distributions. Then all these containers, etc., just add another layer of complexity, in my opinion, and more places where things might break. But of course your mileage may vary, and of course the most important thing in a company is the headcount. That is also the reason everyone is trying to oversimplify and automate everything.

When it breaks, however, it is game over, because the people who are used to copying and pasting those wiki commands (and in the corporate world that might mean an internal wiki that hasn’t been updated for quite a while, because the maintainer either left the company three years ago or has no time to work on it anymore), with zero understanding of how it actually works underneath, won’t be able to fix it.
I’m of course not pointing fingers at anyone, this is just my experience unfortunately.
I had an issue with some kind of migration getting stuck (I don’t remember now what exactly), and I was not able to cancel it or complete it. I tried the mailing list and got no answer, which was understandable, as nobody had faced this issue before.
Then I raised an issue in the bug tracker, hoping for some hints, and I guess it has remained untouched to this day (I stopped checking after a few months).
So again, it depends on the issue.
And so as not to sound like someone trying to undermine the qualities of Ceph: I really like the concept, I love that you can have RBDs and CephFS on one cluster, I like that it is scalable, open source and performant, that you can have placement rules, and that you can mix HDDs, SSDs and NVMe drives. But for any serious work you simply can’t be on your own. And I guess we can agree that running Storj on Ceph will bankrupt you eventually.
I still can’t stop smiling :-). Indeed, in my initial post here I was referring to the Lustre and Ceph filesystems, as that post was related to my other post today in the thread about @Toyoo’s … research paper. My focus in both of those posts was on topics related to the “metadata database” and “metadata targets”. I am very happy that you raised your voice, as my experience with Ceph is limited (I tried it twice but decided on GlusterFS, as it better suited my needs, which were not Storj-related). You mention docker / podman as an easy way to deploy Ceph; I would stand by my initial suggestion of MicroCeph, it is super easy. However, in general I share the opinion that for the needs of operating a storagenode it is a bit of an overblown system, and the default MicroCeph configuration is rather not suitable at all. Nevertheless, some of the concepts are very interesting. BTW, do you have any experience with Lustre? In many ways it seems to be lighter, at least on paper.
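For context, this is roughly what “super easy” looked like for me with MicroCeph, written up as a sketch from memory (single machine, run as root; the device path is a placeholder, and the commands should be double-checked against the current MicroCeph documentation):

```python
# Sketch (from memory) of a minimal single-machine MicroCeph bootstrap.
# The OSD device path is a placeholder -- verify every command against the
# current MicroCeph documentation before running anything for real.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["snap", "install", "microceph"])          # install the snap
run(["microceph", "cluster", "bootstrap"])     # bootstrap a one-node cluster
run(["microceph", "disk", "add", "/dev/sdb"])  # add an OSD on a spare disk
run(["microceph.ceph", "status"])              # bundled ceph client for checks
```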
I see that there are probably some misunderstandings taking place. Anyway, I was suggesting discussing / testing Lustre and Ceph as well, particularly Lustre, as a viable way to run the nodes. Well, I share your opinion about @Toyoo’s ideas expressed in his research paper. I guess the question is: are you ready to write an almost totally new, metadata-based filesystem from scratch? Or are there maybe some already existing ones, ready to be utilized almost straight out of the box? Cheers. :- )
It’s easy: set up a new node using the interesting FS and share the results. You may also try to migrate one of your nodes there, one which you would not mind losing; this experience would be much more useful for the Community, because a new node usually does not reveal FS issues right away while it is small.
There should be a way to run the same node on different configurations (different filesystems, different setups, different parameters) without being disqualified or losing online time while moving data to another disk.
I’m thinking of a test mode, not paid, on Saltlake for example, so as not to interfere with the production satellites and data.
For example, you have gathered 10-20 TB of data and you want to test different settings and filesystems, even @Toyoo’s newly proposed solution with packing. You copy the entire node to a new machine, or to several, set the storagenode to test mode, and start your testing with the same identity, without being paid, penalized for downtime, disqualified for running more than one node with the same identity, interfering with the production node, etc.
I don’t know how the client activity could be duplicated on these, though…
To limit abuse, there should be some restrictions, like not running more than X instances in test mode, or limited spots on that satellite for test-mode nodes.
That’s probably not possible: the audit worker is not aware of multiple copies, it will just ask for the piece, and if that piece is on another copy but not here, well, that audit will be considered failed.
Perhaps it’s better to write a tool which could emulate the node’s behavior with multiple TBs of data.
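A minimal sketch of what such an emulation tool could look like: it just hammers a target directory with many small, randomly sized files, the way piece traffic roughly does. The sizes, counts, naming and directory fan-out are made-up parameters, not the node’s real on-disk behavior:

```python
# Very rough piece-traffic emulator for benchmarking a candidate filesystem.
# File sizes, counts, naming and the directory fan-out are arbitrary
# assumptions for load generation, not what storagenode actually does.
import os, random, time, pathlib

TARGET = pathlib.Path("/mnt/candidate-fs/emul")   # filesystem under test
N_FILES = 100_000                                 # scale this up for multi-TB runs
MIN_SIZE, MAX_SIZE = 4 * 1024, 2 * 1024 * 1024    # 4 KiB .. 2 MiB "pieces"

def piece_path(i: int) -> pathlib.Path:
    # spread files over 1024 subdirectories
    return TARGET / f"{i % 1024:04x}" / f"piece-{i}.bin"

random.seed(42)
start = time.monotonic()
for i in range(N_FILES):
    path = piece_path(i)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        f.write(os.urandom(random.randint(MIN_SIZE, MAX_SIZE)))
    if i % 10 == 0:                               # occasional deletes, like GC
        piece_path(random.randrange(i + 1)).unlink(missing_ok=True)
elapsed = time.monotonic() - start
print(f"{N_FILES} writes in {elapsed:.0f}s ({N_FILES / elapsed:.0f} ops/s)")
```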
Or join a new node to the QA satellite, and ask to upload several TBs of test data, or do it yourself, joining it as a customer.
@littleskunk thoughts?
I just read an article on Ars Technica about ReiserFS, which was very fast, and someone said how good it was at handling many small files on old drives. Maybe it’s something to look into?
This is a dreary, abandoned thing. It dropped out of my tests at a very early stage.
You can try it and report how good it is.
Why? Do you have any explanation?
I recall running my desktop on ReiserFS around 2005-6. I remember recovering from an unclean shutdown wasn’t a pleasant experience, but other than that, I didn’t have problems.
Why don’t you check it yourself? As for Lustre, the configuration you are probably looking to test is referred to as a single-node or single-server Lustre setup, with the Metadata Server (MDS) located on an NVMe drive and the Object Storage Target (OST) on a standard drive. In the case of multiple OST drives acting as JBODs, create separate Lustre file systems and mount each of them at a separate mount location. In this setup, Lustre uses the storage device (OST) as a raw device, and the Lustre metadata and data are managed at a higher level by Lustre itself. Proceed with caution, as I might have messed up a thing or two; I did only extremely preliminary testing. (A rough sketch of this layout is at the end of this post.)

As for Ceph, in general it is super easy to get it running thanks to the MicroCeph appliance. However, I do not think that Ceph is particularly suited to the Storj use case, at least not in the foreseeable future and not in the default MicroCeph configuration, but I might be wrong. Maybe @CutieePie can provide additional info.

Personally, in general, as of now, I am more optimistic about Lustre. My major concern here is that Lustre is better suited to larger files and large environments. I am running my storagenodes on Lustre (I am just using it, I don’t do any administration) on a daily basis, but the array is all flash. On the other hand, the storagenode workload is rather very light most of the time.

To sum up, I mentioned those systems in relation to @Toyoo’s research paper :- ), as it looked to me that he was putting some emphasis on metadata there, and both of those systems are metadata-focused. What do you think about this Ceph / Lustre idea, @Toyoo? :- )
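And here is the rough sketch of that single-server layout. It is a sketch only: the device paths, fsname and NID are placeholders, and the exact mkfs.lustre / mount invocations should be checked against the Lustre manual for your release:

```python
# Sketch of the single-server Lustre layout described above:
# MGS/MDT on an NVMe device, one OST on an HDD, mounted as one filesystem.
# All device paths, the fsname and the NID are placeholders -- verify the
# exact mkfs.lustre/mount options against the Lustre manual for your release.
import os, subprocess

FSNAME = "storjfs"               # hypothetical filesystem name
MGS_NID = "192.168.1.10@tcp"     # assumed NID of this (single) server
MDT_DEV = "/dev/nvme0n1"         # metadata target on the NVMe drive
OST_DEV = "/dev/sdb"             # object storage target on a standard drive

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Format the combined MGS/MDT on the NVMe drive and the OST on the HDD.
run(["mkfs.lustre", f"--fsname={FSNAME}", "--mgs", "--mdt", "--index=0", MDT_DEV])
run(["mkfs.lustre", f"--fsname={FSNAME}", "--ost", "--index=0",
     f"--mgsnode={MGS_NID}", OST_DEV])

# Mount the server targets, then mount the filesystem as a client.
for mountpoint in ("/mnt/mdt", "/mnt/ost0", "/mnt/storagenode"):
    os.makedirs(mountpoint, exist_ok=True)
run(["mount", "-t", "lustre", MDT_DEV, "/mnt/mdt"])
run(["mount", "-t", "lustre", OST_DEV, "/mnt/ost0"])
run(["mount", "-t", "lustre", f"{MGS_NID}:/{FSNAME}", "/mnt/storagenode"])
# With multiple OSTs acting as JBODs, repeat with a different --fsname per
# disk and mount each filesystem at its own location, as described above.
```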
Because you are interested in running it on a network filesystem. We have stated already that it’s not a supported setup; however, nothing stops you from trying.
The topic is “Best filesystem for storj”, so please do not make any assumptions without a strong basis, especially since I am very far from asking you to support any of those technologies.
I do not make any assumptions, I just suggest trying it. Simple as that.
I guess we as a Community want to come to a conclusion about what the best FS for a storagenode is. I cannot imagine how this could be proven without practice and real tests.