Best filesystem for Storj

For me the BTRFS results look too good to be true for a storagenode, since we have multiple confirmations that BTRFS is almost the worst FS for a storagenode: Topics tagged btrfs

For me too, especially since an additional layer of complexity is involved. Although I think it mostly has to do with fragmentation of the metadata due to the COW nature, especially once the node is a few months old. And I think these benchmarks would be most interesting to perform at different 'life cycles' of a node.
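
For what it's worth, data-side fragmentation of an aging node is easy to eyeball, even if metadata fragmentation itself is much harder to observe directly. Something like this (the path is only an example of a typical blobs layout) lists the most fragmented files of a small sample:

find /mnt/storagenode/storage/blobs -type f -name '*.sj1' | head -n 1000 | xargs filefrag | sort -t: -k2 -rn | head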

How about Lustre and Ceph? Any chance to adjust those filesystems for storagenode purposes, or are they rather a far cry for us? Lustre supports both ext4 and ZFS, but it might not be an easy setup. Ceph is supported by Canonical LXD / MicroCloud, but its default setup is rather not suitable straight out of the box. Disclaimer: I do see some potential benefits, but those filesystems were designed for rather large environments. Anyway, apart from NFS, are you maybe considering doing some testing on them in the foreseeable future @IsThisOn? :- )

We move customers out of Lustre and Ceph to Storj, because it's more efficient.
For a storagenode it's a downgrade.

Well @Alexey, I doubt it; actually, I would even say that at this moment I don't agree with you at all. It might be cheaper, as you are pressing storagenode operators very hard, but more efficient it is not. Maybe you are thinking in terms of CDNs (Content Delivery Networks) and referring to such filesystems as Lustre and Ceph. That is the only explanation for your way of thinking, with regard to what I just read, that comes to my mind. Should you provide additional information, I might change my mind.

Ceph isn't a filesystem as such, but a software-defined storage cluster which can scale both vertically and, most of the time, horizontally, and which by default does either EC or per-object replication.
The energy costs to run it are also quite high, as the cluster itself consists of multiple components running on many servers. You also need a redundant network of at least Nx10 Gbps to have a truly redundant, zero-downtime cluster.
Maintenance is also a thing, especially if something unexpected happens, and unless you have a support contract, or you are lucky and find a solution on the Internet, you are on your own, which in many cases might even mean losing significant amounts of data.
So I can understand why people try to migrate away, especially in regions with huge inflation and inflated energy costs.
Running Storj on Ceph is a no-go, unless someone else is paying for hardware and energy. I was running Storj on Ceph, but as @Alexey is saying, it isn't efficient at all, because you either have to do EC or per-object replication of all the data in that specific pool. Once Storj changed the payouts to more sustainable values, I was no longer able to expand the storage with node earnings.
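
To put a rough number on that overhead: with a 4+2 erasure-coded pool, every 1 TB the node gets paid for occupies about (4+2)/4 = 1.5 TB of raw disk, and with the default 3x replication it occupies 3 TB, before even counting Ceph's own metadata and the extra OSD/MON/MGR hosts.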

Just to be picky, and to save confusion: Ceph isn't a filesystem, it is a reliable, distributed, self-healing, self-managing, intelligent block storage system.

On top of Ceph RADOS, you can use the native RBD images that most people are familiar with to map a virtual block device via native RBD clients, or via iSCSI gateway services.

There is also the Ceph File System, CephFS, which allows for huge scale and metadata management of the filesystem.

We also have the native NFS server gateway and the S3-compatible API gateway, all built on top of RADOS.

… As for which file system, I have a few Storj nodes running on my dev Ceph cluster (8 hosts), and have settled on an erasure-coded pool with 4 data chunks and 2 coding chunks, to give the right level of read performance while not causing too much write amplification. The RBD image is tuned to a segment size of 2 MB, which again is hard on the cluster, but it seems to be the sweet spot on my setup.
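
For anyone wanting to reproduce a similar layout, a rough sketch of that 4+2 pool plus a 2 MB-object RBD image could look like the following; all pool, profile and image names are placeholders and the sizes are only examples:

ceph osd erasure-code-profile set storj-ec k=4 m=2 crush-failure-domain=host
ceph osd pool create storj-ec-data erasure storj-ec
ceph osd pool set storj-ec-data allow_ec_overwrites true   # required for RBD on EC pools
ceph osd pool create storj-rbd-meta
rbd pool init storj-rbd-meta
rbd create storj-rbd-meta/node1 --size 8T --data-pool storj-ec-data --object-size 2M

The image metadata lives in the small replicated pool, while the actual data chunks land in the EC pool.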

After going through many filesystems on Ceph RBD images on native clients, ext4 has been consistently the most stable. I have also tried XFS a few times, but it consistently ends up performance-degraded after 6+ months in operation.

As for Ceph's suitability to host storage nodes, it really isn't ideal :smiley: node operators should either stick to one disk per node, or, if you want something a bit more reliable and have the $$, go with ZFS.

No, I mean they use Storj as a private network instead of Ceph.

Again, just to be picky :stuck_out_tongue: that was a legacy requirement, and it is still a recommendation to separate the client traffic from the inter-node traffic - however, the stability and availability of the cluster no longer requires a redundant network.

Also, the cluster size recommendations have been updated - while previously you could spin up a semi-stable cluster on 3 nodes, the recommendation is now a minimum of 5 for a production cluster.

Agreed, the legacy products were complex - however, you will find that from v18 many of the default options are being adjusted to better match the hardware people are choosing. Also, with the new cephadm deployment model and containerised modules and managers, the ongoing management is being simplified, although there is still a way to go… (ignoring the current GUI rebase :cry: )

I don't think you are on your own, that's a bit dramatic :slight_smile: The Ceph project has a really active community on the newsgroup, and they are always happy to look into issues logged on the tracker. Unfortunately a lot of the commercial solutions are behind a support paywall, and many of the solutions you will find on the internet can be dangerous.

As for losing significant amounts of data, that's not a Ceph issue - that's a user issue: following instructions on the internet that may or may not be suitable for their environment, like many things :stuck_out_tongue: a little knowledge is dangerous.

#edit - sorry, that sounded like I was blaming someone, and I'm not… The previous defaults would allow, as an example, a 3+2 erasure-coded pool with an unsafe minimum of data chunks, which gave admins very little warning that something bad was about to happen - usually the first signs were huge Ceph backfills and rebalancing… The new defaults take the pool offline before all of the m (2) coding chunks have been exhausted, to give the admins time to fix things and also to prevent a wave of backfills, at the expense of the disks being offline for client applications. Again, EC pools with only m=2, and replication 3, are not ideal for critical production workloads.
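
If I read that behaviour right, it maps onto the pool's min_size: on a 4+2 EC pool the newer default is k+1 = 5, so I/O stops while one shard of redundancy still remains instead of running with none. Something like this (the pool name is a placeholder) shows or pins it:

ceph osd pool get storj-ec-data min_size
ceph osd pool set storj-ec-data min_size 5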

Well, @CutieePie, just to be picky, what is your point if I may ask? May I ask if you can read? Could you re-read what I wrote above in relation to Ceph, please? As well as re-read your own writing please? :slight_smile:

I still do hope that @IsThisOn is gonna do some serious testing. How many times have I read here that NFS is causing so many problems… to the point that it is an absolutely unusable filesystem for running storagenodes. :slight_smile:

I was not referring to that exact requirement. I meant having the network redundant, with multiple switches and multiple separate physical connections to each server (like vPC), so that in no case, such as switch maintenance or an outage, do the cluster components lose connectivity to each other. But as you say, this depends on how the cluster is configured, how many components you have, etc. In a smaller cluster this might be a significant problem, causing data unavailability and then a rebalance; in a bigger one this might be no problem at all.

This also depends on the distribution you are using to run Ceph on. In some, cephadm is not really usable, I guess. It also entails another learning curve for people familiar with existing provisioning, or unfamiliar with the more usable distributions. Then all these containers etc. just add another layer of complexity, in my opinion, and more places where things might break. But of course your mileage may vary, and of course the most important thing in a company is the headcount :wink:. That is also the reason everyone is trying to oversimplify and automate everything. When it breaks, however, then it is game over, because the people who used to copy and paste those wiki commands (and in the corporate world it might mean the internal wiki hasn't been updated for quite a while, because the maintainer either left the company three years ago or has no time to work on it anymore), with zero understanding of how it actually works underneath, won't be able to fix it.
I'm of course not pointing fingers at anyone; this is just my experience, unfortunately.

I had an issue with some kind of a migration getting stuck (I don't remember now what exactly), and I was not able to cancel it or complete it. I tried the mailing list and got no answer, which was understandable, as nobody had faced this issue before.
Then I raised an issue in the bug tracker to maybe get some hints, and I guess it has remained untouched to this day (I stopped checking after some months).
So again, it depends on the issue.

And so as not to sound like someone who is trying to undermine the qualities of Ceph: I really like the concept, I love that you can have RBDs and CephFS on one cluster, I like that it is scalable, open source and performant, that you can have placement rules, and that you can have a mix of HDDs, SSDs and NVMes, but for any serious work you simply can't be on your own. And I guess we can agree that running Storj on Ceph will bankrupt you eventually.

ha :slight_smile:

I wish we could change the name… it is indeed confusing.

CephFS sits on top of RADOS (Ceph)… but RBD doesn't need CephFS :confused: they honestly are really bad names, and confusing.

I think we both agree :stuck_out_tongue:

Ceph - Ceph is not a filesystem

but

CephFS - a POSIX-compliant file system running with metadata servers (MDS), with RADOS storing the metadata and data on Ceph :cry:

Ah ok, again this can be designed out using the crushmap export and customization - you can define racks and switches for the OSDs, hosts, etc., so Ceph knows how to distribute the data to allow for rack and switch failure, depending on how you have configured your pool's custom encoding policy… if it's at the default failure domain of OSD, then RIP cluster :smiley: … but agreed on small installs… well, if it's all in one rack with a single switch, maybe it's safer to walk away :slight_smile:
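
To make that concrete, a minimal sketch of teaching CRUSH about the physical layout might look like this (bucket and host names are placeholders, and a real cluster would repeat this per rack and host):

ceph osd crush add-bucket rack1 rack
ceph osd crush move rack1 root=default
ceph osd crush move host1 rack=rack1
ceph osd erasure-code-profile set ec-rack k=4 m=2 crush-failure-domain=rack   # spread shards across racks, not just OSDs

Pools created from a profile like that can survive the loss of a whole rack or switch rather than just a single OSD.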

Hmm yes, it can be hard with systems on air gaps - but you can work around this with your own satellite servers (not Storj ones :stuck_out_tongue: ). I used to be a fan of the old install scripts and yum installs of Ceph and its components… but the Ceph upgrades were terrifying; even point releases would be sweaty, as things would break, and no one likes a half-upgraded Ceph cluster :smiley:

Podman/Docker Ceph installs and upgrades are beautiful to watch :smiley: the Ceph MGRs control all the orchestration, and you have complete control over the cluster release level… From a dev viewpoint, you have exact version control over the release being run in the cluster… no need to annoy a server person to "do stuff"
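
For reference, on a cephadm-managed cluster a containerised upgrade boils down to something like this (the version number is only an example):

ceph orch upgrade start --ceph-version 18.2.2
ceph orch upgrade status

The managers pull the matching container images and roll the daemons one by one.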

Yes, this is unfortunately a thing in some organizations - usually a result of poor management and a lack of strong procedures. The end result is that good engineers end up leaving or are displaced, and the teams left behind are put under pressure - not all companies are like that though, there are good ones out there :smiley:

Oh yes, there were some "interesting issues" since v16 - the main ones were with aarch64: there was a checksum code issue that caused the OSD to fail when running checksums against the DB on start-up, which made the cluster get stuck half-upgraded - this was due to a change in the upstream 64-bit code used. I had a few clusters impacted by that, but it was fixed in a few days.

There was also a cross-architecture regression, where mixed x86_64 and aarch64 clusters would get stuck in a loop upgrading the mgrs - the quick fix was to migrate the mgr to the same arch as the lowest-ID OSD hosts, although a fix was released.

The Ceph upgrades are always cancellable… That task is handled by the primary mgr, so a ceph orch upgrade stop and a ceph orch upgrade status should do the trick. You may find that the mgrs on older releases would disconnect, so if they are ignoring the commands you can either force a failover with ceph mgr fail (ensuring you have one in standby), or, as a more brutal method, remove the mgr from the host and redeploy, either with:

ceph orch apply mgr 5 <- would be 5 random mgrs

or

ceph orch apply mgr --placement "hostname1 hostname2 hostname3"

But yeah, it used to be hard; upgrades were unpleasant and had to be rehearsed for weeks :smiley: the scripts used in v18+ are more friendly, so if something is going to break, the upgrade stops before taking your cluster offline, and in some cases it will not even start if you have a config which will break.

It's not for everyone; its strengths are its distribution and data resilience, and I agree it's not something you can do seriously by yourself without support.

#Sorry I've de-railed this thread…

#Best FS : Ext4

#NoMoreCephFS

I still can't stop smiling :-). Indeed, in my initial post here I was referring to the Lustre and Ceph filesystems, as my post was related to my other post today in the thread about @Toyoo's … research paper. My focus in both of those posts was on topics related to the "metadata database" and "metadata targets". I am very happy that you raised your voice, as my experience with Ceph is limited (I tried it twice but decided on GlusterFS as it better suited my needs (not Storj-related needs)). You are mentioning Docker / Podman as an easy way to deploy Ceph. I would stand by my initial suggestion of MicroCeph. It is super easy. However, in general, I share the opinion that for the needs of operating a storagenode it is a bit of an overblown system, and the default MicroCeph configuration is rather not suitable at all. Nevertheless, some concepts are very interesting. BTW, do you have any experience with Lustre? In many ways it seems to be lighter, at least on paper.

Maybe. As said before, wer misst, misst Mist (who measures, measures rubbish).
Coming up with a benchmark that has some meaningful implications is extremely hard.
Filesystems are freakishly complicated, and things like fragmentation are very hard to test.

The only "easy" benchmark I can think of is how long it takes to run the filewalker. This also seems to be the benchmark people care about the most, which I think is strange considering how infrequently it happens, but I use my ZFS host and not some Pi; maybe that is why I don't get it.
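
If someone does want to compare that across filesystems, a crude stand-in for the used-space filewalker is just walking and stat-ing every blob with cold caches; the path is only an example, and this merely approximates what the node does:

sync; echo 3 | sudo tee /proc/sys/vm/drop_caches   # drop caches so the run is cold
time find /mnt/storagenode/storage/blobs -type f -printf '%s\n' | wc -l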

Testing of what?

It does not. Why should it? Makes no sense. My NFS-connected share, which is backed by ZFS, is way, way, way more stable than an NTFS filesystem. Disclaimer: the DBs are on local NVMe and the NFS share has sync disabled.
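
Roughly how I read that setup, as a sketch; the dataset, export and paths are placeholders, and the database-dir option is the usual way to keep the node's databases off the share:

# on the ZFS/NFS server
zfs set sync=disabled tank/storagenode
zfs set sharenfs='rw=@192.168.1.0/24' tank/storagenode
# in the node's config.yaml, keep the SQLite databases on local NVMe
storage2.database-dir: /mnt/nvme/storagenode-db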

Instead of running obscure benchmarks, I find @Toyoo's ideas on how to lower node requirements by handling things completely differently more interesting. That is what would really get us to running cheap storage nodes on something like a Pi.

I see that there are probably some misunderstandings taking place. Anyway, I was suggesting discussing / testing Lustre and Ceph as well, particularly Lustre, as a viable way to run the nodes. Well, I share your opinion about @Toyoo's ideas expressed in his research paper. I guess the question is: are you ready to write an almost totally new metadata-based filesystem from scratch? Or are there maybe any already existing ones, ready to be utilized almost straight out of the box? Cheers. :- )

It's easy - set up a new node using the interesting FS and share the results. You may also try to migrate one of your nodes there, one which you would not mind losing; this experience would be much more useful for the Community, because a new node usually does not expose FS issues right away while it is small.
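
For completeness, spinning up such a throwaway node is just the usual documented docker run pointed at the experimental filesystem; every value below is a placeholder:

docker run -d --restart unless-stopped --stop-timeout 300 \
  -p 28968:28967/tcp -p 28968:28967/udp \
  -e WALLET=0x... -e EMAIL=you@example.com -e ADDRESS=node.example.com:28968 -e STORAGE=2TB \
  --mount type=bind,source=/mnt/testfs/identity,destination=/app/identity \
  --mount type=bind,source=/mnt/testfs/storagenode,destination=/app/config \
  --name storagenode-test storjlabs/storagenode:latest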

There should be a way to run the same node on different configurations - different filesystems, different setups, different parameters - without being disqualified or losing online time while moving data to another disk.
I'm thinking of a test mode, not paid, on Saltlake for example, so as not to interfere with the production satellites and data.
For example, you have gathered 10-20 TB of data and you want to test different settings and filesystems, even @Toyoo's newly proposed solution with packing. You copy the entire node to a new machine, or to several, set the storagenode to test mode, and start your testing with the same identity, without being paid, penalised for downtime, disqualified for running more than one node with the same identity, interfering with the production node, etc.
I don't know how the clients' activity could be duplicated on these, though…
To limit abuse, there should be some limits, like not running more than X instances in test mode, or limited spots on that satellite for test-mode nodes.

That's probably not possible - the audit worker is not aware of multiple copies; it will just ask for the piece, and if that piece is on another copy, but not here - well, that audit will be considered failed.

Perhaps it's better to write a tool which could emulate a node's behavior with multiple TBs of data (a rough sketch of that idea is below).
Or join a new node to the QA satellite:

And ask to upload several TBs of test data, or do it yourself by joining it as a customer.
@littleskunk thoughts?
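
On the emulation idea above, a very rough sketch could be a script that fakes a two-level blob-style layout with a few hundred thousand small files and then times metadata operations against it; the layout, sizes and paths here are invented for illustration, not the node's real on-disk format, and the loops would need scaling up to reach multiple TBs:

#!/bin/bash
# generate a fake blob tree: 100 prefix directories x 1000 files of up to ~2 MB each
root=/mnt/testfs/fake-blobs
for prefix in $(seq -w 0 99); do
  mkdir -p "$root/$prefix"
  for i in $(seq 1 1000); do
    head -c $((RANDOM % 2048 * 1024 + 4096)) /dev/urandom > "$root/$prefix/piece-$i.sj1"
  done
done

Running the cold-cache find/stat timing from earlier against such a tree on different filesystems would at least compare the metadata-heavy part of the workload.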

I just read an article on Ars Technica about ReiserFS, which was very fast, and someone said how good it was at handling many small files on old drives. Maybe it is something to look into?