Best filesystem for Storj

:rofl:

I recommend prewarming the cache, if you add it from scratch, by running fsck. It’s faster than waiting for the cache to fill itself with metadata, as fsck knows the file system layout and can read inodes sequentially. In theory this should also leave the cache itself better organized, though I don’t have a practical way to verify that. It is, however, the easiest way to check exactly how much cache you need at a given point.
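For ext4 that can be as simple as a forced read-only check, something like this (the device path is just an example):

```bash
# Forced read-only check: walks all inodes/metadata sequentially, which
# gives a freshly added cache a chance to pick up the metadata.
# -f forces the check, -n answers "no" to all prompts (nothing is modified).
sudo e2fsck -f -n /dev/mapper/vg0-storagenode
```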

2 Likes

A random partial data point.

I’m migrating two 7TB nodes using rsync. Looks like it’s going to take… weeks… One drive is ext4 and the other is zfs with an l2arc of metadata only.

While the ext4 drive often shows a higher “read speed”, the zfs drive is a bit ahead on the sync: 2TB vs 1.5TB.

This is kind of expected… in my tests ZFS is in general several times slower than ext4… (no special device, though).

Yes, ZFS is slow with rsync. The last time I moved an 8TB node it took me almost 4 weeks.

How full was the zfs filesystem? Above 80 percent you get a massive performance hit.

As far as I remember, 95% was assigned to Storj and it was full… :roll_eyes:

However, at that time I was testing running 2 nodes per HDD, and it was still able to handle that load. I am not doing this anymore, so please no ToS discussion. :wink:

1 Like

Above 80% is when ZFS starts allocating space more carefully. You may not even notice any performance change until you’re at 90% or higher.

So 80% is when there’s a change in filesystem behaviour, not when you “get a massive performance hit”.

The real question is whether avoiding the performance hit is worth leaving storage empty and unused for Storj…

The performance a node requires is capped by the speed of the Internet connection, which is always much slower than what even a full ZFS filesystem on an HDD can deliver. Leave a couple hundred GB free, but otherwise fill things up for max rewards.
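If you want to keep an eye on how full and fragmented a pool is getting, something like this works (the pool name is just an example):

```bash
# Show fill level and fragmentation for a pool.
zpool list -o name,size,allocated,free,capacity,fragmentation tank
```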

Ok (20 chars or not 20 chars)

I’m moving 20-ish nodes from ext4 / xfs to zfs. How long it takes really varies. The main factors are the source disk, the destination disk, and the copy method used (interestingly, rsync is faster than rclone copy when the source disk is an SSD; with an HDD as the source it is usually the opposite).

I usually try to move the filesystem as a whole to an SSD first if possible (<4TB used disk space), and then from the SSD to the zfs filesystem on the HDD. In that case I often get speeds in the range of 30-150MB/s, while a direct copy only manages 3-8MB/s.

2 Likes

(Editing this after looking at my notes)

Are you doing something like dd/ddrescue’ing the source partition to an image file on your SSD, then loopback-mounting it… then copying from that temporary mount to your destination HDD? I could see that being fast, as the dd should be a sequential transfer running at pretty much the max speed of your disk.

1 Like

Yup, something like that. However, I use e2image in order to create sparse files, so I can transfer filesystems bigger than my 4TB disk.
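Roughly like this, with device and mount paths as placeholders:

```bash
# Stage 1: image the source ext4 partition to a sparse file on the SSD.
# -r writes a raw image, -a includes file data; only allocated blocks
# are copied, so the image file stays sparse.
sudo e2image -ra /dev/sdb1 /mnt/ssd/node.img

# Stage 2: loop-mount the image read-only and copy onto the zfs destination.
sudo mkdir -p /mnt/imgsrc
sudo mount -o loop,ro /mnt/ssd/node.img /mnt/imgsrc
sudo rsync -aH --info=progress2 /mnt/imgsrc/ /tank/node1/
```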

2 Likes

Today I finished moving a small node from ext4 to my new zfs setup with rsync. This time no network was involved; both disks were local. It was still rather slow, roughly 1TB/day. The random-read performance of the source HDD seems to be the limit.

2 Likes

Another data point, or anecdote really.

My migration of two nodes to new drives got interrupted, so the first half was done with rsync and the second half with rclone, which tries to run four parallel transfers.

It’s been 10 or 14 days or so working on just the initial copy.

The zfs source drive finished during this time and copied 5.8TB.
The ext4 source drive is still running over the same time and has only done 2.8TB.

The destination drives were identical 14TB 3.5" drives with ZFS.

So in my migration case, zfs was faster. It probably helped that I had fairly ample caching: 24GB committed to the ARC (shared amongst a few drives) and also a metadata-only l2arc on SSD.
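For reference, the rclone half of the copy was roughly this (paths are placeholders; --transfers 4 is rclone’s default, shown here explicitly):

```bash
# Copy with four parallel file transfers; --checkers sets how many files
# are compared at a time, --progress prints a live summary.
rclone copy /mnt/old-node/storage /mnt/new-node/storage \
    --transfers 4 --checkers 8 --progress
```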

I found rsync --no-inc-recursive to be faster than regular rsync for moving nodes on ext4, and this would match your observations, as the --no-inc-recursive flag effectively pre-warms the metadata caches.
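Something like this, with placeholder paths:

```bash
# --no-inc-recursive builds the complete file list up front instead of
# incrementally, so all directory metadata is walked in one pass.
# -a preserves attributes, -H preserves hard links.
rsync -aH --no-inc-recursive --info=progress2 \
    /mnt/old-node/storage/ /mnt/new-node/storage/
```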

4 Likes

Update on SSD wear for the LVM+ext4 read cache. I’ve been logging my SMART TBW daily. Right now I see two distinct periods:

  1. July-Aug: when I was migrating from ext4 & ntfs to LVM, with multiple sync jobs running.
  2. Since August (10 days): unattended normal operation.

Total is ~25TB used out of 50TB allocated across 4 nodes/disks.
Here’s a chart of daily GBs written.

The peaks at the end of July were the final move operations, where I moved entire nodes from cached disks to another disk, to their final location, pre-allocated some new vhdx caches, etc. The peaks in the last few days are likely TTL deletions.

My SSD is a consumer Kingston KC3000 2TB, currently with nothing on it but the caches. It is rated for 1.6 PBW, which translates to 894GB/day over 5 years.

All else being equal, at 130GB/day the SSD should support the current cache usage for 30 years.
When full, assuming cache usage rises to 300GB/day, it should still last 14 years.

Given all the benefits of this setup (i.e. it solved everything) I’m definitely sticking with it without worry.

Last thing: I’m currently using oversized 256GB caches (18GB/TB). From my previous post I estimate that 128GB would be more than enough (4-8GB/TB). At some point I’ll recreate the caches and compare again.
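For anyone who wants to do the same kind of logging, a minimal sketch (the device path is a placeholder; my numbers above come from my own logs, not this exact command):

```bash
# Append today's NVMe write counter to a log file, e.g. from a daily cron job.
# "Data Units Written" is reported in units of 512,000 bytes.
echo "$(date +%F) $(sudo smartctl -A /dev/nvme0 | grep 'Data Units Written')" >> ~/ssd-tbw.log
```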

7 Likes

I don’t have all the numbers and analysis, but I would like to share my relative success with zfs + l2arc.

I have 8 nodes on 8 disks, totalling about 43TB. The highest amount stored on a single disk is 12TB.

Each disk is formatted zfs, with the following characteristics:

  • compression=on (lz4)
  • secondarycache=metadata
  • redundant_metadata=some
  • atime=off

I have up to 26GB of RAM allocated to the ARC. According to arc_summary, the cache hit rate is 97-98%.

I also have a single SSD that I’m using for the l2arc. It’s a 5-year-old used MLC enterprise SAS drive; with its endurance it will last until the heat death of the universe. I have it split into 8 partitions, created by hand, which I use as a separate cache for each drive, allocating 5GB per TB of disk space. So far the l2arc data gets compressed, so I have not filled up the partition for any drive.
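For reference, the per-disk setup boils down to something like this (pool and device names are placeholders):

```bash
# One pool per disk, with the properties listed above.
zpool create tank1 /dev/disk/by-id/ata-EXAMPLE-HDD
zfs set compression=lz4 tank1
zfs set secondarycache=metadata tank1   # the l2arc holds metadata only
zfs set redundant_metadata=some tank1
zfs set atime=off tank1

# One hand-made SSD partition added as that pool's L2ARC device.
zpool add tank1 cache /dev/disk/by-id/ata-EXAMPLE-SSD-part1
```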

The performance so far has been… pretty good! Disks used to be pegged at 100% on pretty much any activity. Now the only drives showing over 90% utilization while running filewalkers are the two that are more than 50% fragmented. The rest are busy, but more chill.

The SSD actually gets worked pretty hard when all the filewalkers kick off at once; it shows around 50% busy when they all start up. The write activity to the SSD is very modest, partly because the l2arc only writes metadata at a limited rate, and also because things like filewalkers are heavily read-intensive, so very little gets written to the SSD once the metadata has been populated into the l2arc.

According to arc_summary, the l2arc is nominally 1.3TB but compresses down to 285GB on the disks, and it requires 3.5GB of RAM for the headers. The reported hit rate is usually around 70-80%.

But anyway, filewalkers seem to finish in a more reasonable amount of time (hours instead of days), and the drives seem capable of keeping up with the high incoming test traffic days.

6 Likes

This is an older comment, but there was a bug that made ext4 slow in Linux 6.5, which was fixed in 6.8. You may be running into that.

@MarkRose thanks for the heads-up.

I will try in the near future to mount the data/db files on flash and the block storage on the HDD; this would also be the recommended setup for a Bitcoin node.
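A minimal sketch of what I have in mind, assuming the storagenode’s storage2.database-dir option and placeholder paths:

```bash
# Keep the node databases on flash while the blob storage stays on the HDD.
sudo mkdir -p /mnt/ssd/storagenode-dbs
# then point the node at it in config.yaml:
#   storage2.database-dir: /mnt/ssd/storagenode-dbs
```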