I am not familiar with either. What I do find interesting, though, is that Ceph already has object storage, so if there were a version of the storage node that basically acted as a proxy, storing each piece as an object in Ceph, that approach might be reasonable.
Unfortunately, I can't add more to the discussion on Lustre and Ceph at the moment. I am linking to the posts by @zip [link] and @CutieePie [link] if anybody is interested in reading a bit more about Ceph. (BTW, sorry @zip for not asking about your opinion earlier, it was not on purpose.)
Is there any tool, other than copying the disk contents to a new drive, to improve the inode layout on ext4?
I'm experiencing this issue now too; my primary application is running a Bitcoin Core (bitcoind) node.
After deleting and refilling about 800GB, Storj suddenly generates insane amounts of I/O wait.
It reads at 9MB/s from the HDD, and time bitcoin-cli -netinfo takes anywhere between 30 seconds and 2-3 minutes.
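In case anybody wants to reproduce the numbers, a quick way to confirm the stall is really I/O wait on that disk is the sysstat tools (the device name below is just an example):

```
# extended per-device statistics every 5 seconds: watch %util, await and r/s on the node's HDD
iostat -x /dev/sda 5

# per-CPU breakdown every 5 seconds: the %iowait column shows time spent waiting on the disk
mpstat 5
```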
No. I was looking for something like that, and the best you can do is to manually reassign inodes to files using debugfs. But that would indeed be a great tool for nodes on ext4. Writing one is somewhere on my radar, but it's not trivial to do without accidentally messing up the whole file system.
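Not the reassignment itself, but for anyone curious, inspecting where a file's inode and data blocks ended up is already possible today (the device and blob path below are just illustrative placeholders):

```
# show the inode number, size and extent tree of one blob file (read-only inspection)
sudo debugfs -R "stat /blobs/aa/example.sj1" /dev/sdb1

# or, through the mounted file system, print the physical extent layout
sudo filefrag -v /mnt/storagenode/storage/blobs/aa/example.sj1
```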
What is the solution to resolve it?
My HDD has 3TB, of which only 2TB is used for the storage node.
I can copy the storage node data to another (old) drive with ext4,
but that will not prevent the storage node from degrading again
at a later point in time.
No solution for now. The file system does its best to place inodes of a single directory close to each other based on the information it has at the time of file creation, but it's not perfect from the storage node's perspective.
I really don't know exactly how the file-walker enumerates the files,
why it used to generate more or less sequential reads, and why after the fragmentation it produces a "storm" of random-read operations.
I think the only solution would be to optimize the file-walker, or whatever process is generating this random-read load, by simply sorting its requests to match the data layout on disk.
Maybe by region or block date?
The same could be done in reverse too:
the data layout on disk could be reorganized to produce more sequential reads for the file-walker.
Even if the file-walker allowed large batches that could first be sorted and then read from disk, that would help.
I can only speculate; in the end, it is currently not practical to use a single HDD as a shared drive.
It seems I have resolved my issue, though maybe it is too early to say.
When I mounted the second drive, I saw that the first one was mounted with only the default options
and the relatime
option was not set.
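For reference, this is roughly what the fix looks like; the UUID and mount point are placeholders, and noatime would avoid access-time writes entirely if you want to go further than relatime:

```
# /etc/fstab entry for the node drive
UUID=xxxx-xxxx  /mnt/storagenode  ext4  defaults,relatime  0  2

# or apply it immediately without rebooting
sudo mount -o remount,relatime /mnt/storagenode
```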
The kernel's deadline scheduler will do most of this work already. Unfortunately it's still not enough. I have a piece of experimental code that iterates over files in inode number order (on an ext4 file system the inode number directly refers to the physical location). It didn't bring much of an improvement, and I suspect it would degrade performance on other file systems. An additional problem here is the directories: the location of their indices is not exposed to user space in any way, so no userland program can try optimizing those read patterns.
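For anyone who wants to experiment with the idea themselves, a very rough shell approximation of that inode-order iteration (the path is just an example): list the blob files with their inode numbers and sort by them, so a later stat/read pass touches the disk in roughly physical order.

```
# emit "<inode> <path>" for every blob file and sort by inode number
find /mnt/storagenode/storage/blobs -type f -printf '%i %p\n' | sort -n > /tmp/files-by-inode.txt
```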
I moved the btc chain data to another new (old) drive.
The storj node now has the drive all by itself.
The node, with 2TB of data on a 3TB drive, read the files in the blob storage directory at around 11MB/s for ~20 hours; 11MB/s is about the maximum random-read throughput of a 7200rpm HDD.
The forgotten relatime mount option helped reduce the write I/O to zero.
Currently it is more or less idling.
I checked the file access with strace -p $pid -f
it is a non-issue for me,
but it would still be nice to know an optimal filesystem configuration
that does not involve overengineered storage arrays.
There seems to be a mixture of art and science to know what combo of recordsize/small-file-support etc. should be set to properly size a ZFS special metadata device for a Storj node… but it seems ideal. It soaks up all filewalker/metadata-type workloads, provides consistent acceleration (because it's always used: you're not hoping something is in ARC/L2ARC), you don't need the extra RAM an effective ARC requires, and it speeds up metadata writes too (not just reads like ARC/L2ARC).
I need more time to play with it myself. Any sort of performance testing can take so long when your sample data for a node is millions-of-files-per-TB.
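For anyone who hasn't set one up before, this is roughly what it looks like; the pool, dataset and device names are placeholders, and the special_small_blocks value is exactly the part that takes experimentation. Note that a special vdev is pool-critical, so it should be mirrored.

```
# add a mirrored special vdev to hold metadata (and optionally small blocks)
sudo zpool add tank special mirror /dev/nvme0n1p1 /dev/nvme1n1p1

# route blocks of 64K or smaller to the special vdev for the node dataset
sudo zfs set special_small_blocks=64K tank/storagenode

# double-check the interplay with recordsize
sudo zfs get recordsize,special_small_blocks tank/storagenode
```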
I have two 8TB drives. One is formatted ext4, the other zfs. Both are mostly full with storj data.
The ext4 drive is showing 100% busy all the time in the Linux atop command. The zfs drive is often lower, like 50%. This is even when the filewalker is finished.
Each drive is its own node, so there are probably some differences in node traffic,
but zfs is the winner between those two drives.
No elaborate zfs caching or SSD use, besides having 18GB of RAM working as the ARC for this and a few other drives.
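In case anyone wants a similar setup: capping the ARC at a fixed size is just an OpenZFS module parameter (18 GiB here only as an example matching the number above):

```
# cap the ARC at 18 GiB for the running system
echo $((18 * 1024 * 1024 * 1024)) | sudo tee /sys/module/zfs/parameters/zfs_arc_max

# make the cap persistent across reboots (modprobe-style configuration)
echo "options zfs zfs_arc_max=$((18 * 1024 * 1024 * 1024))" | sudo tee /etc/modprobe.d/zfs.conf
```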
I'm also inclined to think so for all non-SSD storage.
So, I'm reverting all my nodes to this config. Will take a week or two…
Persistent L2ARC on SSD might be a second-best.
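If anyone wants to try that route, a minimal sketch (pool and device names are placeholders); persistent L2ARC needs OpenZFS 2.0 or newer, where it is controlled by a module parameter that already defaults to on:

```
# add an SSD partition as an L2ARC cache device for the pool
sudo zpool add tank cache /dev/nvme0n1p2

# ensure L2ARC contents are rebuilt (persist) across reboots
echo 1 | sudo tee /sys/module/zfs/parameters/l2arc_rebuild_enabled
```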
What's the RAM usage for both setups (ext4/ZFS)?
Or is it the same system?
Same system for both drives.
The system has 34G RAM total; Linux shows 166M being used for cache and 2.3G used for buffers (there is still 1.3G showing free). That is for the only two ext4 drives, which are the minimal boot drive and the 8TB ext4 storj drive.
And then 24G of zfs ARC for four zfs drives, two storj and two non-storj.
Load-wise, I didn't realize that the ext4 drive was running a filewalker that is taking multiple days, so I guess I should say the observed load is lower on zfs while running a filewalker, but I don't have an "idle" utilization number yet.
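For comparing apples to apples, the actual ARC footprint can be read straight from the kernel counters (assuming OpenZFS and its arc_summary tool are installed):

```
# human-readable summary of ARC size, target and hit rates
arc_summary | head -n 40

# raw counters: "size" is the current ARC in bytes, "c_max" the configured maximum
awk '$1 == "size" || $1 == "c_max"' /proc/spl/kstat/zfs/arcstats
```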
Have not seen a lot of attention given to ext4+lvmcache, so I'm giving it a shot after the disaster that is my 7TB NTFS node constantly at 100% I/O.
My initial impressions so far are very positive; a rough sketch of the commands is at the end of this post.
Setup:
- Ryzen 3700X, 64GB RAM, Windows 11 host + Hyper-V Ubuntu 24.04 with 24GB RAM & docker nodes.
- Seagate Mach2 14TB SATA with ext4 + 128-byte inodes + noatime. VM passthrough (so I can leave Windows later if no longer needed).
- 200GB fixed VHDX allocated on 2TB NVMe - one per node.
- Test data: 2.3TB with 22 million storj files. du + sync + clearing RAM pages/caches before each measurement.
Goals:
- Single drive nodes, up to 4x14TB.
- No hard dependency on the SSD. All slots on this PC are already populated + want to move stuff around easily and safely.
- Lightweight with low RAM requirement. I need at least 32GB of free RAM on the Windows host to run GPU compute (Salad now… maybe Valdi one day?).
Early results:
- Hot spot writethrough algorithm looks good so far. No excessive writes during rsync, inodes promptly cached on read, no HDD activity during du on 2nd run. Will be monitoring SSD wear and inode eviction rate more later.
- Tested node recovery by removing the SSD cache: no problem, back up in 30 seconds with a simple lvconvert --uncache.
- du: 431 seconds (7.2 minutes). Scaled to OP's chart with 8 million files: ~156 seconds.
- iostat during du: ~1500-2000 tps
- Cache size estimation: SSD reads ~15MB/sec constantly during du. Over 431 seconds that makes ~6.5GB read or <50GB when 14TB is filled. 200GB cache is probably too big.
- Migrate data out with rsync: 30MB/s with cache vs 10MB/s plain ext4
- lsblk is a mess
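As promised above, a rough sketch of the setup, condensed and with placeholder device/VG/LV names rather than my exact VHDX layout:

```
# one volume group spanning the HDD and a slice of the NVMe
sudo pvcreate /dev/sdb /dev/nvme0n1p2
sudo vgcreate vg_storj /dev/sdb /dev/nvme0n1p2

# data LV on the HDD only, then a 200G writethrough cache attached from the NVMe
sudo lvcreate -n node1 -l 100%PVS vg_storj /dev/sdb
sudo lvcreate --type cache --cachemode writethrough -L 200G -n node1_cache vg_storj/node1 /dev/nvme0n1p2

# ext4 with 128-byte inodes on top of the cached LV, mounted without atime updates
sudo mkfs.ext4 -I 128 /dev/vg_storj/node1
sudo mkdir -p /mnt/node1
sudo mount -o noatime /dev/vg_storj/node1 /mnt/node1

# detaching the cache later is the single lvconvert step mentioned above
sudo lvconvert --uncache vg_storj/node1
```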
Is zfs the only option for caching just the metadata on flash?
ext4+lvmcache would make sense, but I am sure that plain data access will wear out the SSD/NVMe.
If only the filewalker causes this high I/O issue,
then why is there no option/feature to index the layout in a file database outside of the node directory?
It seems like they are already testing such a solution: storagenode/blobstore: blobstore with caching file stat information (…) · storj/storj@2fceb6c · GitHub
OK, good to hear that work is already being done.
SSD wear might not be that bad. For example, the consumer 2TB drive I'm using is rated for 1600 TBW, or ~0.87TB/day over 5 years. That's enough to handle test data churning at 100Mbps all day for a couple of years.
This answer describes that more than one read miss is needed to promote a block to the cache, which should greatly reduce wear even in the worst case. I imagine L2ARC operates similarly. I'll track data written over the next month and see.
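For tracking it, the drive's own counters are probably the easiest source (smartmontools; device names are placeholders):

```
# NVMe: "Data Units Written" is reported in units of 512,000 bytes
sudo smartctl -A /dev/nvme0 | grep -i 'data units written'

# SATA SSDs usually expose a Total_LBAs_Written attribute instead
sudo smartctl -A /dev/sda | grep -i 'total_lbas_written'
```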
Will also be checking out the Storj badger database solution when it's ready!