I recently embarked on a task to migrate my Storj node data from a source disk formatted with XFS to a destination disk formatted with Ext4. Previously I had two nodes running from that same disk, which caused high iowait and made the system unresponsive: time to move the nodes to dedicated disks and the DBs to an SSD!
I used the rsync -a --whole-file --delete command to transfer approximately 5.2TB of data. Neither the source nor the destination had any other obvious disk access during that time. The transfer showed remarkably consistent speeds within certain ranges, but these ranges shifted unpredictably over the course of the transfer, approximately as follows:
7 hours averaging 22MB/s
8 hours consistently at 12MB/s
16 hours approximately at 14MB/s
24 hours sharply dropping to 2MB/s
21 hours increasing to 10MB/s
10 hours peaking at 38MB/s
Final 3 hours concluding at approximately 5MB/s
As can be seen, the speed stayed consistent within each of these ranges for extended periods.
I’m trying to understand the possible reasons behind such erratic transfer speeds. Could it be related to file system fragmentation, the nature of the stored files (many small, non-sequential files), or perhaps I/O contention on my system?
I'm currently copying another node in a similar scenario (the disk hosted 2 nodes before, and both nodes are still running on the source disk), and the consistent fluctuations repeat, even more extremely this time.
Assuming there truly are no other obvious disk accesses, and no other software or hardware processes bottlenecking the transfer:
Average file size is indeed one of the primary factors affecting rsync speed: many small files transfer more slowly than a few big files. Rsync copies files in depth-first order, so it copies blobs and garbage separately; within each group, satellite by satellite; and within each satellite, each two-letter directory separately. Since each subdirectory is essentially a proper random sample of all the pieces you store, the average speed within a satellite will indeed be constant, absent other perturbations. At the same time, different satellites do store files with different average sizes.
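If this is right, the slow and fast phases should line up with satellites whose pieces are smaller or larger on average. A minimal sketch to check this, assuming a typical node layout under storage/blobs (the /mnt/source path is an assumption; adjust to your mount point):

```shell
# Sketch: print file count and average piece size per satellite blob
# directory. The /mnt/source/storage path is a placeholder.
avg_piece_size() {
    # $1 = one satellite's blob directory
    find "$1" -type f -printf '%s\n' \
        | awk -v d="$1" '{n++; t+=$1} END {if (n) printf "%s: %d files, avg %.0f KiB\n", d, n, t/n/1024}'
}

for sat in /mnt/source/storage/blobs/*/; do
    if [ -d "$sat" ]; then avg_piece_size "$sat"; fi
done
```

Satellites with a noticeably smaller average size should correspond to the slower phases of the copy.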
So, at least to some degree, this hypothesis would explain some of your observations. You can verify it by examining a time-stamped log of the file-by-file transfer, for example one produced with annotate-output rsync -v … (annotate-output comes from Debian’s devscripts package).
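For instance, a sketch of that approach (the source and destination paths are placeholders):

```shell
# annotate-output (Debian devscripts) prefixes every rsync log line with
# a timestamp; the paths below are placeholders:
#
#   annotate-output +%H:%M:%S rsync -a --whole-file --delete -v \
#       /mnt/source/storage/ /mnt/dest/storage/ > rsync-timed.log
#
# The timestamps then let you count files transferred per second:
files_per_second() {
    # $1 = the annotate-output log; prints "count HH:MM:SS" lines
    cut -d' ' -f1 "$1" | uniq -c
}
```

A second of many small files will show a high count at low MB/s; a second of few large files, the opposite.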
I guess some of the smaller blips might be due to ext4 features such as htrees for directory entries. Small directories do not use them, large ones do, and the two approaches have different performance characteristics. Again, because of the random-sample distribution, the proportion of small to large directories may differ between two-letter directories.
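If you want to check whether htrees are actually in play, a rough sketch (the device and directory are placeholders):

```shell
# Check that the filesystem has hashed directory indexes enabled at all:
tune2fs -l /dev/sdb1 | grep -i 'features'      # look for dir_index
# Check whether one specific directory actually uses an htree;
# an indexed directory shows the INDEX flag in its inode:
debugfs -R 'stat /storage/blobs/xx' /dev/sdb1
```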
It might also be inconsistent fragmentation of your source directory or inode structures, but that would require more analysis. I don’t know XFS, sorry.
BTW, I see in your post history that you are using single-board ARM devices. I wouldn’t discount hardware overheating and hence slowing down, or some unrelated processes increasing latency of synchronous operations, like journal writes. I did have a device in the past where the network card and the SATA controllers were on the same slow bus, impacting each other.
I’m in the same boat, migrating away from XFS to ext4, and I see the same thing on multiple nodes. I believe it is due to file sizes: some satellites may have more small files than others. It also looks like it might depend on node age to some extent; older nodes might have fewer small files if they have been full for some time already.
Checking the read IOPS on the source disk, it is pretty much consistent the whole time, while the copy speeds vary wildly, approximately mimicking what you see.
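For anyone wanting to reproduce this observation, a sketch using sysstat's iostat (the device names are placeholders):

```shell
# r/s = read IOPS, rMB/s = read throughput; one extended sample every 5 s.
# Constant r/s with varying rMB/s points at varying file sizes, not the disk.
iostat -dxm 5 /dev/sda /dev/sdb
```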
The attached example isn’t the very best, but it’s the only one I have at the moment. I’ve seen cases where the fastest copy speed lasted for the majority of the copy time, usually at the start and then again at the end of the whole copy, while I was doing a parallel cp of the whole Storj folder.
To me, it looks like XFS is quite bad at deleting files, which storagenode does often and in big batches of reasonably small files.
The nodes I run on XFS take much longer to do the filewalker than nodes running ext4, especially if there is trash that needs to be purged at the same time. The purge rate is also slower compared to ext4.
All this then brings the write latency up significantly for extended periods of time, causing these nodes to lose races.
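A rough way to compare delete behavior between the two filesystems, assuming a scratch directory on each disk (the path, file count, and sizes are arbitrary, and dropping caches needs root):

```shell
# Create many small files, roughly the shape of storagenode trash,
# then time a cold-cache batch delete. Run once per filesystem.
mkdir -p /mnt/test/delbench && cd /mnt/test/delbench
for i in $(seq 1 100000); do head -c 4096 /dev/urandom > "f$i"; done
sync && echo 3 > /proc/sys/vm/drop_caches   # needs root
time rm -f f*
```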
I then found this article, T2 Magazine - Linux filesystem benchmark 2008/1 (t2sde.org), which to some extent confirmed my observations, along with many other forum posts from people complaining about poor delete performance. I also tried increasing the RAM available to storagenode, which did help to some extent, but as the nodes grew, the problem amplified again.
In some of these forum posts, people were recommended mount options to improve this, but when I tried them, the ls command seemed to take much longer than before, with no visible improvement to deletes.
This is also why I asked @syncamide to try running delete tests on the various filesystems he was benchmarking: the linked article is quite old now, and XFS may have improved since then.
I kindly asked you recently, @Alexey, not to ask me to do any testing, as that seems to me like Storj's domain. Maybe @snorkel’s friend could also provide you with some additional advice on this topic. :- )

I have to say, sorry, man. I’m currently running some nodes with XFS, but my setup is nonstandard. I believe it is fully in line with Storj's ToS, but you really don’t want to hear all of its details; in a nutshell, I’m running XFS on top of Lustre with metadata on ext4. Nevertheless, I am surprised to hear about all those problems related to XFS.

For a lightweight setup, I would probably try XFS storing the data blocks, with the XFS metadata separated out to NVMe (if possible). On paper it looks like a great solution to me. Should it not work, I would probably blame Storj. For an agile and super-reliable setup, I would probably go for a combination of Lustre and ZFS, possibly highly tuned by @arrogantrabbit: Lustre working as a read cache and ZFS caching on the write side (something like a combination of Amazon File Cache and Oracle ZFS Storage). :- )

As for the problems with rsync, I do not think they are related only to disk seek time, but I can’t provide a precise explanation. I can refer you to my other post on this topic; a small disclaimer is that its focus is mostly on data transfers over WAN, but I still believe it contains some useful info for local transfers as well. I will also try to update that post with a few other tools, possibly later today. Cheers.
We likely will not test a nonstandard setup like something on top of something on top of some custom RAID or modified filesystems.
Our team is not that big; we do not have enough resources to test every possible setup, I’m sorry. This is why we usually ask Operators who want to use a custom setup to do it on their own. Later we update the documentation when some setups prove to be either unstable or lead to disqualification (like splitting a storage folder across different disks).
My opinion: the simpler the setup, the better.
Could you please re-read? I was not asking you to test XFS and ext4 on top of Lustre. However (as an off-topic aside), exposing separate XFS and ext4 partitions to a storage node for data and metadata respectively seems like a viable alternative to the proposal expressed in the thread titled “Design draft: a low I/O piece storage”. And now to the point: regarding your post #24839, what do you want me to look at? Can you check a setup consisting of XFS storing blocks in one place, on a spinning drive, and XFS storing metadata in another place, on an NVMe/SSD drive, and come back to me with your conclusion, please? Otherwise, I will understand that you are just pointing to other people’s posts.
Off the top of my head, so not tested, and with additional tuning needed:
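One way to approximate “data on the spinning drive, metadata on NVMe” with stock XFS is the realtime subvolume: the main device keeps all metadata and the journal, while file data goes to the realtime device. A minimal, untested sketch; the device names are placeholders, and this layout is not widely tested for storagenode:

```shell
# NVMe partition holds the XFS metadata + log; the spinning disk becomes
# the realtime (data) subvolume. rtinherit=1 makes new files land on it.
mkfs.xfs -f -d rtinherit=1 -r rtdev=/dev/sda1 /dev/nvme0n1p2
mount -o rtdev=/dev/sda1 /dev/nvme0n1p2 /mnt/storj
```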
The Community will wait for your report. Since you didn’t change the content of the storage folder (you didn’t, for example, try to mount every subfolder as a separate disk), this should work.
As for how well or badly it works, we would wait for your report in that regard.
Alexey, I have kindly informed you about three times now not to ask me to do any testing. I am really sorry, but I am currently not interested in such a setup. (Maybe that will change in the future, but for now it is as it is.) Nevertheless, I have read, more than a few times, your very negative opinion of XFS, which I have to admit was very surprising to me. If it is of interest to you, check it yourself and let me / us know.
I still do not completely understand you, your intentions, or your wishes, sorry.
If you are trying to do something, then I would expect you to share the result of your attempt.
If it is just an expectation that someone else will do what you suggested, then that’s a different thing.
I see there must be a misunderstanding. Do you still maintain your position that XFS can’t correctly handle storagenode operations? That was my understanding, and thus I raised my voice. Sorry, I can’t do testing of something as fundamental as filesystem performance on behalf of your company. And if it is neither the first nor the second, then what’s your problem?
I do not know. I can only point you to the posts of other SNOs who use XFS: Topics tagged xfs
SNOs reported slower filewalker processing and slower migrations, like in this thread.
We are not talking about Storj Labs, only about the Community. My problem is that you suggest doing something, but refuse to provide any evidence that it works, and seem to expect that Storj Labs would stretch its resources to test exotic configurations on top of the standard generic filesystems. It could be possible, but the priority for such tests is very low, so only help from the Community can move this further and figure out what’s best for storagenode. You did contribute by suggesting a combination of XFS, its tuning, and Lustre; however, it would have even more value with a proof of concept. If you do not want to do a PoC, fine, but it is unlikely to be tested by Storj Labs, so please do not expect this. Maybe someone else from the Community would like to test this configuration and provide their feedback.
My opinion has not changed so far: the best FS for Linux is ext4, and for Windows it is NTFS. All custom configurations may work better, but they are not fully tested, and Storj Labs doesn’t have excess resources to test every possible configuration of each FS, so I wouldn’t expect such tests any time soon. This is where the power of the Community comes into play: someone suggests something, tries it, gets better (or worse) results, and shares them with other Community members.
However, Storj Labs does perform some tests to find better ways to store data, as mentioned here:
Now I am starting to understand it a little better, although I’m still not sure I understand it in full. And I am really not sure if I want to dive any deeper. Nevertheless, I guess that instead of running a node and exchanging ideas and knowledge about topics related to information technology, you want me to be a kind of participant in “an Amway or an Avon sociological movement”? Actually, I really can’t do it. Sorry. And again, I am very sorry, but for now I cannot test the draft config I provided above. If, as I understand your writing, neither you nor your company can commit enough resources to test it yourselves, maybe somebody else could deploy it on her / his / its new and shiny hardware infrastructure, provided the kernel supports this kind of filesystem. :- )