Parallel copy / rsync

Prompted by the recent post about a node migration plan, I took a look at my notes on parallelizing copy / rsync. I found three options there.

1. The first option is rsync driven by GNU parallel:

# for local use:
cd src-dir
find . -type d -o -type f |
  parallel -j10 -X rsync -zR -Ha ./{} ~/path/to/dest-dir/

# for ssh use:
cd src-dir
find . -type d -o -type f |
  parallel -j10 -X rsync -zR -Ha ./{} fooserver:/dest-dir/

# Those flags [-zR -Ha] are just an example - a very preliminary draft I briefly used for something else (-R is the short form of --relative, so it only needs to appear once).

2. The second option is parsyncfp and parsyncfp2:
http://moo.nac.uci.edu/~hjm/parsync/

I admit that I have never tried this software.

3. The third option is Oracle File Storage Parallel Tools.

I have been using it for something other than Storj and saw very positive results. However, the environment was specific, and I have no hard numbers on the speedup, only the impression that it was noticeable and rather significant.

The follow-up discussion, as I understand it, was not favorable to the first option in a standard local environment, i.e. when syncing files between two local drives or between two directories on one local drive; the concern was drive limits in terms of seek latency.

To be honest, when I was preparing those notes, my use case was mostly transferring files over the internet, not just on a local machine.

I am wondering if you have any experience with parallelizing copy / rsync, locally or over the internet, and whether you can provide additional comments or, even better, measured results.

I believe this might be useful, as not everybody is transferring “entire filesystems”, and rsync, as far as I know, is the method recommended by Storj. Thus I am making this a separate topic in the hope of some substantive discussion.


I would not support a parallel rsync from one drive - the drive will struggle with seek time (latency) and you will split its IOPS, so in general it will take more time than a single rsync, even if we omit the case of files changing on the destination while another rsync process is unaware of it. I would suggest you test your method first, but my gut tells me it will be a disaster: a waste of time at minimum, multiple transfers of the same data as an edge case, and possibly even corrupted data if two processes try to transfer the same file simultaneously but with some shift in time.
I have to agree with @penfold. I do not like anything Oracle-related because of their behavior; they tend to take a good product they recently bought and make it unusable.


Well, to be honest, I wrote my notes with transfers of significant amounts of data over the internet in mind, and @arrogantrabbit's comments in the other thread made me think specifically about transfers on a single drive.

Thus I created this separate topic in the hope that somebody who currently needs to transfer a larger amount of data will be able to provide measured results, together with a short description of the environment the transfer was carried out in.

I would like to add that I think this is not only about seek latency, particularly when the transfer is made over the internet; however, that is just a guess, as I don't have any measured results to present.

The closest analogies I can offer for now are a) my experience with Oracle Parallel Tools and b) my general experience with aria2c, which does a lot of its network work in parallel.

I am sorry, I don't follow your comments about other products being unusable, and to be honest I am not interested in discussing products that are not directly related to storage node operations in this thread.

4. Another tool to try is wcp [link and link], as suggested by @atomsymbol here.

EDIT:
I just came by two other tools:

5. WDT (Warp speed Data Transfer) by FB. More info here and here. There is also a wrapper, Warp-CLI, that in theory provides an easy wdt build and a pretty nice CLI; however, I have to admit it did not work in my case - I was getting <Illegal instruction (core dumped)> after the build finished (I tried Ubuntu 20 and 22 as well as Rocky 8 and CentOS 8 Stream on x86_64; Arm64 seems not to be supported by wdt at all, though I only tried it very briefly).

EDIT: The automatic installation script provided by the author of Warp-CLI seems to work perfectly. My core-dump problems were caused by a quite old AMD processor; on modern x86_64 machines everything went pretty smoothly. For Arm64, some hints can be found here [just a note - I did not do any additional testing on Arm]. I also can't comment on performance; however, if anybody is interested in transferring a significant number of files over a network, IMO wdt looks promising, at least for now.

and

6. HPN-SSH by PSC, with some info here and here.
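As a very rough sketch of how HPN-SSH could slot into an rsync workflow: recent PSC packages ship a separate hpnssh client/server pair that defaults to port 2222. Everything below is an untested assumption on my part - the host alias is made up, and both the config file location and option handling may differ between hpnssh builds, so verify against your install:

```
# hypothetical ssh config fragment for a node running hpnsshd
Host fooserver-hpn
    HostName fooserver
    Port 2222            # hpnssh default listen port (assumption)

# then point rsync at the HPN client binary instead of plain ssh:
#   rsync -aH -e hpnssh fooserver-hpn:/src-dir/ /dest-dir/
```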

For the last two of those tools, the performance graph is as follows:


Source: GitHub - JustinTimperio/warp-cli: A CLI tool designed to make interacting with Facebook's Open Source Library "Warp Speed Data Transfer" fast and pain-free.

I remain interested in learning more about real-life experience with these tools in connection with Storj, so if you have any constructive [ :slight_smile: ] remarks, please do post.