Moving a node with rsync is terribly slow: an idea to improve it

I had to move nodes, and using rsync for this is very slow: it is not well suited to millions of tiny files, among other reasons.
It is not even the transfer time that makes it so terribly slow; it is the time rsync needs to work out which files to send.
Here is the data from the final run for two nodes:
Node 1: Total received 2.85G, took from Thursday to Wednesday = 6 days
Node 2: Total received 343G, took from Wednesday to Sunday the week after = 11 days
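
As a rough check that transfer time is not the bottleneck, these figures can be turned into an average rate. This is only a sketch: it assumes "G" means 10^9 bytes and that the runs lasted whole days, neither of which is stated exactly.

```python
SECONDS_PER_DAY = 86_400

def average_rate_bytes_per_s(total_bytes: float, days: float) -> float:
    """Average transfer rate over the whole run, in bytes per second."""
    return total_bytes / (days * SECONDS_PER_DAY)

# Assumption: "G" in the rsync summary is read as 10^9 bytes.
node1 = average_rate_bytes_per_s(2.85e9, 6)
node2 = average_rate_bytes_per_s(343e9, 11)

print(f"Node 1: {node1 / 1e3:.1f} kB/s")
print(f"Node 2: {node2 / 1e3:.1f} kB/s")
```

Both averages are far below what even a slow disk or network link can sustain, which fits the observation that most of the time went into comparing files rather than sending them.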

The worst thing of all is that the node needs to be offline for this final run, because that is the other problem with rsync: it is not aware of changes made while it is running.
So node 2 was offline for 11 days just to finish the move. This hurts not only the earnings but also the online reputation.

There would be a solution:
After the first or second rsync run, during which the node is still online, we’d only need a list of the files that still have to be transferred, instead of letting rsync do the comparison.
This list has to be provided by the node.

So my suggestion is to create a log mode or a node mode in which the node writes the paths of all new files in the blobs and trash folders into a list. This would then record all the changes made while rsync and the node are running. For the final rsync run, instead of letting rsync go through thousands of folders with millions of files to determine what has changed, we could simply feed that list into rsync and transfer only those files.
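
To make the idea concrete: rsync can already take an explicit list of paths through its `--files-from` option, so only the list itself is missing. Below is a minimal sketch, assuming the node wrote one changed path per line, relative to the storage directory. The list name `changed-files.txt`, the helper names and both directories are hypothetical, and no such log mode exists in the storagenode today.

```python
import shlex

def normalise_paths(lines):
    """Strip whitespace, drop empty lines and duplicates, keep first-seen order."""
    cleaned = (line.strip() for line in lines)
    return list(dict.fromkeys(p for p in cleaned if p))

def build_final_rsync(list_file, source_dir, dest_dir):
    """Build an rsync command that transfers only the paths named in list_file.

    Paths in list_file are interpreted relative to source_dir.
    """
    return [
        "rsync", "-a",
        f"--files-from={list_file}",
        source_dir.rstrip("/") + "/",
        dest_dir.rstrip("/") + "/",
    ]

# Hypothetical log entries, as the proposed node mode might record them.
logged = ["blobs/aa/file1.sj1\n", "trash/bb/file2.sj1\n", "blobs/aa/file1.sj1\n", "\n"]
print(normalise_paths(logged))
print(shlex.join(build_final_rsync("changed-files.txt", "/mnt/old/storage", "/mnt/new/storage")))
```

Note that a list like this only covers new files. Files deleted on the source in the meantime would still need to be removed on the destination by some other means.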

If I had such a file list for my node 1 with only 2.85 GB of files that need to be transferred, I am sure it would not have been offline for 6 days.

What I did was set the node size lower than the space it actually uses, so that it stops receiving new data.
Only two rsync runs are needed then: transfer everything, then take the node offline and run rsync again to delete files that no longer exist since the first transfer. The second run took just a few (6?) hours for a node close to 10TB in size.


I’ve transferred a few dozen nodes so far (all sizes up to 10TB), and rsync is the correct way to do it. Keep the node running, run rsync once with -aP, and let it finish. Rerun with -aP --delete-during and let it finish. Then stop the node, wait a bit for everything to settle (or issue a sync on Linux), and run it a third time with -aP --delete-during; after that the node is transferred. The longest offline time I can remember (for a 7TB node) is about 4 hours. rsync has been around for close to 30 years and handles millions of files about as efficiently as possible. With its default settings (i.e. if you use the above), it never reads file contents to decide what to send: it only compares each file’s size and modification time, so scanning an entire directory takes about as long as running ls on it, literally a couple of seconds.
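
The three passes described above can be sketched as a small command builder. The flags are the ones named in the post. The two directories are hypothetical placeholders, and the trailing slash on the source (copy the directory's contents rather than the directory itself) is an assumption about the intended layout.

```python
import shlex

def rsync_pass(source_dir, dest_dir, delete=False):
    """Build one rsync invocation; delete=True adds --delete-during."""
    cmd = ["rsync", "-aP"]
    if delete:
        cmd.append("--delete-during")
    # Trailing slash on the source: sync its contents into dest_dir (assumption).
    cmd += [source_dir.rstrip("/") + "/", dest_dir.rstrip("/") + "/"]
    return cmd

def migration_plan(source_dir, dest_dir):
    """Return (node state, command) pairs for the three passes."""
    return [
        ("node running", rsync_pass(source_dir, dest_dir)),
        ("node running", rsync_pass(source_dir, dest_dir, delete=True)),
        ("node stopped", rsync_pass(source_dir, dest_dir, delete=True)),
    ]

# Hypothetical paths, for illustration only.
for state, cmd in migration_plan("/mnt/old/storagenode", "/mnt/new/storagenode"):
    print(f"[{state}] {shlex.join(cmd)}")
```

Only the last pass requires downtime, and by then the first two passes have already copied almost everything.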

I for one do not support adding complexity to the storagenode software. The less code there is, the easier it is to audit, and the less likely a bug is to show up.


As Mitosos already wrote, rsync takes about as long as ls.
If it takes 6 days to rsync, you either have a 1PB+ node or your node performs badly.
I would even go one step further and guess that something is technically wrong with your node if it takes that long.


If you’re moving the whole drive (for example, to evacuate one that is going bad), use dd/ddrescue. Run fsck on the source drive (to fix what you can), ddrescue it to the destination drive, then run fsck one last time on the new drive for good measure. Except for any damaged parts, it should transfer at 150-250MB/s.
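
The fsck, ddrescue, fsck sequence could look roughly like the sketch below. It assumes a partition-to-partition copy with both filesystems unmounted and a destination at least as large as the source. The device names and the map file name are hypothetical placeholders, and `-f` is what lets ddrescue overwrite an existing device, so the destination's contents are destroyed.

```python
import shlex

def drive_clone_plan(source_part, dest_part, mapfile):
    """Return the three commands in order: check source, clone, check copy."""
    return [
        # Repair what can be repaired on the (unmounted) source first.
        ["fsck", source_part],
        # Block-level copy; the map file lets an interrupted run resume.
        # WARNING: -f allows overwriting dest_part, destroying its contents.
        ["ddrescue", "-f", source_part, dest_part, mapfile],
        # Final consistency check on the copy.
        ["fsck", dest_part],
    ]

# Hypothetical device names, for illustration only.
for cmd in drive_clone_plan("/dev/sdX1", "/dev/sdY1", "rescue.map"):
    print(shlex.join(cmd))
```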

If you need to use a temp location (or are just moving to a larger drive), try to use an SSD for any image files. For example, you could ddrescue the old drive (or just a partition) to an image file on the SSD (at 150-250MB/s), then loopback-mount that image to “see” the normal files, then copy/rsync them to the destination drive. That way you only eat the slow final copy of millions of files once, not twice. The downside is that if you’re copying a 10TB HDD, not everyone has 10TB of SSD space for a temp file.

Basically you’re trying to move partitions as large contiguous blocks, because it’s way faster. Even if you’re only moving a Storj directory, it can be faster to copy an entire partition and then delete any files you don’t want once the copy completes. You don’t care that you may have copied millions of extra files, because that part was quick, and you don’t care how slow they are to delete, as long as the Storj node is happily running in the background.


If you make sure to stop ingress before running rsync, then, perhaps surprisingly, there will be no new files in the blobs directory to compare against. As such, it would be enough to rsync the databases, orders, etc.


Keep running rsync on a live node until it completes in a few minutes, and only then shut down the node to do the last one.

The filesystem already does this. After a few rsync runs all the metadata is in RAM, so comparing the differences will be very fast. You can also tell rsync to only compare

What is the filesystem there?
