Rsync never catches up, unable to migrate

Hi,

I’ve been trying to migrate a node to a new machine for a while, but the rsync takes so long that, by the time it finally finishes, the data has changed so much that the next sync takes just as long. I can’t seem to catch up. It’s roughly 4 TB over a 1 Gbit LAN, with no hardware failures detected, no errors logged, and no processes frozen.

Any tips? Has this been reported before? I couldn’t find anything about it.

I’m going to test using the --size-only flag suggested here to see if I can catch up.

Thanks!

Hi,

Sorry to ask, but you are using the rsync -aP option, aren’t you? This will only copy the changed files. Then, when you are ready to switch over, use rsync -aP --delete.
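The two-phase approach described above might look roughly like this. It is only a sketch: the source path, destination host, and the `run` helper are placeholders I made up for illustration, and by default the script just prints the commands instead of running them.

```shell
#!/bin/sh
# Placeholder locations (assumptions, not from this thread): adjust to your setup.
# The trailing slash on SRC makes rsync copy the directory's contents.
SRC="/mnt/storagenode/"
DEST="user@newhost:/mnt/storagenode/"

# Print the command by default; execute it only when DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Repeatable pass while the node is still online: copies new and changed files.
sync_pass() { run rsync -aP "$SRC" "$DEST"; }

# Last pass, with the node stopped: also removes files that were deleted on the source.
final_pass() { run rsync -aP --delete "$SRC" "$DEST"; }

sync_pass
final_pass
```

Running it with `DRY_RUN=0` would execute both passes back to back, which is not what you want in practice: the first pass is meant to be repeated, and the `--delete` pass belongs after the node has been stopped.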

Edit: Added link to the article "How do I migrate my node to a new device? - Node Operator".

4 TB over a 1 Gbit LAN at roughly 82 Mbps would take around 4 days, 23 hours. The changes in 5 days on a node are roughly 100 GB, so the changes should take a further 3 hours. Therefore I have to assume:

  1. The LAN is not capable of running point to point at 1 Gbit speeds, or even at 82 Mbps or greater - do you have other kit using the link? Is your switch capable of non-blocking 1 Gbit speeds? Are you using jumbo frames?

  2. The source disk or the destination disk is unable to read / write at a speed of 82 Mbps - you will need some decent hardware and disks to hit that sort of speed, plus operating systems. If you are on consumer-grade SMR disks for the write, this is going to be an issue: once the disk is over 1/4 full, the write performance will drop to 1-3 Mbps or worse.

  3. You are potentially better off using cloning software, like Clonezilla, and having both disks in one machine connected via SATA 3 or equivalent - yes, your node will be offline, but it will be copied far quicker than over a 1 Gbit LAN.

  4. If the disks are all good, then upgrade your LAN from 1 Gbit to 2.5 Gbit for cheap, or ideally 10 Gbit - a 1 Gbit LAN is really not suitable for moving more than a few hundred GB around when dealing with applications that need you to be online.
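The transfer-time arithmetic at the top of this post can be reproduced with a small shell sketch. It assumes binary sizes (4 TB taken as 4096 GiB, 100 GB as 100 GiB) and decimal megabits per second; that unit choice is my assumption, picked because it reproduces the 4 days, 23 hours figure. The last line uses a nominal 1000 Mbit/s for comparison with what a saturated 1 Gbit link would give.

```shell
#!/bin/sh
# Whole seconds needed to move size_gib GiB at a sustained rate_mbit Mbit/s.
transfer_seconds() {
    size_gib=$1
    rate_mbit=$2
    # GiB -> bits, then divide by bits per second (integer division).
    echo $(( size_gib * 1024 * 1024 * 1024 * 8 / (rate_mbit * 1000 * 1000) ))
}

# Format a number of seconds as "Dd Hh Mm".
format_duration() {
    secs=$1
    echo "$(( secs / 86400 ))d $(( secs % 86400 / 3600 ))h $(( secs % 3600 / 60 ))m"
}

format_duration "$(transfer_seconds 4096 82)"    # full copy at 82 Mbit/s     -> 4d 23h 11m
format_duration "$(transfer_seconds 100 82)"     # 100 GiB of changes         -> 0d 2h 54m
format_duration "$(transfer_seconds 4096 1000)"  # full copy at nominal 1 Gbit -> 0d 9h 46m
```

The gap between the first and last line is the point: if a full pass takes days rather than hours, the link itself is unlikely to be the limit, and the disks or the per-file overhead deserve a closer look.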

Hope you can get the Rsync to work, I’m sure others will have other suggestions as well :slight_smile:

CP

Thanks for your reply!

I’m using -aP, and will use --delete once I get there, as per the guide.

  1. Network throughput is reported as 936 Mbits/sec by iperf test
  2. Disk being written to is an Exos drive
  3. Noted
  4. The migration is to a 10Gbit capable machine. :smiley:

I’ll keep mucking around, maybe I’ll figure something out.

Reduce your storage node’s allocated disk space to as little as possible. This way you have a chance of preventing ingress while doing rsync, leading to fewer changes to resynchronize.

It’s fine to just set it to 500GB for the duration of the migration, nothing will get lost.

Or just stop the node. A couple of days of downtime won’t cause any major dramas.

Reducing the allocated space sounds a little sketchy, not sure I want to do that. :smiley: Wouldn’t the software treat this as corrupt or lost data?

I tried stopping the node for a couple of hours, and it bumped me down from 100% to 99%. Not sure I’d like to go offline for a couple of days.

Nope. It’s just a declaration of “I don’t want any more data for now”, nothing more. This setting is supposed to be used this way. I’ve got several nodes I want to downsize set up this way, and they slowly shrink as deletes come from the network.

It takes a long time to migrate. The first few rsync passes will, I think, manage something like 2 TB per day due to the limits of HDD IOPS and such.

So you should expect the first rsync to take 2 days, then the next one might take 1 day, the next one about 12 hours, and it just keeps improving. I think I usually end up running 7-12 rsyncs before the time gets down to a level I consider acceptable:
10-15, maybe 30 minutes. Really, it depends more on when I have time than on when the rsync is quick enough.

I just queue them endlessly or something similar.

So yeah, long story short: just keep running rsync and you should catch up.
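Queuing passes until one is short enough could be scripted along these lines. This is a sketch, not something from my actual setup: the function name, the threshold, and the pass limit are all made up for illustration. It treats rsync exit code 24 (source files vanished during the transfer) as acceptable, since that is expected while the node is still running, and it relies on `date +%s`, which is widely available but not strictly POSIX.

```shell
#!/bin/sh
# Repeat a sync command until one pass finishes within a target time.
# $1: threshold in seconds, $2: maximum number of passes, rest: command to run.
sync_until_fast() {
    threshold=$1
    max_passes=$2
    shift 2
    pass=1
    while [ "$pass" -le "$max_passes" ]; do
        start=$(date +%s)
        "$@"
        rc=$?
        # 0 = success, 24 = some source files vanished mid-transfer (normal on a live node).
        if [ "$rc" -ne 0 ] && [ "$rc" -ne 24 ]; then
            return "$rc"
        fi
        elapsed=$(( $(date +%s) - start ))
        echo "pass $pass took ${elapsed}s"
        # Stop as soon as a pass is quick enough to schedule the final offline run.
        if [ "$elapsed" -le "$threshold" ]; then
            return 0
        fi
        pass=$(( pass + 1 ))
    done
    # Ran out of passes without getting under the threshold.
    return 1
}
```

Usage might look like `sync_until_fast 1800 15 rsync -aP /mnt/storagenode/ user@newhost:/mnt/storagenode/` (placeholder paths), followed by stopping the node and doing the last pass by hand.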

Otherwise, shut down the node. It doesn’t really matter too much: I had about 3 days of downtime at the beginning of this month and it didn’t seem to matter at all. You’ve got a 12-day runway until your node gets suspended, and a suspension doesn’t mean much, just no ingress.

Also, set your max capacity lower than what you’ve got stored; then ingress won’t slow you down and rsync will catch up more easily.

Storage node ingress is like ½ Mbit average at most, if even that; closer to ¼ most of the time, or less.
So yeah, you will catch up. Just keep going.

Ah, I see. So in practice, it means stopping and deleting the current Docker container and starting a new one with the storage flag set to something low?

I’ve migrated plenty of smaller nodes before, many times, when playing around with different hardware. But this migration has been going on for weeks. It does finish the sync rounds, but yeah, can’t ever catch up for some reason.

Yep! Start your container exactly the same way as before, just with a lower number for the STORAGE setting.
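A sketch of what that could look like, for reference. The container name `storagenode`, the image `storjlabs/storagenode:latest`, and the 300-second stop timeout are assumptions based on a typical setup, and the `run` helper and function name are invented for this example. By default it only prints the commands; everything after the storage value stands in for the rest of your usual `docker run` options, which should stay exactly as they were.

```shell
#!/bin/sh
# Print the command by default; execute it only when DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Recreate the container with a new STORAGE value.
# $1: allocation such as 500GB; remaining args: your unchanged docker run
# options (ports, mounts, wallet, address, image and so on).
restart_with_storage() {
    storage=$1
    shift
    run docker stop -t 300 storagenode
    run docker rm storagenode
    run docker run -d --name storagenode -e STORAGE="$storage" "$@"
}

# Preview only: a real call needs all of your usual options, not just the image.
restart_with_storage 500GB storjlabs/storagenode:latest
```

Once the migration is done, the same steps with the original STORAGE value bring the allocation back up.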

Thank you! I think this is the closest I can get to a solution, even though I technically have no idea why it’s happening.

The potential problem I see here is that the speed iperf reports is only reachable when transferring big chunks of data. Storj pieces are millions of very small files, so I highly doubt the rsync process can use all of your bandwidth.

When possible, that’s actually a great idea! :slight_smile: that would run the thing at maximum speed for sure.

Oh wow, really? That seems like a lot for good disks like Exos :thinking:
Weird… Something seems off indeed.

There is another thing you could try: bash - Speed up rsync with Simultaneous/Concurrent File Transfers? - Stack Overflow (or similar, search Google for parallel rsync).

Rsync cannot run parallel threads natively, so this basically tries to run several independent rsync instances in parallel, which would help to max out the existing bandwidth.
This could work; however, the result must be checked very carefully to make sure that all source data really has been copied.
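One way to sketch the idea is a single rsync per immediate subdirectory, a few at a time. The function name and the `RSYNC` override are made up for this example, and `-mindepth`, `-maxdepth`, `-print0` and `xargs -0 -P` are GNU/BSD extensions rather than strict POSIX, so treat it as an assumption-laden illustration rather than a tested migration tool.

```shell
#!/bin/sh
# Run one rsync per immediate subdirectory of a source tree, several at once.
# $1: source dir, $2: destination (local path or host:path), $3: parallel jobs.
# Set RSYNC=echo to preview the commands instead of copying anything.
parallel_sync() {
    src=$1
    dest=$2
    jobs=$3
    (
        cd "$src" || exit 1
        # Each subdirectory name is substituted for {} in both source and destination.
        find . -mindepth 1 -maxdepth 1 -type d -print0 |
            xargs -0 -P "$jobs" -I{} "${RSYNC:-rsync}" -a "{}/" "$dest/{}/"
    )
}
```

A preview call could be `RSYNC=echo parallel_sync /mnt/storagenode/storage/blobs user@newhost:/mnt/storagenode/storage/blobs 4` (placeholder paths). Files sitting directly in the source directory are not covered by this, which is one more reason to finish with a normal single rsync over the whole tree and check the result.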

You could probably even use --ignore-existing, since blobs can’t change anyway. Just make sure you don’t have that option for the last run when your node is offline.

Thanks guys, I might try these methods later on. Right now I’m running with the --size-only flag after setting the node’s STORAGE to less than the amount of data stored. I’ll give that a day or two to see where I get. I’ll report back for the curious.

My own migration script uses rsync -avAXE --inplace --partial --del, but I haven’t used it for quite some time, so please verify whether all the flags make sense.

Well, the migration is finally completed. I still have no idea why one of the nodes was taking so long. Best guess is that the source disk or its interface had an issue I couldn’t catch. Besides that, everything went without a hitch.

Thanks everyone for all your suggestions and support!
