Rsync never catches up, unable to migrate

Hi,

I’ve been trying to migrate a node to a new machine for a while, but each rsync pass takes so long that by the time it finishes, the data has changed so much that the next sync takes just as long. I can’t seem to catch up. It’s roughly 4 TB over a 1 Gbit LAN, with no hardware failures detected, no errors logged, and no frozen processes.

Any tips? Has this been reported before? I couldn’t find anything about it.

I’m going to test using the --size-only flag suggested here to see if I can catch up.
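
Something like this, if I understood the suggestion correctly (paths are made up for the example):

```bash
# only re-copies files whose size differs on the destination; fast for blobs,
# but not something to rely on for the very last pass
rsync -aP --size-only /mnt/old/storagenode/ /mnt/new/storagenode/
```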

Thanks!

Thanks for your reply!

I’m using -aP, and will use --delete once I get there, as per the guide.
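
For reference, the commands look roughly like this (example paths, not my real mount points):

```bash
# repeated while the node keeps running, until each pass gets reasonably short
rsync -aP /mnt/old/storagenode/ /mnt/new/storagenode/

# last pass per the guide, with the node stopped, so deletions are mirrored too
rsync -aP --delete /mnt/old/storagenode/ /mnt/new/storagenode/
```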

  1. Network throughput is reported as 936 Mbit/s by an iperf test
  2. Disk being written to is an Exos drive
  3. Noted
  4. The migration is to a 10Gbit capable machine. :smiley:

I’ll keep mucking around, maybe I’ll figure something out.

Reduce your storage node’s allocated disk space to as little as possible; that way you have a chance of preventing ingress while rsync is running, which means fewer changes to resynchronize.

It’s fine to just set it to 500GB for the duration of the migration, nothing will get lost.

1 Like

Or just stop the node. A couple of days of downtime won’t cause any major dramas.

1 Like

That sounds a little sketchy, not sure I want to do that. :smiley: Wouldn’t the software treat this as corrupt/lost data?

I tried that for a couple of hours and it bumped me down from 100% to 99%. Not sure I’d like to go offline for a couple of days.

Nope. It’s just a declaration of “I don’t want any more data for now”, nothing more. This setting is supposed to be used this way. I’ve got several nodes I want to downsize set up this way, and they slowly shrink as deletes come from the network.

1 Like

It takes a long time to migrate… the first few rsync passes will manage something like 2 TB per day, due to the limits of HDD IOPS and such.

So you should expect the first rsync to take 2 days, the next one might take 1 day, the next one around 12 hours… and it just keeps improving. I usually end up running something like 7-12 rsyncs before the time drops to a level I consider acceptable: 10-15, maybe 30 minutes. Really it depends more on when I have time than on when the rsync is quick enough.

I just keep queuing them up endlessly, or something similar.

So yeah, long story short: just keep running rsync and you should catch up…
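
Something like this if you want to automate it, an untested sketch with made-up paths and threshold:

```bash
#!/bin/bash
# keep re-running rsync until a full pass finishes in under 30 minutes
SRC=/mnt/old/storagenode/    # adjust to your layout
DST=/mnt/new/storagenode/

while true; do
    start=$(date +%s)
    rsync -aP "$SRC" "$DST"
    elapsed=$(( $(date +%s) - start ))
    echo "this pass took ${elapsed}s"
    [ "$elapsed" -lt 1800 ] && break
done
```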

Or else shut down the node… it doesn’t really matter too much. I had about 3 days of downtime at the beginning of this month and it didn’t seem to matter at all… and you’ve got 12 days of runway until your node gets suspended, and a suspension doesn’t mean much… just no ingress.

Also set your max capacity lower than what you’ve got stored; then ingress won’t slow you down and rsync will catch up more easily.

Storagenode ingress averages maybe ½ Mbit at most… if even that… closer to ¼ most of the time, or less.
So yeah, you will catch up… just keep going…

1 Like

Ah, I see. So in practice, it means stopping and deleting the current docker container and starting a new one with the storage flag set to something low?

I’ve migrated plenty of smaller nodes before, many times, when playing around with different hardware. But this migration has been going on for weeks. It does finish the sync rounds, but yeah, can’t ever catch up for some reason.

Yep! Start your container exactly the same way as before, just with a lower number for the STORAGE setting.
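
Roughly like this, using the standard run command from the docs as a template. Everything here is a placeholder, so keep your own ports, paths, wallet, address, and image tag; only STORAGE changes:

```bash
# stop and remove the old container (the 300s timeout lets the node shut down cleanly)
docker stop -t 300 storagenode
docker rm storagenode

# start it again with the same parameters, only STORAGE lowered for the migration
docker run -d --restart unless-stopped --stop-timeout 300 \
    -p 28967:28967/tcp -p 28967:28967/udp -p 14002:14002 \
    -e WALLET="0x..." \
    -e EMAIL="you@example.com" \
    -e ADDRESS="your.ddns.example:28967" \
    -e STORAGE="500GB" \
    --mount type=bind,source=/path/to/identity,destination=/app/identity \
    --mount type=bind,source=/path/to/storage,destination=/app/config \
    --name storagenode storjlabs/storagenode:latest
```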

1 Like

Thank you! I think this is the closest I can get to a solution, even though I technically have no idea why it’s happening.

The potential problem I see here is that this speed is only reachable when transferring big chunks of data. Storj pieces are millions of very small files, so I highly doubt the rsync process can use all of your bandwidth.

When possible, that’s actually a great idea! :slight_smile: That would run the thing at maximum speed for sure.

Oh wow, really? That seems like a lot for good disks like the Exos :thinking:
Weird… Something seems off indeed.

1 Like

There is another thing you could try: bash - Speed up rsync with Simultaneous/Concurrent File Transfers? - Stack Overflow (or similar, search Google for parallel rsync).

Rsync cannot run parallel threads natively, so basically this tries to run several independent rsync instances in parallel, which helps to max out the existing bandwidth.
This could work; however, the result must be checked very carefully to make sure that all of the source data has really been copied.
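
One common pattern from those threads, sketched for a storagenode layout. I’m assuming the usual blobs/<satellite> folders and made-up paths here, so verify it against your own setup before trusting it:

```bash
# run up to 4 rsync instances in parallel, one per satellite folder under blobs/
SRC=/mnt/old/storagenode/storage    # example paths, adjust to your layout
DST=/mnt/new/storagenode/storage

mkdir -p "$DST/blobs"
ls "$SRC/blobs" | xargs -P4 -I{} \
    rsync -aP "$SRC/blobs/{}/" "$DST/blobs/{}/"

# then one ordinary rsync for everything outside blobs (databases, orders, ...)
rsync -aP --exclude=/blobs "$SRC/" "$DST/"
```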

1 Like

You could probably even use --ignore-existing, since blobs can’t change anyway. Just make sure you don’t use that option for the last run, when your node is offline.
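
i.e. something along these lines for the intermediate passes (illustrative paths):

```bash
# fine for intermediate passes only: existing blobs never change in place,
# but databases do, so drop --ignore-existing for the final run with the node offline
rsync -aP --ignore-existing /mnt/old/storagenode/ /mnt/new/storagenode/
```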

2 Likes

Thanks guys, I might try these methods later on. Right now I’m running with the --size-only flag after setting the node’s STORAGE to less than the amount of data stored. I’ll give that a day or two to see where I get. I’ll report back for the curious.

1 Like

My own migration script uses rsync -avAXE --inplace --partial --del, but I haven’t used it for quite some time, so please verify whether all the flags make sense.

3 Likes

Well, the migration is finally completed. I still have no idea why one of the nodes was taking so long. Best guess is that the source disk or its interface had an issue I couldn’t catch. Besides that, everything went without a hitch.

Thanks everyone for all your suggestions and support!

4 Likes

Is it really required to remove that on the last run?
What I have found is that this option

  forces rsync to skip any files which exist on the destination and have a modified time that is newer than the source file. (If an existing destination file has a modification time equal to the source file’s, it will be updated if the sizes are different.)

Wouldn’t a different modification time or size be enough to ensure that only the current files exist on the destination?

It’s much more important to run the last rsync with the --delete option while the source node is stopped, to remove temporary database files; otherwise the databases would most likely end up corrupted on the destination.
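
For example, the last round could look like this (example paths and container name; adjust to your setup):

```bash
# stop and remove the source node so nothing writes to the databases anymore
docker stop -t 300 storagenode
docker rm storagenode

# final pass: --delete also removes leftover temporary database files
# (e.g. SQLite -wal/-shm files) that no longer exist on the stopped source
rsync -aP --delete /mnt/old/storagenode/ /mnt/new/storagenode/

# then start the node on the new machine, pointing at the new storage path
```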

1 Like