Rsync never catches up, unable to migrate

alfananuq · July 26, 2021, 3:17pm

Hi,

I’ve been trying to migrate a node to a new machine for a while, but each rsync pass takes so long that by the time it finishes, the data has changed enough that the next pass takes just as long. I can’t seem to catch up. It’s roughly 4TB over a 1Gbit LAN, with no hardware failures detected, no errors logged, and no frozen processes.

Any tips? Has this been reported before? I couldn’t find anything about it.

I’m going to test using the --size-only flag suggested here to see if I can catch up.

Thanks!

alfananuq · July 26, 2021, 4:39pm

Thanks for your reply!

I’m using -aP, and will use --delete once I get there, as per the guide.

Network throughput is reported as 936 Mbits/sec by iperf test
Disk being written to is an Exos drive
Noted
The migration is to a 10Gbit capable machine.

I’ll keep mucking around, maybe I’ll figure something out.

Toyoo · July 26, 2021, 6:11pm

Reduce your storage node’s allocated disk space to as little as possible. This way you have a chance of preventing ingress while doing rsync, which leaves fewer changes to resynchronize.

It’s fine to just set it to 500GB for the duration of the migration, nothing will get lost.

ACarneiro · July 26, 2021, 6:23pm

Or just stop the node. A couple of days of downtime won’t cause any major dramas.

alfananuq · July 26, 2021, 6:42pm

That sounds a little sketchy, not sure I want to do that. Wouldn’t the software treat this as corrupt or lost data?

alfananuq · July 26, 2021, 6:43pm

I tried that for a couple of hours, it bumped me down from 100% to 99%. Not sure I’d like to go offline for a couple of days.

Toyoo · July 26, 2021, 6:49pm

Nope. It’s just a declaration of “I don’t want any more data for now”, nothing more. This setting is supposed to be used this way. I’ve got several nodes I want to downsize set up this way, and they slowly shrink as deletes come from the network.

SGC · July 26, 2021, 7:18pm

it takes a long time to migrate… the first few rsync passes run at, i think, something like 2tb per day due to the limits of hdd iops and such…

so you should expect the first rsync to take 2 days, then the next one might take 1 day and then the next one like 12 hours… and it just keeps improving… i think i usually end up running like 7-12 rsyncs before the time ends up at a level i consider acceptable: 10-15, maybe 30 minutes… really it depends more on when i have time than on when the rsync is quick enough.

i just queue them endlessly or something similar.

so yeah long story short just keep running rsync you should catch up…

else shut down the node… doesn’t really matter too much, i had like 3 days of downtime at the beginning of this month, didn’t seem to matter at all… and you got 12 days runway until your node gets suspended and a suspension doesn’t mean much… just no ingress…

also set your max capacity lower than what you got then ingress won’t slow you down and rsync will catch up easier…

storagenode ingress is like ½Mbit avg max… if even that… closer to 1/4 most of the time or less.
so yeah you will catch up… just keep going…
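
The repeated-pass approach described above can be sketched as a small loop. This is only an illustration: the function name `sync_until_fast`, the time budget, and the pass limit are assumptions, not part of anyone's actual setup in this thread. It treats rsync exit status 24 (source files vanished during the transfer) as acceptable, since that is expected while the node is still running.

```shell
# Sketch: repeat a sync command until one pass finishes within a time budget.
# Usage: sync_until_fast MAX_SECONDS MAX_PASSES COMMAND [ARGS...]
sync_until_fast() {
    local budget="$1" max_passes="$2"
    shift 2
    local pass=1 start elapsed status
    while [ "$pass" -le "$max_passes" ]; do
        start=$(date +%s)
        status=0
        "$@" || status=$?
        elapsed=$(( $(date +%s) - start ))
        # 24 = "partial transfer due to vanished source files" in rsync
        if [ "$status" -ne 0 ] && [ "$status" -ne 24 ]; then
            echo "pass $pass failed with status $status" >&2
            return "$status"
        fi
        echo "pass $pass took ${elapsed}s"
        if [ "$elapsed" -le "$budget" ]; then
            return 0    # fast enough: time to stop the node and do the final pass
        fi
        pass=$((pass + 1))
    done
    return 1            # never got under the budget
}
```

With hypothetical paths, `sync_until_fast 1800 12 rsync -aP /mnt/old/storagenode/ /mnt/new/storagenode/` would keep going until a pass takes 30 minutes or less, or give up after 12 passes.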

alfananuq · July 26, 2021, 7:20pm

Ah, I see. So in practice, it means stopping and deleting the current docker container and starting a new one with the storage flag set to something low?

alfananuq · July 26, 2021, 7:21pm

I’ve migrated plenty of smaller nodes before, many times, when playing around with different hardware. But this migration has been going on for weeks. It does finish the sync rounds, but yeah, can’t ever catch up for some reason.

Toyoo · July 26, 2021, 8:12pm

Yep! Start your container exactly the same way as before, just with a lower number for the STORAGE setting.
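
A minimal sketch of that restart, assuming the container is named `storagenode` and that everything after the new size is copied unchanged from the original `docker run` command. The function name and the `DOCKER` and `CONTAINER` overrides are made up for illustration.

```shell
# Sketch: recreate the node container with a lower STORAGE value.
# Assumptions: container name "storagenode"; all other options and the image
# name are passed in exactly as they appear in your original `docker run`.
DOCKER="${DOCKER:-docker}"          # set DOCKER=echo to preview the commands
CONTAINER="${CONTAINER:-storagenode}"

restart_with_storage() {
    local new_storage="$1"
    shift
    # Give the node time to shut down cleanly, then remove the old container.
    "$DOCKER" stop -t 300 "$CONTAINER" &&
    "$DOCKER" rm "$CONTAINER" &&
    # "$@" holds the rest of your original options followed by the image name.
    "$DOCKER" run -d --name "$CONTAINER" -e STORAGE="$new_storage" "$@"
}
```

Only the container is removed; the data directory on disk is not touched. Setting `DOCKER=echo` first prints the three commands instead of running them, which makes it easy to compare against the original command line.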

alfananuq · July 26, 2021, 8:42pm

Thank you! I think this is the closest I can get to a solution, even though I technically have no idea why it’s happening.

Pac · July 26, 2021, 9:44pm

The potential problem I see here is that this speed is only reachable when transferring big chunks of data. Storj pieces are millions of very small files, so I highly doubt the rsync process can use all of your bandwidth.

When possible, that’s actually a great idea! That would run the thing at maximum speed for sure.

Oh wow, really? That seems like a lot for good disks like Exos.
Weird… Something seems off indeed.
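
For a sense of scale, the figures quoted in this thread can be put side by side. The sketch below assumes decimal units (1 TB = 10^12 bytes) and uses the 4TB size, the 936 Mbits/sec iperf result, and the roughly 2TB per day mentioned earlier. The variable names are made up for illustration.

```shell
# Back-of-the-envelope comparison using numbers from the thread.
# Assumption: decimal units, integer arithmetic (results are rounded down).
data_bytes=$((4 * 1000000000000))          # ~4 TB to migrate
link_bits_per_s=$((936 * 1000000))         # iperf result

# Time if rsync could saturate the link
wire_seconds=$((data_bytes * 8 / link_bits_per_s))
echo "at line rate: ${wire_seconds} s (~$((wire_seconds / 3600)) h)"

# Time at the ~2 TB/day seen with millions of small files
small_file_bytes_per_s=$((2 * 1000000000000 / 86400))
small_file_hours=$((data_bytes / small_file_bytes_per_s / 3600))
echo "at 2 TB/day: ~$((small_file_bytes_per_s / 1000000)) MB/s, ${small_file_hours} h"
```

That works out to about 9 hours at line rate versus 48 hours at 2TB per day (about 23 MB/s), which fits the point that the small files, not the network, set the pace.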

jammerdan · July 27, 2021, 4:16am

There is another thing you could try: bash - Speed up rsync with Simultaneous/Concurrent File Transfers? - Stack Overflow (or similar, search Google for parallel rsync).

Rsync cannot run parallel threads natively. So basically this tries to run several independent rsync instances in parallel, which would help to max out the existing bandwidth.
This could work, but the result must be checked very carefully to confirm that all source data has really been copied.
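
One way the idea could look, as a rough sketch rather than a tested migration tool: run one rsync per top-level subdirectory with `xargs -P`, then use a single dry run to list anything that is still missing. The function names, the default of 4 jobs, and the `RSYNC` override are assumptions. The `find` and `xargs` options used here are available in the GNU and BSD versions.

```shell
# Sketch only: paths, job count and the RSYNC override are assumptions.
RSYNC="${RSYNC:-rsync}"

# Run one rsync per immediate subdirectory of SRC, JOBS at a time.
# DST must already exist. Files directly inside SRC are NOT covered here.
parallel_rsync() {
    local src="${1%/}" dst="${2%/}" jobs="${3:-4}"
    ( cd "$src" && find . -mindepth 1 -maxdepth 1 -type d -print0 ) |
        xargs -0 -P "$jobs" -I{} "$RSYNC" -a "$src/{}/" "$dst/{}/"
}

# Dry run over the whole tree: prints what a plain rsync would still change,
# without copying anything. Any file listed here was missed or has changed.
verify_copy() {
    local src="${1%/}" dst="${2%/}"
    "$RSYNC" -a --dry-run --itemize-changes "$src/" "$dst/"
}
```

Running `verify_copy` after `parallel_rsync` is the "check very carefully" step: on a stopped node, any file entry in its output means the parallel run did not cover everything.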

BrightSilence · July 27, 2021, 6:32am

You could probably even use --ignore-existing, since blobs can’t change anyway. Just make sure you don’t have that option for the last run when your node is offline.
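
A minimal sketch of such an intermediate pass, with a made-up function name and an `RSYNC` override for dry runs. The reason to drop the flag at the end is that anything already present on the destination is skipped regardless of content, which would include a file left half-copied by an interrupted earlier run and the databases that do change.

```shell
# Sketch: intermediate pass while the node is still running.
# Assumption: SRC and DST are the node's data directories on old and new disk.
RSYNC="${RSYNC:-rsync}"

intermediate_pass() {
    # --ignore-existing: skip every file that already exists on the
    # destination, without comparing size or modification time.
    "$RSYNC" -aP --ignore-existing "${1%/}/" "${2%/}/"
}
```

The last pass, with the node stopped, would then be a plain rsync run without `--ignore-existing`.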

alfananuq · July 27, 2021, 11:06am

Thanks guys, I might try these methods later on. Right now I’m running with the --size-only flag after setting the node’s STORAGE to less than the amount of data stored. I’ll give that a day or two to see where I get. I’ll report back for the curious.

Toyoo · July 27, 2021, 3:42pm

My own migration script uses rsync -avAXE --inplace --partial --del, but I haven’t used it for quite some time, so please verify whether all the flags make sense.
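
For anyone checking those flags, here is the same command wrapped in a function with each option annotated from the rsync manual. The function name and the `RSYNC` override are made up for illustration, and the note on `-E` applies to rsync 3.x; older or vendor-patched builds may define it differently.

```shell
# Sketch: the flag set quoted above, annotated.
RSYNC="${RSYNC:-rsync}"

annotated_sync() {
    # -a         archive mode: recurse, keep permissions, times, owners, symlinks
    # -v         verbose output
    # -A         preserve ACLs
    # -X         preserve extended attributes
    # -E         preserve executability (no extra effect once -a keeps permissions)
    # --inplace  write changes straight into the destination file, no temp copy
    # --partial  keep partially transferred files if the run is interrupted
    # --del      delete files missing from the source, during the transfer
    "$RSYNC" -avAXE --inplace --partial --del "${1%/}/" "${2%/}/"
}
```

Because of `--del`, this form deletes destination files that no longer exist on the source on every run, not only on the final one.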

alfananuq · August 27, 2021, 10:17am

Well, the migration is finally completed. I still have no idea why one of the nodes was taking so long. Best guess is that the source disk or its interface had an issue I couldn’t catch. Besides that, everything went without a hitch.

Thanks everyone for all your suggestions and support!

jammerdan · July 2, 2022, 6:41am

Is it really required to remove that on the last run?
What I have found is that this option

forces rsync to skip any files which exist on the destination and have a modified time that is newer than the source file. (If an existing destination file has a modification time equal to the source file’s, it will be updated if the sizes are different.)

Wouldn’t different modification time or size be enough to ensure that only the current files exist in the destination?

Alexey · July 2, 2022, 7:44am

It is much more important to run the last rsync with the --delete option while the source node is stopped, to remove temporary database files. Otherwise the databases in the destination would likely be corrupted.
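
A minimal sketch of that final step, assuming the container is named `storagenode` and using the `-aP` flags mentioned at the start of the thread. The function name and the `DOCKER` and `RSYNC` overrides are illustrative only.

```shell
# Sketch: final pass with the source node stopped.
# Assumptions: container name "storagenode"; SRC and DST are the data directories.
DOCKER="${DOCKER:-docker}"
RSYNC="${RSYNC:-rsync}"

final_pass() {
    local src="${1%/}" dst="${2%/}"
    # Stop the node first so nothing changes during the last sync.
    "$DOCKER" stop -t 300 storagenode || return 1
    # --delete removes destination files that no longer exist on the source,
    # including leftover temporary database files from earlier passes.
    "$RSYNC" -aP --delete "$src/" "$dst/"
}
```

Only after this pass completes without errors should the node be started against the new location.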