I’m curious if anyone has any experience or advice on the following.
As advised in a previous post, I’m now trying to phase out a 2.5" USB-only SMR disk and migrate its node to a 3.5" CMR disk in a 2-disk enclosure with external power and a USB 3 connection.
The 2.5" disk is at 75% capacity which is roughly 3.5 TB.
I’m using rsync in a detached screen session to handle the data migration, as advised in the migration manual. But now I’m stuck with the following scenario:
Rsync took around a month to complete the original 3.1 TB. Super long.
Meanwhile data is still coming in and I have to rsync a few times.
I’m now in my 3rd or 4th rsync “session” and the duration doesn’t seem to decline. (I’m now counting the days to keep track of this)
I’m afraid that in the final stage, where I have to stop the node and run a final rsync --delete job, my node will be offline for days before all the data has been processed properly. It seems that whatever the change, rsync has a lot of trouble with large amounts of data. And transferring through USB can’t help either, let alone doing this on an SMR disk that’s also an active node.
Is there some way to improve this process? Or should I maybe consider a different approach? Thanks!
I hope you accidentally swapped the terms in your post, because transferring to SMR is a really bad idea. Nodes really don’t do well on SMR HDDs.
Rsync isn’t the fastest approach, but it’s by far the safest. There is another option, which is to lower the allocated space of your node to way below the size already used. This stops the node from getting new data. Then use a normal copy to copy the blobs folder. Once that is done, you stop the node and copy all the other data to the new location. That shouldn’t be that much. Then you start the node with the new data location.
I’ve done this, but I also almost messed it up. I don’t recommend it unless you know what you’re doing and I also don’t want to give a more detailed step by step guide. If this isn’t enough for you to figure out how to do it, you shouldn’t be doing it at all.
It’s much safer to just take the down time of a few days and use rsync. How many days are you down to now with the current runs? Keep in mind that it will be a little faster in the last run since there won’t be a node running on the same HDD during that time.
Indeed mixed up SMR/CMR. Is now corrected. I don’t want the SMR drive anymore, so I’m transferring it to a CMR drive.
Thanks, I won’t do the copy approach. I don’t feel comfy with that. Good advice.
I just started counting with the last one, so I don’t know yet how long this 3rd round will take. I felt the last one took around 1 or 2 weeks. I check in now and then to see if the task is still pending in the nmon top-processes list.
Super long indeed, that feels abnormally long for such a transfer!
I mean even though the source drive is an SMR one, it is still supposed to handle reads normally, I’m really surprised it took so long.
It’s probably a good idea to follow @BrightSilence’s recommendation and try to sync up the thing while the node is off. Taking a node off-grid for a few days isn’t a big issue, but be aware it will make you lose some data (as some pieces will get migrated from your node to some others while yours is offline).
Just make sure it doesn’t stay offline for more than 30 days or it will get disqualified.
It’s still using the same read/write head to write the data the node is getting, shaking around like crazy to shuffle those shingled tracks around. And every once in a while, the head gets some time to do some reading.
Come to think of it, you may have a much better time with the rsync method if you lower your node’s allocation below what’s already stored as well, to prevent most writes of new data. It can take a while before you see this benefit though. Chances are the CMR cache on the SMR HDD is filled to the brim and it has to write that to SMR areas as well. This alone can take hours and up to days depending on the size of the CMR cache.
Given that the drive is closer to full than empty, yet isn’t that big, cloning the whole partition could turn out to be short enough for the downtime not to have much of an impact. Though, this is a riskier approach, and as you say the target drive already has another node, so it’s probably not viable anymore.
For some reason I keep mixing up CMR/SMR. The target drive is a CMR. The current node is on an SMR. That’s the whole purpose of my effort. I’m keeping a disk-per-node policy for myself, to keep things manageable as I’m still learning as a node operator. Also this seems to be the preferred way in the Storj forums.
Your approach might be worth it. But I want to see if the rsync job I’m doing now is decreasing in time. Which I just started monitoring properly.
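To see whether successive passes really get shorter, each pass can be wrapped in a small timer that appends its duration to a log. This is only a minimal sketch: the function name timed_pass, the log location and the paths in the comment are placeholders made up for illustration, and it assumes a date that supports +%s (GNU and BSD both do).

```shell
#!/bin/sh
# Append the duration of every pass to a log so passes can be compared.
# LOG is a placeholder location; point it somewhere permanent for real use.
LOG="${LOG:-${TMPDIR:-/tmp}/rsync-pass-times.log}"

timed_pass() {
    start=$(date +%s)
    "$@"                      # run the wrapped command unchanged
    rc=$?
    end=$(date +%s)
    echo "$(date '+%Y-%m-%d %H:%M') exit=$rc seconds=$((end - start)) cmd=$*" >> "$LOG"
    return "$rc"
}

# Real use would look something like this (hypothetical paths):
#   timed_pass rsync -aP /mnt/old/storagenode/ /mnt/new/storagenode/
# Harmless stand-in so the sketch runs anywhere:
timed_pass sleep 1
tail -n 1 "$LOG"
```

Comparing the seconds= values of the last few lines in the log shows at a glance whether the passes are converging.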
from man cp:
-R, -r, --recursive copy directories recursively
-a, --archive same as -dR --preserve=all
-v, --verbose explain what is being done
-u, --update copy only when the SOURCE file is newer than the destination file or when the destination file is missing
from man rsync:
-a, --archive archive mode; equals -rlptgoD (no -H,-A,-X)
-v, --verbose increase verbosity
-h, --human-readable output numbers in a human-readable format
-P same as --partial --progress
--progress show progress during transfer
--partial keep partially transferred files
The benefit of using cp instead of rsync for the initial transfer is faster processing of files (the cp command does not collect information regarding hashes of files and will transfer a file as a whole piece if it’s updated or doesn’t exist on the destination), but you may mess it up if you run it a second time or later, because cp -r may create another copy on the destination instead of updating the existing one, depending on how you provided the source and destination paths.
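The path pitfall with a repeated cp -r is easy to reproduce on throwaway directories. The sketch below uses made-up directory and file names under a temporary folder, nothing from a real node.

```shell
#!/bin/sh
# Throwaway demo of how a repeated `cp -r` can nest a second copy.
demo=$(mktemp -d)
mkdir -p "$demo/src/blobs"
echo "piece" > "$demo/src/blobs/piece1"

# First run: dst does not exist yet, so src is copied AS dst.
cp -r "$demo/src" "$demo/dst"

# Second run, same arguments: dst now exists, so src is copied INTO it,
# which leaves a nested dst/src/... next to the first copy.
cp -r "$demo/src" "$demo/dst"
find "$demo/dst" -type f | sort

# Copying "src/." puts the contents of src into an existing directory,
# so running it again overwrites files instead of nesting a new copy.
mkdir "$demo/dst2"
cp -r "$demo/src/." "$demo/dst2/"
cp -r "$demo/src/." "$demo/dst2/"
find "$demo/dst2" -type f | sort
```

The first listing contains the file twice (once directly and once under a nested src directory), the second only once.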
The last one, the final rsync pass with --delete, must be done anyway when the source node is stopped to remove the wal and shm files for your databases, otherwise they may lead to database corruption on start.
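What --delete adds in that final pass can be checked on throwaway directories. This is a sketch with made-up file names, and it assumes, as the post implies, that the wal and shm files are no longer present on the source once the node has stopped. It skips itself if rsync is not installed.

```shell
#!/bin/sh
# Throwaway demo: a final `rsync -a --delete` pass removes destination
# files that have disappeared from the source. All names are made up.
demo=$(mktemp -d)
mkdir -p "$demo/src" "$demo/dst"
echo "data" > "$demo/src/example.db"
echo "wal"  > "$demo/src/example.db-wal"
echo "shm"  > "$demo/src/example.db-shm"

if command -v rsync >/dev/null 2>&1; then
    # Pass made while the node is still running: wal/shm get copied too.
    rsync -a "$demo/src/" "$demo/dst/"

    # Simulate the stopped node: wal/shm are gone from the source.
    rm "$demo/src/example.db-wal" "$demo/src/example.db-shm"

    # Without --delete the stale copies stay on the destination.
    rsync -a "$demo/src/" "$demo/dst/"
    ls "$demo/dst"

    # With --delete the destination is made to match the source.
    rsync -a --delete "$demo/src/" "$demo/dst/"
    ls "$demo/dst"
else
    echo "rsync is not installed, skipping the demo"
fi
```

The trailing slash on the source path matters here: it tells rsync to sync the contents of the directory rather than the directory itself.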
Eventually I got down to 4 days of rsync time and from there took the leap: stopped the node and ran the final rsync --delete command. 38 hours later, the migration was complete. Without the SMR disk also being an active node, I felt it went a lot quicker. Still a long time, but acceptable considering what was at stake (3.5 TB!)