Storj Node expansion

It is a very interesting idea I hadn’t thought of. I am currently moving some nodes and it is a pain with rsync. The problem is indeed that it takes ages to copy a large node, and once it is finally done it has received so much new data and so many deletes that the process basically starts from scratch. It is faster if you can shut down your node, but since repair is assumed to kick in after only 4 hours of downtime, that is not a great option either.

Two things are absolutely terrible with rsync: it is not multithreaded, and it cannot detect moved files on the destination. When the node deletes a piece it is a simple move into the trash folder, but rsync deletes the data from the destination folder and re-transfers the same file into the trash folder. Rsync does not know that the data has only been moved and cannot mimic that on the destination. Rsync has some good features, like the delta compare, but that does not help much here, because a piece never changes once it is written on the node.
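The single-threaded part can at least be worked around by running several rsync processes in parallel, one per satellite folder under blobs. A rough sketch, assuming placeholder paths /mnt/old and /mnt/new:

# One rsync per satellite subfolder of blobs, 4 processes at a time.
ls /mnt/old/storage/blobs | xargs -n1 -P4 -I{} \
  rsync -a --delete /mnt/old/storage/blobs/{}/ /mnt/new/storage/blobs/{}/

This does not fix the moved-files problem, but it helps hide the per-file latency during the bulk copy.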

Honestly I came to the conclusion that the node software itself needs 2 things:

  1. A maintenance mode for such operations, in which you don’t get penalized if you shut your node down or it gets slower.
  2. But more importantly, a built-in function to move the node data. I have not thought it through yet, but Microsoft has a very convenient process to move a running (!!!) virtual machine’s disk to another drive without interruption. I think they keep reading from the old location until the process is finished, but already defer writes to the new place. Something like that would be brilliant for moving a node, as it would stop the source data from constantly changing.

If size is the only problem, it should have less to copy with each run, and comparing the files should be pretty fast. My scenario was different in that I knew the comparing itself was what would take a long time, and that wouldn’t get any faster with subsequent runs. I’m now deleting the old data and I expect that will probably also take days. Considering that this node was only 1.76 TB, that should say something about how slow it is.

Yeah, this would be a great option to have for sure. There is no reason why it shouldn’t be possible; the node would just have to check 2 locations for each file for a while. But I guess that would impact quite a few subsystems: the file walker, garbage collection, and any downloads at the least. Also, I don’t think you would ever be able to copy the DBs while the node is online. So yeah, the implementation might take quite a lot of effort, and I guess there are ways around it. I do kind of like the idea of just lowering the assigned space and copying blobs only, then stopping the node and copying the rest. But rsync might still be best for most scenarios.

In theory. But in reality this did not work very well, and I had to shut the node down to make the transfer finish at all. As said, I had not thought of reducing the capacity below the space actually occupied, which might have helped. But while the node was running and constantly changing data, there was no way.

Actually there is a way, if you use LVM. You can add a target disk to the same VG, mark the source PV as not allocatable with pvchange -xn /dev/sdS (where /dev/sdS is the source disk) and move the PV to the other disk online with pvmove /dev/sdS.
There is a risk: if one of these disks dies during the move, you will likely lose your data partially or entirely (because it is temporarily a RAID0 with zero redundancy).
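Put together, the sequence could look like this (assuming a volume group vg0, source disk /dev/sdS and target disk /dev/sdD as placeholder names):

pvcreate /dev/sdD        # prepare the target disk as a physical volume
vgextend vg0 /dev/sdD    # add it to the existing volume group
pvchange -xn /dev/sdS    # forbid new allocations on the source PV
pvmove /dev/sdS          # migrate all extents off the source, while online
vgreduce vg0 /dev/sdS    # remove the emptied source disk from the VG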


Very interesting. This really sounds like it could move a node from one physical disk to another while running.


The other way with LVM is to convert the LV to RAID1 after adding a disk, then convert the RAID1 back to a linear volume once the sync is done, excluding the source disk.

lvm> pvcreate /dev/sdD
lvm>
lvm> vgextend vg0 /dev/sdD
lvm>
lvm> lvconvert --type raid1 vg0/lvol0

wait until the copy reaches 100%:

lvm> lvs
  LV    VG  Attr       LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lvol0 vg0 rwi-aor--- 1.99g                                    100.00
lvm>
lvm> lvconvert --type linear vg0/lvol0 /dev/sdS
lvm>
lvm> vgreduce vg0 /dev/sdS

This way it may be more robust than the RAID0-like setup with a read-only source.


So now I have a figure: I moved a complete node from one disk to another, sadly only over USB 2.0. I started the transfer on Sunday morning and it has just finished on Wednesday morning, so 3 days altogether for a 2.5 TB node.
The node was shut down. Now imagine how much longer this would take with a running node that constantly keeps writing, reading, moving and deleting data. And if you do a remote transfer, like from a local disk to a remote location via DSL, it will take forever.
What I think would help for a running node (probably not a complete list):

  1. New writes should happen to the new location
  2. Deletes should go to both locations
  3. Data that has been moved on the source should be moved on the destination as well (not deleted and re-transferred like rsync does)

Yeah, but you need to already be using it, right? Or is there a way to convert a live partition to LVM?

That really doesn’t sound great either. USB 2.0 certainly didn’t help. I think a big reason for this being so slow is the number of small files. Those work well for operating the node, but they make moving it a real pain. I’m actually still deleting the old files in my setup 3 days later and I think I’m about halfway done. Yours wasn’t nearly as bad, but 3 days is still not great.

Maybe lowering the node capacity and just doing a dumb copy of the blobs folder while running should be offered as an alternative in the docs? @Alexey how do you feel about that? Or is that a bad idea to suggest to node operators?

One thing I wasn’t entirely sure about with that alternative is what would happen to files that end up both in blobs and in trash, which could happen if a file was moved to trash after it had already been copied from blobs. I figured worst case this would lead to errors during GC as long as the trash hasn’t been cleaned up. Doesn’t seem like that big a deal, but it should probably be looked into before suggesting it to others.

GitHub - g2p/blocks: Enable bcache or LVM on existing block devices

Never tried it, though.
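If I read its README correctly, the in-place conversion should be a single command, though this is untested here and the device name is just a placeholder; definitely take a backup first:

# convert an existing partition with its filesystem into an LVM logical volume in place
blocks to-lvm /dev/sdX1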

1 Like

True. But even a USB 2.0 disk should deliver maybe 30 MB/s or so. That would make the transfer take around a day instead of the 3 days it took.

Yes, rsync is known to be slow with many small files because of the per-file overhead and all those directory lookups. What might help, and should be possible to script, is to pack those folders into batches, transfer them as tar files and unpack them at the destination.
Maybe only for the first transfer, which sounds trivial to do: find all files, create a file list, pack the files into batches according to the list, transfer, unpack. After the initial transfer you would then do a regular rsync to pick up all the changes that have occurred in the meantime.
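For the initial bulk transfer you may not even need intermediate tar files; a tar stream piped over SSH avoids the per-file rsync overhead entirely. A minimal sketch, with placeholder paths and host name:

# Stream the blobs folder as one tar archive over SSH and unpack it on the fly.
tar -C /mnt/storagenode/storage -cf - blobs | \
  ssh remote-host 'tar -C /mnt/new/storage -xf -'

# A regular rsync pass afterwards picks up whatever changed during the transfer.
rsync -a --delete /mnt/storagenode/storage/blobs/ remote-host:/mnt/new/storage/blobs/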

Depends. See

It’s not bad, but it requires care. With rsync the worst case is corrupted databases, if the operator forgets to do the last rsync with the --delete option.
With the suggested method there is a real chance of losing data, especially on Windows, where you cannot copy an opened file with the usual Explorer or even the copy command.
So you need to use rsync/robocopy at the end anyway. But since the copy command changes the modification date, rsync will likely copy everything over again, so you save nothing in time.

Maybe we can write a guide for the suggested method, extended with the cp options needed to preserve modification times, and do the last sync with rsync --delete.
I have a feeling the resulting time would not change, because rsync seems to spend most of it calculating the differences, not copying. This is especially true for NTFS under Linux (which is known to be incredibly slow).
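For reference, on Linux that combination could look roughly like this (placeholder paths; -a preserves timestamps, ownership and permissions):

# Bulk copy while the node is still running, keeping modification times intact.
cp -a /mnt/old/storage/blobs /mnt/new/storage/

# Final pass after stopping the node; --delete removes pieces that are gone on the source.
rsync -a --delete /mnt/old/storage/ /mnt/new/storage/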

Thanks for the link, I had actually skimmed that post but didn’t look at the link. You weren’t kidding about that being a very precise task. Definitely wouldn’t be comfortable putting that in a suggestion for most end users. I don’t think I would be very comfortable going through those steps to be honest.

Are you sure about that? I’ve definitely copied opened files before. You can’t move them, but copy should be fine. I just copied one now without problems.

If you’re not comfortable suggesting this option without the final rsync, then there is really no point in suggesting it at all. That rsync is exactly what I was trying to avoid, as I knew it would take weeks. In fact, the system is still working on deleting the old data. Now, my case was certainly a perfect storm of worst-case scenarios, but I’m sure there will be others with something similar, if not as extreme.

So yeah, what I did was:

  1. Lower the node capacity to way below what is currently in use.
  2. Restart the node
  3. Wait for the file walker to be done (optional, but I didn’t want to put extra load on the disk while the file walker was running)
  4. Copy the entire blobs folder only to the new location (this will take loooong)
  5. Stop the node
  6. Copy everything else to the new location
  7. Point the node to the new path, either in the config or in the run command
  8. Rename the old path to ensure the node is not still using some old data
  9. Start the node and monitor the logs closely

This way there is no need to do rsync at all. The downside is that blobs will contain a bit of data that has since been deleted, but garbage collection will take care of that eventually.
The node is only down for the copy in step 6, which shouldn’t take too long (depending on trash and DB size), so the downtime is only minutes.
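A rough shell sketch of steps 4–8 for a Docker-based node, with placeholder paths and container name (not from the original post):

# step 4: copy blobs while the node keeps running (this is the long part)
cp -a /mnt/old/storage/blobs /mnt/new/storage/
# step 5: stop the node gracefully
docker stop -t 300 storagenode
# step 6: copy everything except blobs (blobs were already copied in step 4)
find /mnt/old/storage -mindepth 1 -maxdepth 1 ! -name blobs -exec cp -a {} /mnt/new/storage/ \;
# steps 7-8: point the run command or config at /mnt/new/storage and rename the old path
mv /mnt/old/storage /mnt/old/storage.migrated
# step 9: start the node with the updated path and watch the logs
docker logs -f storagenode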


You might be able to do something like this with a FUSE UnionFS filesystem, with olddir and newdir as the underlying directories and mergedir as the FUSE mount. Probably have to hack it up a little to get the desired behavior.
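For example, mergerfs (one of the FUSE union filesystems) can overlay two branches and direct all new writes to the first one. A rough sketch with placeholder paths; note that deletes and moves of pieces that only exist on the read-only old branch would still need extra handling, which is the "hack it up a little" part:

# Overlay newdir (read-write, receives all new files) on top of olddir (read-only).
mergerfs -o category.create=ff,cache.files=off /mnt/newdir=RW:/mnt/olddir=RO /mnt/mergedir
# Point the node's storage path at /mnt/mergedir, migrate olddir's contents to newdir
# in the background, and drop the old branch once it is empty.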