ZFS: moving a storagenode to another pool, fast and with almost no downtime

The universal way to move a node is described in the documentation: it involves multiple rsync passes, one or more to transfer the bulk of the data while the node keeps running, and then a final pass to catch up to the current state with the node stopped. This is an inherently long, IO-heavy process. However, if both the source and the destination are hosted on ZFS (including on different machines), the move can be sped up significantly by avoiding rsync and using ZFS snapshots with send/receive instead.

This is a short walkthrough, kept for reference for future users.

Assumptions:

  1. The storagenode data is located on the pool1/storagenode-old dataset, and the databases are on the child dataset pool1/storagenode-old/databases
  2. The OS is FreeBSD and the node runs in an iocage jail named storagenode-jail
  3. We are moving the node to a new dataset, pool2/storagenode-new, on another pool on the same machine. If the target pool is on another machine, zfs send/receive should be piped over ssh (a sketch follows below), and the jail should be exported and imported on the target machine.
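
For reference, if the destination is a different machine, the same streams can be piped over ssh. A minimal sketch, assuming the remote host is reachable as user@newhost and already has the target pool imported (both names are placeholders):
    zfs send -Rv pool1/storagenode-old@snap1 | ssh user@newhost zfs receive pool2/storagenode-new
    zfs send -Rvi pool1/storagenode-old@snap1 pool1/storagenode-old@snap2 | ssh user@newhost zfs receive pool2/storagenode-new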

The process:

  1. Without stopping the node, create a recursive snapshot of the source node dataset. This happens instantly:
    zfs snapshot -r pool1/storagenode-old@snap1
    
  2. Send the snapshot over to the target filesystem. This takes a huge amount of time (days), but the node keeps running, so there is no hurry here:
    zfs send -Rv pool1/storagenode-old@snap1 | zfs receive pool2/storagenode-new
    
    Here -R sends a full recursive replication stream, and -v is verbose, to see progress. If you are running this remotely, it’s a good idea to use tmux, so you can disconnect the remote session without interrupting the transfer. The target dataset should not exist prior to running the command; it will be created.
    Now enjoy life until the process is done.
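    If you have not used tmux before, a minimal workflow could look like this (the session name is arbitrary):
    tmux new -s zfs-move
    # ...run the zfs send | zfs receive command above inside this session...
    # detach with Ctrl-b d; later, re-attach with:
    tmux attach -t zfs-move
    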
  3. While this was running, the node kept adding and removing data. So, let’s catch up with that too:
    zfs snapshot -r pool1/storagenode-old@snap2
    zfs send -Rvi pool1/storagenode-old@snap1 pool1/storagenode-old@snap2 | zfs receive pool2/storagenode-new
    
    Here we create another snapshot, and then send an incremental (-i) update to the target filesystem. It can take anywhere from a few minutes to an hour, depending on how much data has changed since the last snapshot.
  4. Now we can finally stop the node and transfer the small amount of changes accumulated since the previous snapshot, to fully catch up:
    iocage stop storagenode-jail
    
    zfs snapshot -r pool1/storagenode-old@snap3
    
    zfs send -Rvi pool1/storagenode-old@snap2 pool1/storagenode-old@snap3 | zfs receive pool2/storagenode-new
    
    This should take very little time, under a minute, since not much could have changed during the previous, short transfer.
  5. Now we have a coherent copy of the node data on the new dataset, but the jail mounts still point to the old one. Modify them:
    iocage fstab -e storagenode-jail
    
    This will open an editor with the jail’s fstab. Edit the mounts to point to the new dataset on the target pool, save, and exit the editor.
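    The entries are regular fstab-style nullfs mounts. The exact paths depend on your iocage layout and mountpoints, so the lines below are only an illustration of the kind of change to make (a hypothetical before/after, with tank as a placeholder for the pool holding the iocage datasets):
    # before:
    /mnt/pool1/storagenode-old  /mnt/tank/iocage/jails/storagenode-jail/root/mnt/storagenode  nullfs  rw  0  0
    # after:
    /mnt/pool2/storagenode-new  /mnt/tank/iocage/jails/storagenode-jail/root/mnt/storagenode  nullfs  rw  0  0
    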
  6. Now start the node:
    iocage start storagenode-jail
    
    At this point the node continues from the state it was last stopped in, now running from the new dataset. Total downtime was the minute or less spent in step 4 plus a few seconds in step 5.
  7. Confirm the node is up, and destroy the old dataset; it’s obsolete, out of date, and useless now:
    zfs destroy -rv -n pool1/storagenode-old
    
    Here -r stands for recursive, -v for verbose, and -n for dry run. Review the output, and once you are content with what’s about to happen, re-run the command without the -n.

That’s it.

Just another reason why you have to love ZFS :heart_eyes:

Yeah… but you need to tune it (and have a lot of RAM…).
I was almost sold on the features, but… you know, there is always a “but”… you really need much more RAM than with LVM and ext4. And speed-wise, well, it’s not a dream even with a cache… But (again), it can recover from bit rot (if it’s not a simple pool, of course).
And you cannot (effectively) add a drive to the pool and simply extend it (recalculating parity and redistributing the data, unlike LVM). I don’t consider RAID0 or a simple volume (striped, extended, JBOD, whatever…), because the node would die with a single disk failure.

I moved all my nodes to single-drive LVM volumes recently. One of the nodes I then moved to a different disk with vgextend/pvmove/vgreduce/lvextend. So much faster than even a single rsync pass, and the node can stay online throughout the whole move.
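
For anyone who wants to replicate this, the sequence looks roughly like the sketch below. The VG/LV names and device paths are placeholders, and the last step assumes ext4 on the logical volume:
    pvcreate /dev/sdY                               # initialize the new disk as a physical volume
    vgextend storj-vg /dev/sdY                      # add it to the volume group
    pvmove /dev/sdX /dev/sdY                        # migrate extents off the old disk, online
    vgreduce storj-vg /dev/sdX                      # drop the old disk from the volume group
    lvextend -l +100%FREE /dev/storj-vg/storj-lv    # grow the LV into the new free space
    resize2fs /dev/storj-vg/storj-lv                # grow the ext4 filesystem to match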

Keeping multiple nodes on a single drive is a bad idea, not only because it breaks the ToS, but also because they will interfere with each other, slowing each other down. And if that disk fails, they will all die together.

A node can stay online during rsync as well.
I have done it many times when moving to a bigger hard drive.

Using LVM, there are no repeated rsync passes and the node stays online the whole time.
With rsync you are forced to shut down your node for a final pass with the --delete option. With LVM pvmove this is not needed, because the data is mirrored while being moved.

However, LVM will not help if you want to move to a remote device.

Well, all I can say is that you should keep your setup as simple as possible, otherwise it can cause many strange errors that god knows how to fix. I recommend the ext4 filesystem on a simple Linux distro like Debian, Ubuntu, or Alpine; for the disk setup, please go JBOD, no RAID, nothing. In my experience those fancy features will become hell some day…

One node per disk only, otherwise it’s no better than RAID0: with one disk failure the whole node is dead.

If this was a reply to my post, of course I kept one node per disk. Each disk holds one volume group with one logical volume. I did this now to make it easier to move my nodes to bigger disks in the future.

Sorry, I misunderstood you. I thought you placed them all on one physical volume, which is equivalent to placing them on one HDD.

This approach is not limited to ZFS: it can be done on any filesystem that supports snapshots and replication. The goal is to drastically reduce IO by avoiding copying file by file and instead streaming the entire filesystem, since hard drives are very good at sequential IO and absolutely horrible at random IO. (For example, you can do the same thing almost verbatim with btrfs; see the sketch below.)
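
For example, a rough btrfs equivalent of the walkthrough above (assuming the node data lives in its own subvolume; all paths are placeholders) could look like this:
    # read-only snapshot of the source subvolume (send requires read-only)
    btrfs subvolume snapshot -r /mnt/pool1/storagenode-old /mnt/pool1/storagenode-old.snap1
    # bulk transfer while the node keeps running
    btrfs send /mnt/pool1/storagenode-old.snap1 | btrfs receive /mnt/pool2/
    # later: take a second snapshot and send only the difference
    btrfs subvolume snapshot -r /mnt/pool1/storagenode-old /mnt/pool1/storagenode-old.snap2
    btrfs send -p /mnt/pool1/storagenode-old.snap1 /mnt/pool1/storagenode-old.snap2 | btrfs receive /mnt/pool2/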

Which brings us to the point @Alexey made:

While tuning is not really a problem or an obstacle (it works fine as-is, and can be further optimized for storj, just like any other filesystem, e.g. with sync=disabled), the RAM requirement is indeed there.
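
To illustrate, the kind of per-dataset tuning often suggested for a storagenode dataset is shown below; treat it as a sketch to evaluate for your own setup, not a prescription (sync=disabled in particular trades crash safety for lower write latency):
    zfs set atime=off pool2/storagenode-new        # skip access-time updates on every read
    zfs set recordsize=1M pool2/storagenode-new    # larger records suit the mostly-sequential blob files
    zfs set sync=disabled pool2/storagenode-new    # treat sync writes as async: faster, less safe on power loss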

However, in return we get scalability: something that is “nice to have” today but an unavoidable requirement in the future, if storj is to continue growing. See, today folks can get away with running a storage node on an Odroid with an attached SATA disk. Even without discussing filesystems, there is a ceiling on how much IO pressure the HDD can handle: under 200 IOPS, which at 4K per request is under 1 MB/s (200 × 4 KiB ≈ 0.8 MB/s)… Below that the node will barely be able to keep up and will likely lose races; above that, it will simply choke. People already complain “omygawd filewalker ate all my IOPS”.

So, in the future, when the traffic increases, these contraptions will become unsustainable. Adding multiple disks together into a load-balanced config helps with IO linearly at best.

This is one reason. Another: you still want to keep metadata in RAM on any filesystem, including ext4, to avoid unnecessary IO and extra latency to first byte. So the RAM requirement grows linearly with the amount of data you store, on any filesystem.

This is where other features of ZFS help a lot, such as the ARC and the special vdev. Everyone is more or less familiar with the ARC by now; but it is the special device that brings the biggest performance improvement for the storage node use case: it allows keeping all metadata, and all small files under a certain threshold, on separate (small) drives such as fast SSDs. Not only does this reduce the IO hitting the HDDs drastically, by orders of magnitude rather than just linearly, but you also no longer have to keep all metadata in RAM: reading from SSD is still light years faster than reading from HDD, much like reading from RAM is. So ZFS actually helps reduce the memory requirements in this case.
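
As a rough sketch (pool, dataset, and device names are placeholders), adding a mirrored special vdev and routing small blocks to it looks something like this:
    # add a special vdev made of two mirrored SSDs to the pool
    zpool add pool2 special mirror ada4 ada5
    # in addition to metadata, store blocks up to 64K on the SSDs (set per dataset)
    zfs set special_small_blocks=64K pool2/storagenode-new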

Ultimately, in my mind, there is no alternative going forward as storj grows. We’ll hopefully see more traffic, and therefore more IOPS, and therefore ZFS or similar filesystems will become either a requirement or, at the very least, a strong recommendation: they almost completely eliminate the impact of hosting a storagenode on hardware that is sized for other reasons and purposes.

On other points:

With a storage node, caching provides only limited help once all the metadata has been ingested, due to the random nature of the IO.

I don’t think this is important. You could use a ZFS pool with a 1-disk vdev and a 1-disk special device. If it rots, fine. The network can tolerate that; it was designed to.

If we are talking about a pool for storj, you definitely can.

First, once you accept that bit rot is not something to be feared with storj data, you can use single-disk vdevs.
And second, you can add new vdevs to the pool instantly, with no recalculation of checksums or parity required; incoming data will be load-balanced between all of them (those vdevs can be mirrors or raidz if you want to prevent rot). It’s much faster, better, and more scalable than traditional RAID; see the sketch below.
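
A minimal example (pool and device names are placeholders): adding another single-disk vdev is a one-liner, and new writes start being balanced across all vdevs immediately. zpool will ask for -f if the new vdev’s redundancy does not match the existing ones:
    zpool add pool2 ada6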

Understood. I would, if I had a separate pool for storj: the node dies, so what? I’ll start a new one; that’s the whole premise. But I don’t, because I don’t buy separate hardware for storj. I use what I have, and I have a pool for my own needs that is not yet fully utilized by those needs, so it is a fully redundant and performant array that the storagenode gets to benefit from. But I don’t have anything against non-redundant pools for a storagenode.

The key is to avoid copying file by file, so any solution that allows streaming the filesystem is a win in the context of this topic.

This works until the traffic picks up and exceeds your HDD’s max IOPS, at which point the node will choke. ZFS is as stable as anything can be, so I would not worry about the risk that comes with fanciness: the whole premise of the filesystem is infinite scalability, and like everything else that came from Sun Microsystems, it’s brilliant. (The last one is my personal opinion :))

If you mean using a simple ZFS pool without redundancy, it will increase the risk of losing all the data. But if you use one disk per node (no stripe or JBOD), then the risk is the same.

And a striped pool in ZFS will not start working faster after adding a disk (and striped volumes are usually used exactly to increase speed), because the existing data is not re-striped, so it will have the same IOPS as before, maybe less if the added disk has fewer IOPS.

With raidz, adding a single disk to the pool will immediately convert it to RAID0 without redundancy, see Add drive to ZFS pool - HDD's & SSD's - Level1Techs Forums.
The best you can do without re-creating the pool is to make it some kind of RAID10 (but that needs at least two disks): Adding disks to ZFS pool - Unix & Linux Stack Exchange

Yes: one pool per storage node, consisting of a one-disk data vdev plus a one-SSD special vdev for metadata.

By “disk” you mean “vdev” here. When you add a new vdev to a pool and write new data, the data is distributed across vdevs according to the remaining free space and performance; ZFS does the load balancing. You can force a redistribution/rebalance of existing data with ZFS send/receive if you really want to, but you don’t have to.

Pools in ZFS are not redundant; only vdevs are. So if you are adding a non-redundant vdev to a pool that so far contains only redundant vdevs, you will lose redundancy. It does not make sense to do this, and zpool add will refuse to do it unless you override it with the -f (force) flag.

However, it does not convert everything to “RAID0” either: your redundant vdevs still tolerate disk failures. Still, mixing vdevs of different fault tolerance levels is rarely justified.

What I mean is that extending a redundant array by adding only one HDD is not possible in ZFS without re-creating the pool or replacing all the disks with bigger ones.
This is possible with LVM or BTRFS, but I would not recommend using BTRFS for any important data, and LVM doesn’t have error correction, so it’s not great either.

So in short, there is no silver bullet that lets you simply add a disk to extend a volume without increasing either costs or risks.