ZFS + pfSense setup for multiple nodes

Sounds like a nice setup! Any chance the router running in a VM could be causing any issues? I run pfSense on a full 1U server, but I’ve never tried it as a VM. I’ve heard both good and bad, so I just decided to avoid it.

Until just recently I also ran nodes on zpools for years, but as the data grew I started running into IO limitations. I considered SSD caches, which I figured would help, but only to a point. Long term it just seemed like a bad idea, so I decided to separate the nodes onto their own dedicated spinners. Unfortunately though, and this was one of my main concerns for the future, it took ~3-4 months to transfer all the data off 2 zpools due to the pool IO limitations. NOT fun, lol. Actually the last nodes will be done in about an hour… finally. The problem with the caches is that they only store the more frequently accessed data, so as soon as you have to serve up a bunch of the rest, like during a GE or when moving a node, that cache isn’t doing anything for you.

Precisely why I didn’t want to bother with it. Sure it would help, but long term, the more data hosted the faster they get chewed through. Figured if the nodes can run fine without the extra expense, why not. Plus, this way there’s also more room for spinners not taken up by cache drives. Figure the redundancy factor isn’t worth it for the occasional drive lost. I still scrub them though, so if I start getting errors I’ll just attempt a drive replacement before the disk fails.
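
That scrub-and-replace routine is just the standard ZFS commands; a rough sketch below, with the pool and disk names made up for the example:

zpool scrub tank                # read and checksum every block in the pool
zpool status -v tank            # watch the READ / WRITE / CKSUM error counters per disk
zpool replace tank sdX sdY      # if a disk starts racking up errors, resilver onto a fresh one
zpool status tank               # shows resilver progress until it completes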

Never tried it this way. I was worried that, with the pool already struggling with IO limitations, it would still take quite some time, and I didn’t want to have the nodes down that long. All worked out using rsync though, just took for ******* ever.
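
For anyone curious, the usual multi-pass rsync migration looks roughly like this (the paths and Docker container name are just placeholders for the example):

rsync -aH --info=progress2 /oldpool/node01/ /newdisk/node01/    # first pass while the node keeps running
rsync -aH --delete /oldpool/node01/ /newdisk/node01/            # repeat until a pass finishes quickly
docker stop -t 300 storagenode                                  # then stop the node
rsync -aH --delete /oldpool/node01/ /newdisk/node01/            # final pass on data that's no longer changing
# then point the node config at the new path and start it back up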

Do you happen to use storjnet.info to monitor your nodes, along with whatever else you use? When I started hitting IO limitations, which weren’t always very obvious and not necessarily related to the pool itself (not all nodes on the same zpool are overloaded at once), the history there looked like a barcode of red and green. I also use Uptime Kuma on a VPS, but it doesn’t show the history the same way, so storjnet has been a good indicator of that. Sometimes I would see nodes drop off the Grafana dashboard if they weren’t responding for long enough, but not always. Storjnet.info seems to be more sensitive to slowly responding nodes.

zfs send | zfs recv
is magic… so fast… then ofc i do the last few passes with rsync, until they finish in like 10 minutes, and then shut down the node and do a final pass with rsync before spinning it up in the new location.

zfs send | zfs recv takes a bit to get used to, but it works great… basically it only transfers snapshots, so nothing really live… that’s why i have to use rsync for the last bit, to avoid downtime.
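
a minimal sketch of that flow, assuming a dataset tank/node01 being sent to newpool/node01 (names are just examples):

zfs snapshot tank/node01@migrate1
zfs send tank/node01@migrate1 | zfs recv newpool/node01                    # bulk copy while the node keeps running
zfs snapshot tank/node01@migrate2
zfs send -i @migrate1 tank/node01@migrate2 | zfs recv -F newpool/node01    # incremental, only the new deltas
# repeat the incremental step until it's quick, then stop the node and catch up the very
# last bit with rsync (or one final incremental send) before starting it on the new pool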

when i have issues i do ponder spinning up more monitoring, but it’s been very rare these last couple of years, so i haven’t gotten around to it…
got a few other people using the internet connection, which until recently ran on the same server storj is on, so if the server had issues i would usually get complaints :smiley:

but no real monitoring aside from the multinode dashboard, which can be very bad for spotting / updating when nodes are offline…

almost did set up uptime kuma on one of my vps’s, but everything has been really stable for a long time… and one of my tactics with running raidz1 is that i get redundancy and all the zfs checksum stuff… which makes things super stable, since there’s no random corruption of data.

the special vdevs i would get rid of… but when the zfs pool looks like this:

zpool list bitpool
NAME      SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
bitpool   347T   307T  39.3T        -         -    48%    88%  1.00x    ONLINE  -

making a new pool to migrate all the data to is… slightly troublesome lol
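
if you want to see how much is actually sitting on the special vdevs before deciding, the per-vdev breakdown is just:

zpool list -v bitpool     # SIZE / ALLOC / FREE per vdev, including the special mirror(s)
zpool status bitpool      # confirms which devices make up the special vdev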

I was under the impression it could mess things up if I tried it that way while the dataset was still being modified. Guess I was wrong, and maybe this would have been a much better way, lol. At least I’ll know for next time. I didn’t exactly look into it THAT much since I’m more familiar with rsync anyway and just decided to use that. I’ve only used send/receive on datasets that weren’t currently being modified.

I used the multinode dashboard… for about 10 minutes, haha. Now I use my Grafana dashboard, which is on a different server but in the same location, Uptime Kuma, which runs on a VPS and notifies me through multiple services, and storjnet.info. The only reason I use storjnet is the one I stated previously. Uptime Kuma will show latency, but not all on one page in a quick, simple way to see if nodes are laggy (usually from disk IO in my case, which has now been resolved). All together though, they tend to narrow things down pretty quickly. The only other thing I plan to do is set up the log exporter. I don’t really see a need for it, it’s more just for the dashboard p0rn.

And yes, that might take a while, haha. Although it looks like you’re about at capacity there, bud! I’m guessing you’re going to need a whole other server at this point. Speaking of which, you don’t think being at 88 mph… I mean %… might cause any issues there on a zpool?

well you take a snapshot and send that, so the data you send is static, even tho the live dataset isn’t… zfs magic :smiley:

well since all of mine are on the same pool, they share problems when problems arise… which can be hell… but it does also allow a massive amount of resources for a single node; think i can restart and filewalk a single node in like 10 minutes.

i use netdata, been really fond of that… it’s very good for identifying and tracking issues with the server or storage. been meaning to set up the grafana dashboard… did try setting up proper logging using my own logging script, but that turned out to be a bad idea.
it sort of works, but not at scale… so it runs on 11 of my nodes… just haven’t bothered to turn it off or fix it because there haven’t been enough issues.

grafana stat p0rn would be nice tho, but the more monitoring one adds, the more resources are also used… tho i’m sure the system would handle it just fine… just haven’t really had the need… ofc i also have my proxmox graphs and such to help out.

80% is what people say is the limit; going much above that does cause fragmentation and the pool gets slower… generally HDDs, because of the disk geometry, go from 100% speed while empty to about 50% speed when full, since less disk passes under the head when reading or writing close to the center.

95% and ZFS starts to become really bitchy… also got 12TB in trash and GEs that are running, so space is continually being freed up, but yeah i’m a bit too close to capacity…
however with the new pricing proposal on the horizon i decided to GE rather than adding a whole other raidz1… 6 x 18TB disks add up if one has to buy them too quickly.

i do have room in my disk shelves for 3 or 4 more raidz1’s…
depends a bit on how my next reconfiguration of the setup goes…
need more PCIe slots for SSDs, and two are already used for HBAs
having 60 x 3.5" bays requires a bit of bandwidth :smiley:

don’t go over 95% capacity… it’s a tough spot to get out of if you can’t just add another raidz vdev
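
for the record, the two ways out are roughly: grow the pool, or fence off some headroom so it can never fill completely. a rough sketch, with the device names and sizes made up:

zpool add bitpool raidz1 sda sdb sdc sdd sde sdf     # grow: bolt another 6-disk raidz1 vdev onto the pool
# or, if you can't add disks, park a reservation on an empty dataset
zfs create bitpool/headroom                          # never written to
zfs set reservation=10T bitpool/headroom             # other datasets hit "out of space" before the pool itself fills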

Lol… another reason I decided to ditch pools. I’ve been running 4 TB disks in a 36-bay server. I already had this setup before I started with Storj, so I created 2 pools from what I wasn’t currently using and went with it. If I’m going to upgrade the disks, there’s no point in going to 6, or 8… or 10. Might as well just do 18s, but the up-front cost is stupid for all that space I don’t really need yet, and the resilver time to replace that many drives sucks. So, ditch the pools and buy fewer drives for now, and expand as needed. And yeah, with the payout cuts, we’ll see what happens. I don’t imagine it’ll be so low that it won’t be worth doing though, or nobody will do it and Storj would tank. So it sucks, but I’m not too concerned.