ZFS discussions

They do tend to have sizeable caches as well as PMR regions on the disks so they can buffer a lot of writes efficiently. I’ve not seen enough testing to determine whether that is sufficient, but changes to files are the biggest issue. Any change to a file inflates the amount that needs to be written to the size of the entire shingled section. Adding new data to empty space has barely any downside, especially for relatively large contiguous writes like the 2.3MB pieces we mostly see on nodes. So as @Pentium100 mentioned, the random writes are likely to have an outsized impact on these drives.

The problem with this is that the node determines free space by looking at the available space in the storage folder, which at that point would be on the smaller SSDs. So you’d have to work around that as well then.

This is a good point, I hadn’t thought about that. However, I think people will quickly find out that the added latency isn’t a great idea either. Additionally, using network protocols and entirely separate devices no doubt adds even more points of failure.

Because unlike the databases, you can and should back up your identity. This is also mentioned in the setup instructions. It’s the only piece of data that can safely be recovered from backups.
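
For example, something as simple as this is enough (assuming the default Linux identity location; adjust the paths to wherever your identity actually lives):

$ cp -r ~/.local/share/storj/identity/storagenode /some/backup/location/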

I don’t. I’m probably missing something, but I don’t understand why. I store my identity on the same disk, because if the disk dies and I lose the blobs, I assume I have no use for the identity?

You are right that the identity is the only thing you can make backups of; it’s not possible to make backups of the databases and blobs.

In that case a backup won’t be as useful, unless you’re protecting against the unlucky event of that specific sector being damaged, or human error that removes or overwrites the identity. Many people have the identity on a system disk and the data on another disk, though. In those cases backups are more important, so you can recover the node if the system disk fails.

@SGC, following your experience I thought I’d try a recordsize=1M on my storj zpool dataset as well. I also changed compression from on (i.e. lz4 default) to gzip-6 (i.e. default gzip level). I’m just rsyncing one of my full nodes over to it and I’ve transferred 1TB out of 1.5TB so far. Just checking the compressratio and it’s a marked difference.
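
For anyone wanting to replicate it, the property changes on the destination were essentially just the following, set on the parent dataset in my case (compression=gzip without a level suffix is gzip-6):

$ zfs set recordsize=1M tank3/storj
$ zfs set compression=gzip tank3/storj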

Original pool: bulktank2
Destination pool: tank3

$ zfs get compression bulktank2/storj/storagenode1
NAME                          PROPERTY     VALUE     SOURCE
bulktank2/storj/storagenode1  compression  on        inherited from bulktank2

$ zfs get compressratio bulktank2/storj/storagenode1
NAME                          PROPERTY       VALUE  SOURCE
bulktank2/storj/storagenode1  compressratio  1.02x  -

$ zfs get compression tank3/storj/storagenode1
NAME                      PROPERTY     VALUE     SOURCE
tank3/storj/storagenode1  compression  gzip      inherited from tank3/storj

$ zfs get compressratio tank3/storj/storagenode1
NAME                      PROPERTY       VALUE  SOURCE
tank3/storj/storagenode1  compressratio  1.33x  -

I definitely didn’t expect any compression to be possible, given that the data is encrypted prior to distribution. I have 7 nodes in total on various zpools and I’ll definitely be shifting data around to take advantage of this. To properly test this, I’ll run tests where I change just one thing at a time to tease out the impact of recordsize vs. compression on the datasets.
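
Probably something along these lines, copying the same sample of node data into each test dataset (dataset names and paths are just placeholders, assuming default mountpoints):

$ zfs create -o recordsize=128K -o compression=lz4 tank3/test-128k-lz4
$ zfs create -o recordsize=1M -o compression=lz4 tank3/test-1m-lz4
$ zfs create -o recordsize=1M -o compression=gzip tank3/test-1m-gzip
$ rsync -a /path/to/sample-blobs/ /tank3/test-1m-lz4/    # repeat for each test dataset
$ zfs get compressratio,used,logicalused tank3/test-128k-lz4 tank3/test-1m-lz4 tank3/test-1m-gzip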

Thanks for the tip!

not sure why moving data between datasets is this slow, but it’s going in the right direction… not sure if i can rsync a live storagenode without causing issues.

there are some drawbacks with running larger record sizes tho, or so people say online.
the cache or ram allocation is based on record sizes also… thus one runs into some over-allocation of cache, which may slow down some IO related stuff

i tested gzip-9 and lz4 and for some reason i get the exact same compression ratio on both… just like you seem to with 1.33x, which is the same compression ratio that @kevink reported.
i’m however seeing 1.29x. i’m thinking that is down to my blocksize or ashift, which i think is 12 for me while many seem to have it at 13, but i couldn’t tell you, somebody in here most likely knows why i might be seeing that… a 4% improvement, i kinda want that… but for now i’m going with the 1.29x so i can have enough free space to do this again in a few days if need be…

copying a 3.4TB folder when one is running out of space is kinda rough, and it doesn’t help that it goes so slow that i have to account for an additional 500-700gb of added data before it’s done…
was kinda running out of room here… but this is the last chance to do it before the dataset grows too large on this vdev for me to do it in a simple way.

alas i digress…

gzip vs lz4 was no change for me in compression ratio, which is really odd… i’m guessing it has something to do with the data already being encrypted or compressed in advance, so both algorithms can only apply the same basic compression and no more…

so i would highly recommend staying with lz4 on 1m recordsizes. at least for me, gzip-9 is quite costly to compress and decompress, and i cannot see a reason to run it.

i tried to go higher in recordsize, but 1m is the max, in ZoL at least.

That is what I would have suggested to you and @fmoledina . LZ4 is a lot faster than gzip and apparently it doesn’t make a difference in compressratio.

There may be some hacks but I wouldn’t try them. Also, most data is in the range of 1-2MB, so 1M is already quite good.
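
If I remember correctly, the cap comes from the zfs_max_recordsize module parameter (1M by default on ZoL), which can in principle be raised, but again, I wouldn’t:

$ cat /sys/module/zfs/parameters/zfs_max_recordsize
1048576
# raise the cap to 16M -- untested here, at your own risk
$ echo 16777216 | sudo tee /sys/module/zfs/parameters/zfs_max_recordsize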

Oh well, if this is the case, I can live with it. My ssd cache is 2x20GB, only the RAM is a little “short” with 16GB overall but most of the time I still have 4GB to spare.

Yeah I’m just copying my 2.7TB node onto a zfs drive while it is live… it’s going to take at least 34 hours to finish… And then I have to sync it again for probably 8 hours, then take the node offline and sync again for 2 hours xD That process is a real pain… (And actually I have to sync it back afterwards because I just wanted to convert that node from ext4 to zfs lol)

rsync seems faster than cp, but i’m sure everybody knows that… just me being green lol
yeah i expect to run rsync while the node is online, then after it’s done, take the node offline and run it again, hopefully much faster because it’s 99% done, then try to start the node in the new location and see if it runs without issues.

i can just feel this is a bad idea… my webdash also died, tho that might be unrelated to moving the dataset and more in the “my fault” category.

it will take a long time to rsync, that’s for sure… last night using cp, i moved 140gb to lz4 in 1½ hours… so a rough estimate would be 38hr, but with rsync i think it should be done in 24hr

maybe it isn’t. just checked and i’m at like 800gb transferred since last night… i’m not sure exactly when i started the rsync, so it might not be much faster… even tho it looks like it is, a bit at least.

Different datasets are like different filesystems - moving a file means copying it to the new location and deleting the original.

However, if recordsize=1M makes the data compress, that’s cool. I knew I should have used zfs inside the vm as well :). I guess when I decide to expand the array I will convert this as well.

yeah the real question is if we get to keep this… a 33% saving is a lot… storj could do this at the satellite level and cut a big chunk of their expenses on data storage… so yeah i would love it if we kept this, but i doubt we will… it’s too good to be true… also maybe there are good reasons not to run 1m

Interestingly, the transfer just finished with a slightly disheartening, though not entirely unexpected, result. My 1.5 TB node went from occupying 1.38 TiB on the original pool to occupying 1.37 TiB on the destination. It appears that compressratio isn’t reflecting the actual outcome, at least not the way I would expect compressratio to apply. Having said that, the source dataset has compression set to on (i.e. lz4), as with all my datasets.

Yes, I’m using rsync. Usual tactics, which I think the Storj node migration guide also covers.

  1. rsync while the node is live. Set ionice if you’re concerned about IOPS being impacted.
  2. After the live rsync is done, turn off the node. Do a second rsync to capture all the data in a fully quiesced state.
  3. Use zfs set mountpoint= so the node’s path points at the new dataset, then start the new node (rough commands sketched below).
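
In rough command form, assuming a docker node and with the container name, paths and dataset names as placeholders (step 3 here swaps the mountpoints so the docker run line doesn’t have to change):

# 1. initial pass while the node is still running, at idle IO priority
$ ionice -c3 rsync -aHAX --info=progress2 /old/storagenode1/ /new/storagenode1/

# 2. stop the node, then a final pass with --delete to pick up changes and deletions
$ docker stop -t 300 storagenode1
$ rsync -aHAX --delete --info=progress2 /old/storagenode1/ /new/storagenode1/

# 3. either update the docker run --mount source to /new/storagenode1, or give the
#    new dataset the old path (after moving the old dataset aside), then start the node
$ zfs set mountpoint=/old/storagenode1-retired oldpool/storagenode1
$ zfs set mountpoint=/old/storagenode1 newpool/storagenode1
$ docker start storagenode1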

Yeah I was thinking about that as well… Will storj pay me for the 7TB used on my disk or the 9TB of actual file size…
But at least it opens up the option for more egress, since more data is stored.

Good point. Have to check that as well. But since my node already got lots of new data, I can’t compare it properly. It now reports 756G in df -h and 805GB used in STORJ… but that may not tell anything depending on how STORJ calculates the storage used.

I’m fairly certain that that’s a GiB to GB conversion more-or-less.
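
Quick sanity check on that: 756 GiB = 756 × 1.0737 GB ≈ 812 GB, which is at least in the ballpark of the 805 GB the node reports.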

Hmm, I found old outputs in my ssh client and it actually seems that the reported used space was exactly the same on my old drive (compressratio=1.06x) and on my new drive (compressratio=1.33x). This should have made a difference…

So I ran commands to find out the used space on the filesystem and the actual file size:
sudo du -chd1 /sharedfolders/STORJ2/ --apparent-size
sudo du -chd1 /sharedfolders/STORJ2/

And they both report almost the same number…
I confirmed that on a bigger STORJ node too… 760G vs 756G so 0.6% of actual compression…

So, I think, despite what zfs get compressratio tells us, there is actually no compression at all, which sucks :confused:
Guess I got us all excited for nothing… Well, the recordsize of 1M will reduce the IOPS needed and might enhance the performance of the node a bit, but that is probably all we can gain from it.

i did 4 tests
when i realized how slow it was to copy between datasets, i killed the cp, which ended at 108gb
then i used that as my simulated node data and did 2x gzip-9 copies, one from lz4, which gave the same number, so i was confused whether it even worked…

i then decided to copy it to an uncompressed dataset, which made the 108gb take up 140gb
from there i copied the 140gb into a new gzip-9 dataset and ended with a dataset of 108gb again.
the compression ratio seems to be the multiplier that you apply to the space used on disk to get the logical amount of data actually stored.

thus a 1.33x compression ratio means a 25% improvement from the original uncompressed dataset.
kinda makes sense, but figured i would check to be sure of how to interpret the numbers.
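
putting numbers on that test: 140gb logical / 108gb on disk ≈ 1.30x, right around the 1.29x compressratio i was seeing, and the space saved is 1 - 1/1.29 ≈ 22%… so a 1.33x ratio works out to roughly 25% saved.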

i mounted the zpool, so just need to change the docker run parameters.
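
so basically only the --mount source paths change; something like this, where all the values are placeholders and the rest stays whatever it was in the original run command from the setup guide:

$ docker run -d --restart unless-stopped --stop-timeout 300 \
    -p 28967:28967 -p 127.0.0.1:14002:14002 \
    -e WALLET="0xYOURWALLET" -e EMAIL="me@example.com" -e ADDRESS="mynode.example.com:28967" -e STORAGE="3.5TB" \
    --mount type=bind,source=/newpool/storj/identity/storagenode,destination=/app/identity \
    --mount type=bind,source=/newpool/storj/storagenode,destination=/app/config \
    --name storagenode storjlabs/storagenode:latest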

My iodelay on the cpu looks terrible, but the node and everything else seem to run fine…
the hdd vdev seems rather bored tho, the ssd is kinda busy but nothing crazy…

i used rsync -aHAX because i wasn’t sure what existed in the folder… sounded like that should take all metadata and such, but dunno…

@fmoledina can ionice improve transfer speeds? or does it just slow them down?

Impossible. The data flows directly between nodes and customers. Satellites only provide the address book, audit, repair and payment services.

33% saved then i would like to say… thank you xD

My intent was to suggest that you can use ionice to slow down the transfer if it’s impacting Storj operation. If you’re looking to speed up transfers, one way to possibly achieve that is msrsync. I use it as my initial shotgun transfer on the live system, followed by a standard rsync to capture the changes. There are some caveats listed on the GitHub repo, but the key ones are as follows:

  • No deletes
  • No transfers directly to/from remote connections

In the case of migrating my Storj node between different datasets to capture the new recordsize and compression settings, my actual steps would be as follows:

$ msrsync -P -p4 --rsync "--numeric-ids -aHAXx" \
/old/storagenode1/ /new/storagenode1/
...
# stop storagenode1 after the above is complete

$ rsync --human-readable --progress --stats \
--numeric-ids --delete -avHAXx \
/old/storagenode1/ /new/storagenode1/

I run these two zpools with 3-disk raidz vdevs so there’s some performance to be gained by multi-streaming rsync this way. YMMV.

That’s exactly what my conclusion is as well. Hopefully there’s slightly less load on disk with the increased recordsize but there aren’t any space benefits, despite what compressratio is telling us.

xD well rsync options like --human-readable, --progress and such will be an awesome addition… for now i’ve been relying on netdata to give me a rough estimate of how far along i am… looks like i’ve got less than 14 hours left of my way, way too long rsync transfer. i wasn’t running lz4, so i cannot speak to whether the 1m recordsize does much… but it would sort of make sense that larger chunks of data can compress better, but i dunno… very green in all this zfs stuff…

been trying to figure out why my zfs has such a ridiculously slow transfer speed between datasets…
i could upload the data to another computer on the network and transfer it back faster than it takes to move it between two datasets on the same array…

apparently it might have something to do with disk sector sizes, the random io of my vdev consisting of 5 drives, ashift, and blocksize (not to be confused with recordsize) lol… makes my head spin just trying to figure out what’s causing the slow transfer…

alas the server is running pretty great compared to what i would have expected… i’ve been moving a tb worth of data out of the pool at 40-50 mb a sec, then i decided why not fire up my local mediaserver, and tested transcoding a 1080p stream while storj, network transfers and my netdata monitoring all just ran flawlessly… sure there is a bit of a delay while the ARC figures out what i’m doing… but it runs close to good… got my NUMA setup pretty good now i think; that took away my issue with streaming while putting load on the server’s network.
But i’ve optimized so many things by now that i have no clue which ones were really the deciding factors.

i can see from my netdata that it’s my ssd L2ARC/SLOG/OS drive that is being stressed.
i really need to migrate the OS away from that… kinda want to move it to the zfs pool… but that comes with its own can of worms… like the issue that the server’s cmos boot sequence likes to prefer one drive and doesn’t like to boot from others, so if that particular drive in the zfs pool dies, then booting sort of dies with it… afaik…

so the solution might be to either use an old hybrid drive i’ve got or a 128gb ssd as an OS boot drive…
i think part of my issue with overtaxing the SSD is that, because it’s split into two different partitions, some things end up doing double or triple duty: first OS swap (now disabled), then the data moves to the zfs pool, which puts it into the zil/slog write cache, and from there it goes into the arc…
or something like that… thus i end up with like 6+ times the io on one device for a process designed to spread the io over multiple devices…

but it works, the iowait isn’t insane, and the storagenode doesn’t seem to really be affected even with all my abuse… a bit surprising… only lost like 1.5% in success rate… but i’m pretty close to 75% now… makes me feel bad lol when i’m pretty sure i could get to 85%, maybe even touch 90%.
but i’ve been getting 4mb/s from storj pretty consistently most of the day… which is almost more than i wanted, because i’m quickly running my head into the wall of what the pool can contain… but i should just barely stay below the max, with some 5% free for the storj db and transfers, until i get the old primary node’s rsync completed and rechecked while offline. i did the math and that should happen tomorrow afternoon, so i can deal with it if it gets too close to causing issues… xD

scary that i actually enjoy this… going to go through my cmos/bios on the next reboot and use the original intel documentation for my cpus to try and optimize the server some more… i doubt i’ll ever go back to consumer gear… this old machine is amazing.
a decade old and it still beats most consumer gear today… lol WTF… kinda makes me want to try a SPARC.

do these do what i think they do…
--delete -avHAXx

and why is -avHAXx better than -aHAX?

i suppose i could look that up myself…

What’s the zpool setup? Are you using drives with 4K sectors, and if so, did you use -o ashift=12 at pool creation? You should at least be getting transfer speeds in the same order of magnitude as the hard drive write rate.
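
If you’re not sure what the pool was created with, I believe you can check the ashift of each vdev with something like this (pool name is a placeholder):

$ zdb -C <poolname> | grep ashift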

That’s pretty impressive that you dug into that. I’m running basically a stock Ubuntu 18.04 setup with backported ZFS 0.8.3. No L2ARC/SLOG device. I haven’t touched any NUMA settings other than checking that it’s enabled.

I’m there with you regarding the old hardware. I’m running dual E5-2690v2 and it just keeps on going without a hitch no matter what I throw at it. My main issue is limiting the temptation to continuously expand. I’m now really trying to live within 36 hard drives.

I guess -v is redundant with --progress but it’s an old habit. -X is to copy any xattrs, i.e. extended attributes, -H is for hard links, and -A is for ACLs. Again, habit; I doubt the Storj object data depends on ACLs, xattrs, or hardlinks, but it’s just my default on Linux. The -x is for one-file-system, i.e. it won’t copy anything that is mounted underneath if it’s from a different filesystem (or ZFS dataset). Habit of mine to ensure I don’t bring along a bunch of other data that doesn’t belong on that mount when I sync.

Definitely make sure you run with the --delete option on your second run, so that your old and new folders are perfectly synced, including removing any files that were deleted from the old folder during the initial rsync.