Bandwidth utilization comparison thread

Ingress has really picked up but it seems that repair is being a bit spotty

| Date | Ingress standard (GB) | Ingress repair (GB) | Egress standard (GB) | Egress repair (GB) | Daily space (TBh) |
|---|---|---|---|---|---|
| 2020-10-01 | 5.43 | 7.1 | 19.69 | 4.55 | 135.49 |
| 2020-10-02 | 5.1 | 10.13 | 22.53 | 6.08 | 149.56 |
| 2020-10-03 | 5.17 | 8 | 22.92 | 4.25 | 144.35 |
| 2020-10-04 | 4.94 | 12.88 | 22.61 | 6.8 | 134.96 |
| 2020-10-05 | 5.78 | 15.49 | 21.9 | 11.55 | 152.21 |
| 2020-10-06 | 3.02 | 6.89 | 23.56 | 5.85 | 135.96 |
| 2020-10-07 | 3.66 | 3.27 | 23.87 | 2.27 | 150.97 |
| 2020-10-08 | 6.43 | 2.91 | 27.14 | 2.16 | 141.06 |
| 2020-10-09 | 7.43 | 2.93 | 27.81 | 2.14 | 128.51 |
| 2020-10-10 | 4.07 | 3.99 | 27.99 | 2.87 | 143.46 |
| 2020-10-11 | 3.87 | 5.13 | 29.44 | 3.71 | 164.17 |
| 2020-10-12 | 3.76 | 4.71 | 32.38 | 3.28 | 150.67 |
| 2020-10-13 | 11.49 | 14.89 | 30.31 | 16.52 | 137.56 |
| 2020-10-14 | 21.59 | 8.03 | 26.23 | 5.83 | 147.76 |
| 2020-10-15 | 22.86 | 5.81 | 27.87 | 4.27 | 144.29 |
| 2020-10-16 | 32.6 | 8.7 | 26.11 | 9.56 | 153.80 |
| 2020-10-17 | 32.52 | 13.21 | 26.35 | 12.57 | 140.50 |
| 2020-10-18 | 29.97 | 7.12 | 27.69 | 5.38 | 154.40 |
| 2020-10-19 | 29.71 | 2.67 | 27.03 | 1.89 | 148.25 |
| 2020-10-20 | 15.66 | 2.25 | 13.27 | 1.63 | 57.64 |

2 Likes

yes it sure does seem to be improving… still very low… sadly… but at least my storagenode is gaining size instead of shrinking…

wouldn’t mind a few months of 200GB a day tho

1 Like

I believe a decrease in repair traffic is to be expected since the previous announcement, mentioned here:

Unless I misunderstood what this was all about…

@dragonhogan
hadn’t seen that, but i remember there was some talk about the repair… i thought it was the other way…
and tho it may have an initial effect on the ingress we see, the extra performance on the network might very well give more ingress long term because of better overall performance.

but yeah, ingress has been sad, to say the least… for a long time now…

well my migration went well… i did make one minor mistake: i had started rsync before i closed the node, then shut it down when i noticed, figuring i would just run rsync again… i didn’t,
so i ended up with a malformed db. i removed the --delete parameter from the rsync, ran it again, and started the node… i know i will have ended up with a few leftover files… but meh, better that than audit fails,
and i wasn’t quite sure how else to repair the problem.

so finally room to spare… my problem was becoming that i had so many drives in the server that this was my last chance to migrate / change my pool configuration before the storagenode grew so large that i basically couldn’t move it anymore, since my hdds are only 6tb each.

was a bit worried about my successrates at first…
but then there are days like this… :smiley: not too shabby, triple 9s
been trying to manage a quad 9… but that’s pretty tricky

./successrate.sh sn1-2020-10-21.log
========== AUDIT ==============
Critically failed:     0
Critical Fail Rate:    0.000%
Recoverable failed:    0
Recoverable Fail Rate: 0.000%
Successful:            1017
Success Rate:          100.000%
========== DOWNLOAD ===========
Failed:                0
Fail Rate:             0.000%
Canceled:              6
Cancel Rate:           0.015%
Successful:            40240
Success Rate:          99.985%
========== UPLOAD =============
Rejected:              0
Acceptance Rate:       100.000%
---------- accepted -----------
Failed:                1
Fail Rate:             0.003%
Canceled:              8
Cancel Rate:           0.024%
Successful:            32879
Success Rate:          99.973%
========== REPAIR DOWNLOAD ====
Failed:                0
Fail Rate:             0.000%
Canceled:              0
Cancel Rate:           0.000%
Successful:            3926
Success Rate:          100.000%
========== REPAIR UPLOAD ======
Failed:                0
Fail Rate:             0.000%
Canceled:              0
Cancel Rate:           0.000%
Successful:            790
Success Rate:          100.000%
========== DELETE =============
Failed:                0
Fail Rate:             0.000%
Successful:            7634
Success Rate:          100.000%

just wanted to share my “awesome” numbers… been a lot of work to get to this point lol

was trying to compare my ingress to yours, @kalloritis
seems i’ve been lagging a bit behind… might be because of poor successrates during migrations and scrubs, and i’ve also got a couple of vetting nodes that i haven’t accounted for… so my numbers are a bit difficult to evaluate atm…

have to keep an eye on mine because i have had another storagenode land on my subnet… luckily i made him leave. :smiley:

2 Likes

I’m not 100% sure, but it seems your effort to get perfect numbers is really a waste of time; my cheap RPi4 with an SMR drive, with little to no effort on my part, isn’t doing too bad.

========== AUDIT ==============
Critically failed: 0
Critical Fail Rate: 0.000%
Recoverable failed: 0
Recoverable Fail Rate: 0.000%
Successful: 2052
Success Rate: 100.000%
========== DOWNLOAD ===========
Failed: 434
Fail Rate: 1.200%
Canceled: 3271
Cancel Rate: 9.041%
Successful: 32475
Success Rate: 89.760%
========== UPLOAD =============
Rejected: 0
Acceptance Rate: 100.000%
---------- accepted -----------
Failed: 0
Fail Rate: 0.000%
Canceled: 19
Cancel Rate: 0.553%
Successful: 3418
Success Rate: 99.447%
========== REPAIR DOWNLOAD ====
Failed: 0
Fail Rate: 0.000%
Canceled: 0
Cancel Rate: 0.000%
Successful: 35360
Success Rate: 100.000%
========== REPAIR UPLOAD ======
Failed: 0
Fail Rate: 0.000%
Canceled: 0
Cancel Rate: 0.000%
Successful: 288
Success Rate: 100.000%
========== DELETE =============
Failed: 0
Fail Rate: 0.000%
Successful: 413
Success Rate: 100.000%

my second node, running on a super cheap Dell:

========== AUDIT ==============
Critically failed: 0
Critical Fail Rate: 0.000%
Recoverable failed: 0
Recoverable Fail Rate: 0.000%
Successful: 4391
Success Rate: 100.000%
========== DOWNLOAD ===========
Failed: 535
Fail Rate: 0.353%
Canceled: 8884
Cancel Rate: 5.856%
Successful: 142298
Success Rate: 93.792%
========== UPLOAD =============
Rejected: 0
Acceptance Rate: 100.000%
---------- accepted -----------
Failed: 20
Fail Rate: 0.006%
Canceled: 8833
Cancel Rate: 2.839%
Successful: 302320
Success Rate: 97.155%
========== REPAIR DOWNLOAD ====
Failed: 0
Fail Rate: 0.000%
Canceled: 0
Cancel Rate: 0.000%
Successful: 17165
Success Rate: 100.000%
========== REPAIR UPLOAD ======
Failed: 1
Fail Rate: 0.004%
Canceled: 119
Cancel Rate: 0.486%
Successful: 24370
Success Rate: 99.510%
========== DELETE =============
Failed: 0
Fail Rate: 0.000%
Successful: 16139
Success Rate: 100.000%

both are running on SMR drives.
Also, each system is running 2 nodes, so the numbers will probably be better for the second drives since they aren’t SMR.

3 Likes

when it comes to read performance there really isn’t any significant difference between SMR and CMR
i only have CMR drives, so i cannot really compare them… it would be interesting to see an actual side-by-side of SMR and CMR running on the same node.

it hasn’t just been a hunt for perfect numbers, i’ve been learning how to work with zfs, proxmox and linux in general as i jumped right into all of that when i started up my V3 storagenode…

your numbers look about normal for running with a single hdd’s worth of iops, but i haven’t really tested that much. also, stuff like moving the storagenodes into containers while still accessing the host storage proved to be a bit annoying… again, not really something that serves much of a purpose, but it makes the proxmox graphs a bit more useful…
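for reference, the usual way to do that on proxmox is a bind mount into the container, something like this (container id and paths here are just placeholders, not my exact setup):

pct set 101 -mp0 /tank/storagenode,mp=/mnt/storagenode   # bind-mount the host dataset into the LXC container
pct stop 101 && pct start 101                            # restart the container so the mount point shows up
# unprivileged containers additionally need uid/gid mapping before the node can write to it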

i’m sure if i knew what i was doing when i started i could have managed to rollout this setup in a day…
hadn’t really done much IT for the last decade or so, thus getting back to it and everything being different was a bit of a challenge…

very happy to start moving away from windows tho… even if i really hate linux sometimes, there are also many things i really like about it.

without a doubt an RPi is a nice little efficient system, and i dunno if i would have built my setup this scalable if i had to do it again, but now i’ve got a solid base and a well-performing storage system, in case i decide to get a blade center next to get some processing going in my wannabe datacenter.

1 Like

Doesn’t ZFS allow you to “grow” disks? There is an option in “normal” RAIDs to remove one drive and insert a bigger one, let it rebuild itself, then repeat the process with all drives. When you have all the bigger drives inserted, you grow the partition to “max”.

zfs is different, and zfs on linux doesn’t have all the same features; even where one can remove a drive, it is a patch at best.

zfs doesn’t have a structure like a normal raid when writing to the drives. basically, when running raidz1, 2 or 3 on a pool you can add additional vdevs (raidz1/2/3 or mirror); it makes the most sense to add vdevs similar to the existing ones, tho that’s not required. however, once added, a raidz vdev cannot be removed from the pool again.
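in command form it’s roughly this (device names are placeholders, just a sketch of the idea):

zpool create tank raidz1 /dev/disk/by-id/DISK1 /dev/disk/by-id/DISK2 /dev/disk/by-id/DISK3   # first 3-drive raidz1 vdev
zpool add tank raidz1 /dev/disk/by-id/DISK4 /dev/disk/by-id/DISK5 /dev/disk/by-id/DISK6      # adds a second vdev; new writes stripe across both
# as far as i know, zpool remove can take out mirror / single-disk top-level vdevs, but not raidz vdevs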

it’s something to do with how zfs handles parity information, if memory serves: because zfs has a variable recordsize (blocksize) you cannot easily predict the boundaries of the different blocks, so pulling the data apart again after it’s written is very difficult.

so i had 5x 6tb drives and 4x 3tb drives all used in the same pool, by making one vdev of 3x 6tb hdds in raidz1 and then 2 additional vdevs of 3x 3tb (some of them 6tb drives used as 3tb) in raidz1, which gave me 12tb from the first vdev and another 2x 6tb from the last two.

on this 24tb pool one can store maybe 21-22tb, more like 20 when the 10% storj buffer is added.
and because 2 of the drives were 6tb but used as 3tb, that was another 6tb lost. the storagenode was 14tb in size, and because i wanted to expand from 3-hdd raidz1 vdevs to 4-hdd raidz1 to lose less to redundancy, i was locked into buying the same model of 6tb hdd i already had 5 of, so they would work in harmony with all the others (in theory this only really matters per vdev)… i could have gone with 5-hdd raidz1, but that wasn’t an option for many reasons.

adding 3x 6tb hdds gave me an additional 18tb, of which i could use something like 16tb, and my node was 14tb… and i would need to destroy my old pool, which i also would need to empty first.
and i didn’t want 1 big drive, for iops and redundancy reasons.

then my storagenode was getting to a size where it was becoming near impossible for me to move it out of its current pool without making the upgrade even larger.

this is partially also why i’m considering maybe only using mirror pools in the future, which don’t suffer from the same issues and can be taken apart, added to at will, and also re-balanced across the drives.

yes, you can keep adding to a zfs pool, but you cannot remove capacity from the pool in the case of raidz, thus i was getting very close to being stuck with the current pool until i bought something like 3x 12tb.
but even then i would still have 6tb wasted in the pool, because i didn’t have enough working 3tb drives to replace my 2x 6tb hdds.

in zfs on bsd you can remove vdevs from a pool, but that will inherently affect the structure of the data on the pool, and if one can avoid using it, one should… tho i dunno how relevant this is in practical terms; in theory it seems like a very bad idea… due to how zfs works.

even tho i migrated the node to a temporary span of 3x 6tb drives, i still wasn’t able to move it directly into a pool of 2x 4-hdd raidz1 vdevs; i had to move it into a pool with 1x 4-hdd raidz1 vdev and then add the second 4-hdd raidz1 vdev to the pool.
which means all of my storagenode is on 1 vdev, and even now, a little over a week later, only 80gb is on the added vdev, which hurts my read iops in most cases, until it’s more balanced out…

again, i’m not even sure i can balance the data between the vdevs, because it’s too complex due to the whole variable recordsize thing. zfs isn’t magic, it has its own problems…
in most cases it won’t lose your data without screaming at you first… but it’s not designed to be run at a medium scale… you either run small setups or go large… anything in between is just annoying to work with when it comes to maintenance and long-term usage.

TL;DR

ZFS / ZoL, in its current version and to my knowledge, cannot remove hdds from a raidz setup, due to the issues caused by variable blocksizes / recordsizes; ofc you can keep expanding it into using 100s of hdds.

had i waited much longer it would have been difficult to dismantle my old pool and migrate to a new one, for more reasons than i mentioned above… :smiley:

2 Likes

to darken the color of your shading-

You’re allowed to “migrate up” your storage space by replacing drives, running zpool replace tank /dev/disk/by-uuid/a79b9257-7efd-42e7-93a5-404d16d16817 each time you upgrade a disk; you just will not “realize” that increased capacity until all drives within a string (see: vdev) are of a common, larger size.

Eg, if SGC has his 3x6TB vdev string and wants to migrate to some shucked 12TB, he would replace one at a time, run the zpool command which triggers a resilver process, allow time for that to complete and rinse-repeat until all three are migrated. Once that is done, ZFS would see it as a 3x12TB vdev.

The thing to remember is that this is a parity-RAID-ism, meaning all RAID-like setups that leverage parity must be allowed to return to a fully healthy state before resuming further migrations. It is also worth noting that in-place migrations like this are the most intense way, since there are as many resilvers as there are drives in the vdev being replaced, which puts a rather immense strain on all the other drives in the vdev. SGC, based on his comments, specifically performed a filesystem-level copy instead of a ZFS snapshot/restore or zfs send/receive. This allowed him to migrate without putting excessive load on his current production drives and possibly pushing one into a fail state.
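In command form the one-at-a-time upgrade looks roughly like this (device paths are placeholders):

zpool set autoexpand=on tank        # lets the vdev grow once every member disk is larger
zpool replace tank /dev/disk/by-id/OLD_6TB_1 /dev/disk/by-id/NEW_12TB_1
zpool status tank                   # wait for the resilver to finish before touching the next drive
# repeat the replace + wait for each remaining drive in the vdev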

As for things like the variable recordsize- this sort of fun can even be configured, although ill advised, at even the dataset level (read sub-directory of pool); just like compression and many other tunables.

3 Likes

yeah, forgot to mention that option, but i was moving from a 3-drive raidz1 to a 4-drive raidz1, so sadly couldn’t do that… when i started using zfs i was of the arrogant notion that i could improve upon the settings… sure, there is certain fine-tuning one can do, but i’ve learned that ZFS has very good default configurations, and now for the most part i don’t tinker much with them.

tho i do run a 256k recordsize on my storagenode to make migrations faster. didn’t really account for the advantage of using rsync over zfs send / recv; the reason i was using rsync was to zero out my fragmentation, not that it was bad… but i had been testing dedup, just on the vhds for my vms,
and it caused my fragmentation to rise pretty sharply… and i haven’t really tried using zfs send / recv yet…
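setting the recordsize is just a dataset property (dataset name here is made up), and it only applies to newly written blocks:

zfs set recordsize=256K tank/storagenode
zfs get recordsize,compression tank/storagenode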

the reason i don’t run larger raidzs is iops; with 4 hdds in a raidz1 and only 6tb hdds it doesn’t take very long to resilver… haven’t really tested that too much either, but the math behind it is pretty basic… my setup only needs to read 3 hdds to generate 1 hdd’s worth of data for resilvering.
i also kinda like that each pool can be on 1 sas port on the hba… not sure if that really matters, but i figured it might… and 4 hdds is a nice spot for capacity utilization, with only 25% lost to redundancy.

i do try not to punish my drives too much, which is also why i like having decent iops, if one can call it that when having 8 drives in a pool and getting 2 hdds’ worth of iops lol

does zfs send / recv really create that much more load?… or i guess it wouldn’t have the same iops demand, since it streams the whole dataset instead of copying each file individually… but then again, with many small files there would most likely be a record per file, and thus it would be almost the same…

ofc i guess it’s a lot more sequential transfer instead of spending a good deal of time on seeks like rsync does… my experience with zfs is very much still in the lots-to-learn stage… but i’ve already learned a ton :smiley:
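from what i understand the snapshot route would look something like this (names made up, and again i haven’t actually run this myself):

zfs snapshot tank/storagenode@migrate1
zfs send tank/storagenode@migrate1 | zfs recv newpool/storagenode
# stop the node, snapshot again, and send only the delta since the first snapshot
zfs snapshot tank/storagenode@migrate2
zfs send -i @migrate1 tank/storagenode@migrate2 | zfs recv -F newpool/storagenode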

been pondering raising my metadata % in the l2arc / arc, because it seems to give a great boost… maybe… ofc there is the special vdev option, but then there is the whole “if the special device dies, the whole pool dies” thing… thus one would really need a mirror to be able to depend on it…
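so far i’ve only been poking at the per-dataset cache properties, something like this (pool/dataset name made up; the actual arc metadata limit is a module tunable that differs between versions):

zfs set secondarycache=metadata tank     # l2arc for this dataset caches metadata only
zfs get primarycache,secondarycache tank
arc_summary | grep -i meta               # check metadata hit rates, if arc_summary is installed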

and tho i trust my new-old pcie ssd to have plenty of safeguards… i don’t really like the idea that if it dies the pool is toast…
oh yeah, and i run sync=always to limit fragmentation, and with a PLP SSD it’s my backup in case the power goes out… that way the odds of file corruption or similar issues are basically non-existent; it does require a pretty good ssd tho…
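the commands for that part are simple enough (device paths are placeholders; the log device is mirrored here so a single ssd failure can’t take recent sync writes with it):

zfs set sync=always tank
zpool add tank log mirror /dev/disk/by-id/SSD_A /dev/disk/by-id/SSD_B
zpool status tank                        # the “logs” section should now list the mirror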

what’s the advantage of using uuid… is that the one you can configure via GPT?
i’ve been using /dev/disk/by-id, which i kinda like, tho it is a bit annoying that zfs will be very averse to letting a drive go again… i hear using the GPT label / partition name is the way to go, because then one can create a new GPT, the identifier will change, and zfs will stop trying to use the drive…

it’s kinda nice that the /by-id name is made from the serial number and thus is printed on the drive sticker by default… which is a nice help at times… never did manage to get that bay-led blinking thing to work; that’s one of the features i really miss from running the megaraid software in windows.
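checking / switching a pool over to by-id names is easy enough (pool name made up):

ls -l /dev/disk/by-id/ | grep -v part    # the names include model + serial, same as on the drive sticker
zpool export tank
zpool import -d /dev/disk/by-id tank     # re-import so zpool status shows the by-id names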

and actually, if we want to go into details, zfs per default doesn’t auto-expand to fit bigger drives before you turn that on (the autoexpand pool property)… :smiley: but ofc that usually gets set pretty early, and then one doesn’t have to think about it again any time soon, if ever…

ZFS is very awesome, if rather confusing at times… i’ve been trying to increase my queue depth, because i’m just plain bad at some of this stuff and all my drives are behind one … something on the HBA, some ID thing, i forget what it’s called… and zfs will set queue depths based on the number of … something with a v i think… not volumes, not virtual drives…

but i have since found out that it doesn’t seem to be a problem… it was just my system creating a ton of random iops, and thus my ssds couldn’t keep up.
sync=always is a tough one to pull off… but it has its advantages.

Isn’t this just the same strain each scrub puts on all drives? And it is recommended to scrub at least once a month.
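(Scheduling that is just a cron entry, something like this; the pool name is made up, and some distros already ship a similar monthly job:)

# /etc/cron.d/zfs-scrub
0 3 1 * * root /usr/sbin/zpool scrub tank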

1 Like

yeah i think the scrub comparison is spot on actually…

tho in some cases of larger pools there might be other bottlenecks, like the computation required to regenerate the data from the other drives or some such thing… but that wouldn’t really stress the drives… so i guess that’s a moot point … xD

Yes, and yes. They can progress at different speeds though, based on one important detail that SGC pointed out:

He’s directly pointing to systems that have massive numbers of drives connected to a single chassis (think netapp’s old style) and that are using parity-based redundancy. It was not uncommon to find a single chassis with 1 or 2 SAS JBOD chassis next to it, full of drives, too. So you might have a system that was 12 to 36 drives itself, but has 2 JBODs attached over SFF-8088 cables (or SFF-8470 if it was an infiniband setup) that added up to another 180 drives (2 chassis of 45 bays, dual drive; e.g. 847DE1C-R2K04JBOD or the top-load 90-bay 946ED-R2KJBOD). Also, understand, these are absolutely design-specific, niche setups that stretch the limits of a single chassis.

I actually pulled that from my notes of having to reinitialize a drive, the same drive, after a cable swap. If you were swapping out a disk connected to an HBA for a new disk, you’d actually use /dev/daX on FreeBSD-based systems or /dev/sd{a…g} on linux-based systems (proxmox, CentOS, debian, etc).

@SGC Could you please share your rsync command? Do you use any specific switches to speed up execution? Thanks.

@kalloritis using /dev/sdX is crazy; those designations jump around like crazy. proxmox doesn’t use /dev/sdX either; it labels the partition, i believe, and then uses that to identify the “drive”.

ofc for single-drive usage sdX may do fine… but still, if you take out or add drives on a regular basis, it will not take long before using sdX bites you in the ass with zfs.

if you use the by-id method, then when zfs sees the drive again it will put it back in the pool…

you can pull the damn drives like playing a piano and it mostly doesn’t care… “mostly”
does get a bit testy if you go below redundancy… xD

i started out trying to break my zfs… figured i would want to break it before i got any significant data on it… let’s just say the first few months i was so mean to my setup…

i can recommend removing the l2arc while other stuff is running… that’s fun… dunno what can actually keep running, but thus far i’ve found that if i run too many things and remove the l2arc, the pool basically stalls…
plan on using it for some tests some time when i get around to it.

@Marvin-Divo i basically just use rsync -aP. i did use some fancier -aHAX variant before, but -aP is basically everything one could want rolled into one, in regard to permissions and such.

then i think -B 131072 helps… it sets the block size rsync uses for its delta checksums (--block-size)… seemed to help when i tried it last time; this time i’m not sure…

mostly it’s down to the size of your storagenode and the iops of your hdd… migrating usually takes a day or so per 1-2 tb i think… ofc your results may vary wildly…

i’m assuming a regular 7200rpm hdd…

so we run: rsync -B 131072 -aP /source/ /destination/
the first few times i don’t include the --delete parameter… but after the 3rd or 4th run…

generally i keep running rsync until it’s basically live-updated; if it’s a slow day and the node not too huge, a run might only take 10 minutes… it usually takes 4-5 runs, then i run:

rsync --delete -B 131072 -aP /source/ /destination/
this will delete any files that were deleted from the storagenode while transferring. not sure if running that separately at the end makes a difference tho; you could just use the --delete parameter on all of them.

then i shut down the storagenode and… you guessed it… run rsync again.
while that runs i prep my run command for the new location, and then after 10-15 minutes of downtime, or an hour depending on how well it goes, the storagenode is up again from the new location.

so basically what they say to do in the storj documentation :smiley:
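so the whole dance, spelled out (paths and container name are placeholders):

rsync -aP -B 131072 /oldpool/storagenode/ /newpool/storagenode/            # repeat until a run finishes quickly
rsync -aP -B 131072 --delete /oldpool/storagenode/ /newpool/storagenode/   # clean out files deleted in the meantime
docker stop -t 300 storagenode                                             # stop the node for the final pass
rsync -aP -B 131072 --delete /oldpool/storagenode/ /newpool/storagenode/   # final sync while the node is down
# then re-create the container pointing at the new path and start it again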

one thing i will note is that this is a very passive process until the final rsync… you can run an rsync every day and it will stay pretty close; it’s not really something one needs to monitor… you just need to start it a few times at shorter and shorter intervals…

the best way to get more iops is using ssds :smiley: or mirror pools, multiple raidz vdevs in a pool, an l2arc, a slog, or caching in general. shutting down your storagenode while migrating will also generally speed up the process greatly… even tho the node doesn’t eat a lot of bandwidth, it does seem to reduce hdd performance quite a lot while it’s running.

else, writing with larger blocksizes / recordsizes will also help when migrating, but this is a bit of a double-edged sword: the larger the recordsize, the less efficient your caches and ram utilization will be when working with data from that particular pool / array / hdd.

best way to speed up rsync is to leave … fire it off and then just leave and come back the next day… or next week if it takes that long… it’s free, it always works… :smiley: and it doesn’t eat hay…

2 Likes

1st of Nov Total ingress 25GB

(total ingress being ingress + repair ingress across all nodes on a single subnet)

what are you guys seeing…?

37GB per subnet… or are you asking about egress? ’cause that was 25GB for a 7TB node :smiley:

I had a bit more than 30GB of combined ingress and about 10GB of egress for 4.8TB of stored data.

Around 36GB of combined ingress here, for the 1st of Nov.
Around 26GB of combined egress for ~4TB of stored data.