Bandwidth utilization comparison thread

I had about 1.5GB of repair egress on the 19th. I’m on V1.15.3 though.

A bit better, but it seems to be getting less again. The short burst was an upload to Tardigrade.

It usually takes a long time for the system to reset/reconfigure or whatever it does, so give it a few days, maybe a week's time…

Has anyone seen a “correction” to the last few days where the TB*h has gone down drastically? This day was at around 110-115 TB*h, but now essentially nothing.

Well, yes. Wow, that's weird.


Getting a lot of deletions and seeing fairly high iops utilization for the storagenode lately… or I think it's the storagenode… I really need to get better at tracking down stuff like this in Linux…

Seems that my iops usage spikes at 200k+ and sustains 30-40k for the better part of an hour and stuff like that… checked the latency on my HDDs but nothing over 100ms peaks, and most around 20-30ms.
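
For reference, one way to sanity-check those latency and iops numbers outside of netdata, assuming the sysstat package is installed, is plain iostat:

    # Extended per-device stats every 2 seconds: r/s and w/s are the iops,
    # r_await/w_await are the average read/write latencies in ms.
    iostat -x 2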

I have in the past seen it behave like this because I had an HDD acting up, spiking to 2 seconds of latency…

But yeah, it seems to be some of the work the storagenode is doing… maybe it's related to all the work Storj Labs is doing on the orders.db and cluster migrations… I dunno…

Just wondering if anyone else is seeing similar behavior…
I checked and triple-checked and I cannot find anything indicating a real problem… aside from maybe heavy workload.

Else maybe I need to do a reboot or something…


Can’t say I see more iops than usual but definitely a lot more deletes.

And then from time to time it will spike into the… wait a minute… remembered something I need to check… I'm pretty sure I did put a max size on my docker log… else that may have become a real drag by now with 600+ hours of logs…

My system has been running for a long time, so it might be some configuration issue, because I'm not yet used to having it run for weeks on end…

It spikes into the 200k+ range, but that may be when it deals with the logs… which is only supposed to happen once every 10 minutes.

My bad, seems like it was my log file that had gotten out of control. I haven't fully confirmed it yet as the storagenode is still booting, but I'm pretty sure it worked… and the 200k spikes were from my own script exporting to external logs…

A bit surprised that the docker image can run for long enough that it will basically kill the system… might be something that should be fixed, since in theory, if activity kept going up, nodes running without external logs might get choked by their log files between updates…
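
For anyone who wants to guard against that, a minimal sketch of capping Docker's default json-file log at run time (container name, image tag and limits are just placeholders; the rest of the usual storagenode options stay as they were):

    # Keep at most 5 rotated log files of 10 MB each, so the log can never
    # grow without bound no matter how long the container runs.
    docker run -d --name storagenode \
      --log-opt max-size=10m \
      --log-opt max-file=5 \
      storjlabs/storagenode:latest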

Or maybe it didn't fix it… no matter, I'll give it another day and see if it keeps persisting… it's just been pissing me off that I can't figure out what's causing it… but I guess I'll just keep digging…


Yep, I've been keeping an eye on my logs for the better part of the last 4 hours and I'd say 50-80% of it has been deletes… a real purge going on here… My first thought was that since the majority of the deletes are occurring on the customer satellites, perhaps that FileZilla promo had ended and they were cleaning up from that for everyone who had uploaded stuff to help test it out, but that link on the Engineering page still takes you to the test signup page and I couldn't find any stated “duration”. So that's purely speculation at this point.
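
A rough way to put a number on that ratio, assuming a docker node with the default container name "storagenode" (the delete lines are the ones tagged piecedeleter):

    # Delete lines vs. all log lines over the last hour.
    docker logs --since 1h storagenode 2>&1 | grep -c piecedeleter
    docker logs --since 1h storagenode 2>&1 | wc -l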

Could just be another cleanup occurring before or after the migration that was announced, since it was stated that something was changing with the billing. Could be a mass purge of data from accounts that had not re-upped their subscription/provided a valid payment… again, just speculation. Either way, I am sad to see the data go.


Yea, kinda bursty behavior for sure over the past 24 hrs (sorry the graphs are so big; 4K screen):
[image: bandwidth graphs for the past 24 hrs]



Uhhh, yea… something just kicked off on US-C with a lot of deletes:

2020-11-25T04:34:45.569Z        INFO    piecedeleter    deleted {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "PRZOOMOH3GYKTS5PRVP7X6Z5MLGR4FPO7J5F2Y3HBHVW4ADLS4SA"}
2020-11-25T04:34:46.645Z        INFO    piecedeleter    deleted {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "H7HX7NA44JUCN4LGT3BSAT3BC5QGO27GUEKU7NKHS7AK52CRMI7Q"}
2020-11-25T04:34:46.665Z        INFO    piecedeleter    deleted {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "A2JPO2KLEKAOMCFQGGBAJWFCS2GRWUZPMYHSPPIPCWCB3ZG2FEMQ"}
2020-11-25T04:34:48.487Z        INFO    piecedeleter    deleted {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "R45PDOD6F6OTI2H3DMGDQQTSQQ2I4FUO6UF5AZKBDV7JKJJRPHOA"}
2020-11-25T04:34:48.708Z        INFO    piecedeleter    deleted {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "ZCWNRUZDFEV46YU3BOWS3CS7PM4RTQUKUSM2MSYK7Q764IDMXWPQ"}
2020-11-25T04:34:49.922Z        INFO    piecedeleter    deleted {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "PI2KOO2FI3XG5LIYL2AOMCWTPVMTUFXS6JIQJLMCDEDTF7SDNXUA"}
2020-11-25T04:34:49.923Z        INFO    piecedeleter    deleted {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "TNJNP4R5ZOPENI7WONYLT6B64BKM2XMJJIJ4IO3YILNF5FT257AA"}
2020-11-25T04:34:50.614Z        INFO    piecedeleter    deleted {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "OGWW6VKMDGNDWFYNHYB3C2GRAGRN5OZ65JURWQYUFSTBJGCRRP4A"}
2020-11-25T04:34:50.679Z        INFO    piecedeleter    deleted {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "7FVMKS2PNH5GVAKWMDSDUORNTU4A3FBFBJA2YKXFF6EZWVB6HATA"}
2020-11-25T04:34:51.638Z        INFO    piecedeleter    deleted {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "NTVHLMWCND73EGHSGQADMXSEZOEYNXQDDFXYCEDI3KHWV3VAR5YA"}
2020-11-25T04:34:52.397Z        INFO    piecedeleter    deleted {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "BGTVFIKUYBHNYQZK5SN437GNN75EMV2NTUXMPQXCP5WSUU4QTLOA"}
2020-11-25T04:34:53.722Z        INFO    piecedeleter    deleted {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "VAFEO23CGSQT2GQX6MTX5VQCDRR2VTAYPXOKIPYXYC5U4CM3EGEA"}
2020-11-25T04:34:53.767Z        INFO    piecedeleter    deleted {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "3PLWTO4NYCWRLCPPXQMRS5I54XXIYLYEBP7UL42OEDVECO3WDAQQ"}
2020-11-25T04:34:53.788Z        INFO    piecedeleter    deleted {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "HJZBIXQR4RHTSRNF5VO5MCMFKHEHOCHQLFX5UPPYXUIYT22H3GPA"}
2020-11-25T04:34:54.304Z        INFO    piecedeleter    deleted {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "3BOLXBTKLOIKTDPL655BXOVQQYMDQFWA2H5ECAVXUQOYCYCIMDJQ"}
2020-11-25T04:34:55.620Z        INFO    piecedeleter    deleted {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "5C7Y5PNBR4MZIBGZYHNLKB36REAWJZK5VJEFS6TEQD7Q2NWHLPRQ"}
2020-11-25T04:34:56.106Z        INFO    piecedeleter    deleted {"Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Piece ID": "76XDYF3DM3Z4I5FQBDYMBUKJ24TRA5JGCBLRPI5JR3SS73TOR3XQ"}
2020-11-25T04:34:57.238Z        INFO    piecedeleter    deleted {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "FWBHR3UK3LTEKK54HBJJVLUGMECAK4X7VGT22BMF2BMDS6AWXJDA"}
2020-11-25T04:34:57.464Z        INFO    piecedeleter    deleted {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "M4B2NWLPOAK7S3EX4CFDMEQABMOF5EMJT4EXXKVZXBU4QIRKYWLQ"}
2020-11-25T04:34:58.828Z        INFO    piecedeleter    deleted {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "VTTLW3N2W6Q2B3DB44HBYK4BLFNFM4WSLG765TFHNCREH5CMHHUA"}
2020-11-25T04:34:59.373Z        INFO    piecedeleter    deleted {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "R6NFGUWGXZ7OW3ZUO2DMHJRHYZ3SIHO4JF2LCGM4HJTDD6VZMMEA"}
2020-11-25T04:34:59.401Z        INFO    piecedeleter    deleted {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "BIY5GPHTBPK5KMFH3ZLEOOALXUPSH2C6GSTYNQLQDZ5CA2VTNIQQ"}
2020-11-25T04:34:59.434Z        INFO    piecedeleter    deleted {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "33BX7CWAYQ5AAX3XRPAEXAQLFT4X6HEYPEI5UEF3NZQPHRL2HXIA"}
2020-11-25T04:35:00.320Z        INFO    piecedeleter    deleted {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "QZYQYF7XUMUL5JUEQK44LODDAD6WNXZSRS55QAZLPXGIPFVJNFJQ"}
2020-11-25T04:35:00.393Z        INFO    piecedeleter    deleted {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "SRIOL6762JTTIBZHUUDED6ZLZHY2BAUCAXRW4K3XR2FAKJDKJTOA"}
2020-11-25T04:35:00.694Z        INFO    piecedeleter    deleted {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "YGSGBVH6BDEZNMN2DWOQ5CFMIZZ5GZCMMOKTS3UDUWDHQEY3KUSQ"}
2020-11-25T04:35:02.289Z        INFO    piecedeleter    deleted {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "MUF55IT5KRMPSKV7DLGNT454F5YQ4QJLVZTA2QTZIYN7KHXBIIWQ"}
2020-11-25T04:35:02.307Z        INFO    piecedeleter    deleted {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "65OIAURNQ66DHJGAVGXVWG2UD3S6L4AJTWHWZCCAULKJUOPVOKYA"}
2020-11-25T04:35:03.583Z        INFO    piecedeleter    deleted {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "SGFA5WRUWMRSI6XTGXHVNO7L3SOPZXUTLT35UIH4JFWYYKPMDUWQ"}
2020-11-25T04:35:05.882Z        INFO    piecedeleter    deleted {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "VRQXVFS34LK2QILWQFOM4SBD3WRPP65OQ5HGLIFUJCQOKZEZVWUA"}
2020-11-25T04:35:06.137Z        INFO    piecedeleter    deleted {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "LBQTCSP6X63L75TKEOBRTHBE5NRQTOC5NPO6PH3OLUH26RTI52ZQ"}
2020-11-25T04:35:06.732Z        INFO    piecedeleter    deleted {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "EAWMDHS24LMZOAOBLTFAKGZV76XCSL6Q4Q7DS3GDEDYDHNX7BUIQ"}
2020-11-25T04:35:07.141Z        INFO    piecedeleter    deleted {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "YPQD3DSK7RTKHPV33PUDIUGGRWSAYCD7EQ7DBO2H24UYNMWZYLPA"}

In fact, there have been so many that both nodes are actually contracting in size again. Currently only a net negative of 6GB per node over the past 24 hrs (even with the ingress), but if you multiply that by the over nine thousand active nodes, you get something like 54TB of piece chunks being deleted network-wide. Crazy stuff.
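
For anyone curious which satellite the purge is coming from, a quick tally over a saved log excerpt (the file name is just a placeholder):

    # Count piecedeleter lines per satellite ID.
    grep piecedeleter node.log | grep -o '"Satellite ID": "[^"]*"' | sort | uniq -c | sort -rn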


Same behavior here.

The summary of the last 24h: minus a few GB.

[image: last-24h summary graph]

At least it's not only me. Good to know :slight_smile:


Maybe it’s just the filewalker process.

Anyway, I know we often go off-topic in this thread, but iops don't have anything to do with bandwidth comparison.

Last 24 hours:
[image: last-24-hours bandwidth graph]
It's going down badly… Hopefully that means there'll be some more ingress afterwards… :smiley:


It's been happening for almost 10 days now and I've been trying to track it down… so not the filewalker… I'm pretty sure…

Most likely due to high uptime… my PCIe SSD does show some general weirdness in regard to power stability… it claims it lost power not much more than a few hours ago, which may indicate the PCIe SSD card is losing its connection… so maybe I should get my chassis back window fixed so I could screw the cards down… :smiley:
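
One way to check that power-loss claim, assuming the card shows up as an NVMe device (the device path is a guess, adjust as needed):

    # smartmontools: "Unsafe Shutdowns" climbing while the server itself
    # stays up would suggest the card really is dropping power/connection.
    smartctl -a /dev/nvme0 | grep -Ei 'power|unsafe'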

which might be wise…

It's just been acting weird lately, and in 10 days I haven't been able to figure it out, so a few days ago I started actively trying to fix the issue and I'm still not having much luck… so I figured it was worth checking whether it was just my system or the whole network seeing high iops…

I'm still seeing hundreds of iops 24/7 and only the storagenodes are running, and from what I can tell it's only the big storagenode, doing like 300 iops on average… which seems high.
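
To pin those iops on a specific process instead of guessing from the totals, something like pidstat or iotop works (both need to be installed; iotop needs root):

    # Per-process disk reads/writes every 5 seconds (sysstat's pidstat).
    pidstat -d 5
    # Or a single batch snapshot of only the processes currently doing I/O.
    iotop -boqt -n 1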

And everything is connected… :smiley:
iops and bandwidth are basically the same thing though… you cannot have one without the other; iops are what bandwidth consists of… essentially.

Still basically the same thing… even though I killed off 1800 hours' worth of log files, which were read by docker every 10 minutes when doing the docker logs time-based export.

That did reduce the peaks every 10 minutes from 250k iops to 80k iops, but it wasn't the issue, and now all the docker log files are reset…
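
For what it's worth, a sketch of a lighter time-based export, assuming that's roughly what the script is doing (names and paths are placeholders): reading only the last 10 minutes from Docker each run keeps the cost from growing with uptime.

    # Append just the last 10 minutes of the container log to the external log.
    docker logs --since 10m storagenode >> /mnt/logs/storagenode.log 2>&1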

The log iops spike, for reference… the others are just… weirdness.

But yeah, you are right, not really bandwidth related…
I'll make a thread for it if I don't get it solved over the next few days.
Or try the dreaded server reboot… but then I lose my nearly-finally-warm 800GB L2ARC,
which does seem to be a bit too big to be useful really…


It's going up again :slight_smile:

Are you sure you don’t have a scrub going on? Give “zpool status” a check. I’ll admit those are some pretty gnarly spikes.
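
A minimal check (pool name is a placeholder; plain "zpool status" covers all pools):

    # Shows whether a scrub or resilver is in progress and how far along it is.
    zpool status tank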

Now I've got a scrub… I ended up sitting next to the server while looking at the graphs on a laptop and noticed that the ticking sound it has been making correlated with the spikes… so I pulled the drive, and it might be due to rust in my HDD caddies touching the circuit board of the HDD, making a slight short circuit; not enough to make the drive truly unstable, and after having mitigated it with the HDD power management it seemed fine… in netdata though… max spikes seemed to be 100ms, but apparently the rest of the system didn't agree with that, because zpool iostat -w put it at 500ms.
Also, the issue might have become worse because the root OS SSD is on the same port or controller that the "bad" HDD was on…
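
For anyone wanting to reproduce that comparison, the ZFS latency histograms are worth a look (pool name is a placeholder):

    # Per-pool/vdev wait-time histograms, refreshed every 5 seconds; a bucket
    # filling up around the 500ms mark stands out even when other monitoring
    # tools report much lower peaks.
    zpool iostat -w tank 5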

So I cut a piece of antistatic plastic and put it between the HDD circuit board and the caddy, which seems to have removed the tick sound at least… for now… sometimes when I turn the drive off and on again it will go away for a while… so… and now it has been resilvered, and then I've got one scrub running, and if it finds errors I will have to do another one… so yay…

But maybe a fix… and more thumbs down for netdata… but I love to hate that program; it's amazing all the stuff it helps one keep track of… it's just not always reliable IMO…

So I will know if it's fixed sometime next week… I really hope it is… but I kinda doubt it; it's been an ongoing issue for a long time… but I'm not sure I tried removing/changing the caddy before, and with it now being fully insulated it shouldn't be a problem.

I'm also starting to ponder how this behavior would look on a mirror pool… if one spread the mirrors across different controllers or some such thing… it's also a very minor issue… and the iops, I dunno where they're from… tried a reboot and that didn't fix it either… it's not the storagenode though… but maybe it was caused by the HDD being weird… will have to wait and see.

Having to work with drives being weird and such has sort of made me ponder splitting my pools across controllers… so that each disk of a pool would essentially be on its own controller… of course I only have three controllers at present… so I will most likely try it out with a few mirror pools… and then hope something goes wrong in the near future… or try to build a test setup on another computer using purposely bad drives, maybe do a dual-boot kind of thing to compare it accurately.

The iowait is pretty mean on performance, and because the controller basically stalls when a disk doesn't respond, it creates a lot of related issues, which I would really like to iron completely out of my setup so that I can run drives basically into the ground if I feel like it… without latency or performance suffering.

Not really everyday user issues, lol…

So to get this straight… you had a drive micro-arcing against the drive sled? That's static, and you'd need to look into the humidity of that space; it sounds like it's out of whack, and that's probably not the only thing that's micro-arcing!


No, the humidity rusted the drive sled, and I think the solder/diagnostic points on the circuit board were touching the rust… I suppose one can call it micro-arcing… but I think you are referring to static discharges, and that's certainly not a problem… I seem to have managed to get my humidity from 95% down to 90% finally…

Though I'm not sure I really trust my humidity measurements; the room isn't great and it seems more temperature-dependent. It's basically impossible for me to pull the humidity out of the air because it might just get replaced from other sources.

I dunno if that was the problem… the drive made sounds… now it stopped… but I replaced the sled, gave the SATA connectors a spray of contact cleaner, and put antistatic plastic below the HDD in the way-too-cheap metal sled…

I've been suffering for buying a cheap chassis… a lot.
Don't worry, it's not micro-arcing… it's just corroding, LOL.
Micro-arcing in super low humidity would be scary though, terrifying…

The disk is behaving flawlessly compared to before, when it had 100ms peaks… behaving for now anyway… and the sounds are gone, and though I can see an improvement, the fundamental problem I was trying to hunt down… didn't get fixed…

But it's always nice when there is progress… I've been tinkering with this problem on and off for so long I still can't believe it's fixed… it's been getting worse for the last couple of months.