Storagenode Memory Utilization Comparison

Not sure if a thread akin to this already exists…

Recently I had a few issues with my node, and now that it’s basically back to normal, my memory usage is slightly on the higher end of what I have been used to… so I wanted to get an idea of whether this is due to higher activity levels at the moment…

and to see how different hardware, OSes, and node sizes affect memory utilization.

And there is no significant IOwait… nothing I would call significant, anyway…

Up to 400 MB today, according to netdata.

No significant IOwait showing in Proxmox.

Ran zpool iostat -w 600 (a latency histogram over a 600-second interval):
tank         total_wait     disk_wait    syncq_wait    asyncq_wait

latency      read  write   read  write   read  write   read  write  scrub   trim
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
1ns             0      0      0      0      0      0      0      0      0      0
3ns             0      0      0      0      0      0      0      0      0      0
7ns             0      0      0      0      0      0      0      0      0      0
15ns            0      0      0      0      0      0      0      0      0      0
31ns            0      0      0      0      0      0      0      0      0      0
63ns            0      0      0      0      0      0      0      0      0      0
127ns           0      0      0      0      0      0      0      0      0      0
255ns           0      0      0      0      0      0      0      0      0      0
511ns           0      0      0      0      0      0      0      0      0      0
1us             0      0      0      0      0      0      0      0      0      0
2us             0      0      0      0      0      8      0      4      0      0
4us             0      0      0      0      1     32      0     23      0      0
8us             0      0      0      0      0      1      0      5      0      0
16us            0      0      0      0      0      0      0      1      0      0
32us            0      0      0      0      0      0      0      2      0      0
65us            0      1      0      1      0      0      0      3      0      0
131us           0     14      0     31      0      0      0      7      0      0
262us           0     40      0     85      0      0      0     15      0      0
524us           0     34      0     70      0      0      0     21      0      0
1ms             0     40      0     16      0      0      0     28      0      0
2ms             0     36      0      8      0      0      0     25      0      0
4ms             0     15      0      2      0      0      0     11      0      0
8ms             0     10      0      4      0      0      0      7      0      0
16ms            1     14      1      7      0      0      0     11      0      0
33ms            0     12      0      4      0      0      0      9      0      0
67ms            0      9      0      1      0      0      0      7      0      0
134ms           0      4      0      0      0      0      0      3      0      0
268ms           0      0      0      0      0      0      0      0      0      0
536ms           0      0      0      0      0      0      0      0      0      0
1s              0      0      0      0      0      0      0      0      0      0
2s              0      0      0      0      0      0      0      0      0      0
4s              0      0      0      0      0      0      0      0      0      0
8s              0      0      0      0      0      0      0      0      0      0
17s             0      0      0      0      0      0      0      0      0      0
34s             0      0      0      0      0      0      0      0      0      0
68s             0      0      0      0      0      0      0      0      0      0
137s            0      0      0      0      0      0      0      0      0      0
--------------------------------------------------------------------------------

Got a few stragglers at 134 ms.
I suppose those could be to blame… it doesn’t seem high enough to make a significant impact.
Of course it could be the reason I’m seeing new numbers in the storagenode memory usage, or it’s something to do with the extra latency created by my now dual-disk slog.

Or the storagenode’s cache size changes on a longer time scale and I’m still seeing some of the after-effects of the issue I had a few days ago.
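If I want to rule the slog in or out, I could break the latency numbers down per vdev; roughly something like this (just a sketch, assuming the pool is still named tank):

# average latency per vdev, including the log device(s), sampled every 60 s
zpool iostat -v -l tank 60

# or the same histogram as above, broken out per vdev
zpool iostat -w -v tank 60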

Don’t use dedup, lol. I had decided to try out dedup on my VM dataset, which shares the pool with the storagenode…

It was a long time coming; I think it had been running for 3 weeks before the issues really started to show up. I figured it was a good place to test, because I knew I could reverse any ill effects by moving the fairly limited VM disks out of the pool and back in… though that didn’t seem to be required.
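For anyone else tempted by dedup: before and after turning it off, this is roughly what I’d look at (a sketch; tank/vms is just a placeholder name for whichever dataset has dedup enabled):

# show how big the dedup table (DDT) has grown - it has to be kept in RAM/L2ARC to stay fast
zdb -DD tank

# stop deduplicating new writes; already-deduped blocks stay deduped
# until they are rewritten (e.g. by moving the VM disks off the pool and back)
zfs set dedup=off tank/vms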

My storagenode memory usage has now finally returned to normal…
First time I’ve seen numbers this low in maybe a week, since I had all my latency issues.
Anyway, I wanted to make a bit of a record so that others might use it later…

From what I can tell it can take about a week until node memory usage drops down to normal levels again… or at least that was how it seemed to go in my case…

If this is my last addition, then it will have remained low for an extended period…
I might do a 1- or 3-month follow-up…

Figured it would make sense to have a place where SNOs could get an idea of what memory utilization they should expect. I sure was happy that I had a good deal of spare memory, since I peaked at about 3 GB of cache on my storagenode.

@littleskunk riddle me this…

I’m pondering whether this is normal peak RAM usage, because it will spike into the GB range at times.
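For reference, this is more or less how I take a snapshot of what the node is actually sitting at; a sketch, assuming the node runs in docker under the name storagenode:

# one-shot snapshot of the container's current memory use vs. its limit
# (leave off --no-stream to watch it live and catch the spikes)
docker stats --no-stream storagenode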

And here is a comprehensive look at the system statistics.

No current utilization or IOwait to speak of.

Weekly IOwait also looks fine… I forget what I was doing when it averaged a peak of 18%; most likely a scrub of my pool, so to be expected. The start is the system boot; it takes a few hours to warm up the ARC.

But as you can see, no considerable IOwait.

The 24-hour Proxmox CPU graph also looks fine with regard to IOwait.


I was running a find command a little while back… which caused the little spike at the end of the graph.

Still no IOwait…

Checking the logs with successrate.sh also looks fine:

./successrate.sh storagenode-2020-09-23.log storagenode-2020-09-24.log
========== AUDIT ==============
Critically failed:     0
Critical Fail Rate:    0.000%
Recoverable failed:    0
Recoverable Fail Rate: 0.000%
Successful:            2593
Success Rate:          100.000%
========== DOWNLOAD ===========
Failed:                2
Fail Rate:             0.003%
Canceled:              65
Cancel Rate:           0.114%
Successful:            57139
Success Rate:          99.883%
========== UPLOAD =============
Rejected:              0
Acceptance Rate:       100.000%
---------- accepted -----------
Failed:                0
Fail Rate:             0.000%
Canceled:              9
Cancel Rate:           0.036%
Successful:            24657
Success Rate:          99.964%
========== REPAIR DOWNLOAD ====
Failed:                0
Fail Rate:             0.000%
Canceled:              0
Cancel Rate:           0.000%
Successful:            20497
Success Rate:          100.000%
========== REPAIR UPLOAD ======
Failed:                0
Fail Rate:             0.000%
Canceled:              1
Cancel Rate:           0.029%
Successful:            3432
Success Rate:          99.971%
========== DELETE =============
Failed:                0
Fail Rate:             0.000%
Successful:            62558
Success Rate:          100.000%

I could try to go through individual disk latency, but I can’t really see anything of note…
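If I do end up digging into individual disk latency, this is probably where I’d start (a sketch; iostat needs the sysstat package installed):

# per-device average wait times (await) and utilisation, refreshed every 5 s
iostat -x 5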

 pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 0 days 15:25:56 with 0 errors on Wed Sep 16 13:26:48 2020
config:

        NAME                                         STATE     READ WRITE CKSUM
        tank                                         ONLINE       0     0     0
          raidz1-0                                   ONLINE       0     0     0
            ata-HGST_HUS726060ALA640_AR11021EH2JDXB  ONLINE       0     0     0
            ata-HGST_HUS726060ALA640_AR11021EH21JAB  ONLINE       0     0     0
            ata-HGST_HUS726060ALA640_AR31021EH1P62C  ONLINE       0     0     0
          raidz1-2                                   ONLINE       0     0     0
            ata-TOSHIBA_DT01ACA300_531RH5DGS         ONLINE       0     0     0
            ata-TOSHIBA_DT01ACA300_99PGNAYCS         ONLINE       0     0     0
            ata-TOSHIBA_DT01ACA300_Z252JW8AS         ONLINE       0     0     0
          raidz1-3                                   ONLINE       0     0     0
            ata-HGST_HUS726060ALA640_AR31051EJS7UEJ  ONLINE       0     0     0
            ata-HGST_HUS726060ALA640_AR31051EJSAY0J  ONLINE       0     0     0
            ata-TOSHIBA_DT01ACA300_99QJHASCS         ONLINE       0     0     0
        logs
          fioa2                                      ONLINE       0     0     0
        cache
          fioa1                                      ONLINE       0     0     0

errors: No known data errors

Hmmm, looks like the first spike was the scrub on the 16th, and I cannot remember what I did to cause the other one… I doubt it’s relevant… though last time it took days before my “recorded” storagenode memory usage dropped down to its usual 70-90 MB…

I suppose it could still be the spike from the 20th…

Guess I’ll just have to wait and see if it comes back in the future… Ah, the 20th is the boot… lol, I had it reversed: until the ARC and L2ARC take over, the IO load on the pool is a bit heavy…
and of course the storagenode also boots right along with the server… :smiley:

I just added a new PCIe SSD specifically to try to limit my IOwait, because my old setup of dual SATA SSDs got overworked so badly that it ended up with 120 ms latency, which then affected the HDDs’ latency.

But ever since I got that working, 4 days ago now, my numbers have been great… of course I haven’t really put serious load on the system yet… just running 2-3 VMs, but I have tested up to 9 without it showing any issues, aside from the fact that I don’t have enough RAM :smiley:
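For the record, moving the slog and L2ARC onto the PCIe card was basically just two commands once it was partitioned; roughly like this (a sketch using the fioa1/fioa2 partition names from my pool, which will obviously differ on other hardware):

# add one partition as the log (slog) device and one as the cache (L2ARC) device
zpool add tank log /dev/fioa2
zpool add tank cache /dev/fioa1

# the old, overworked SATA SSD log/cache devices can be dropped afterwards with
#   zpool remove tank <old-device>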

I just think it’s weird that I’ve got 1.4 GB of memory allocated to the storagenode… I need to get my netdata fixed and get my storagenode moved into a container, so I can better monitor the memory utilization over weeks and months… netdata is pretty crappy for long-term numbers… but nice for the gritty details, when it wants to work…
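Until netdata is sorted, a dumb logging loop is probably enough to get the weeks/months picture; a sketch along these lines (assumes a docker container named storagenode, and /var/log/sn-mem.csv is just wherever you want the samples to land):

#!/bin/bash
# append a timestamped memory sample for the storagenode container every 5 minutes
while true; do
    echo "$(date -Is),$(docker stats --no-stream --format '{{.MemUsage}}' storagenode)" \
        >> /var/log/sn-mem.csv
    sleep 300
done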

Alas, so what is my storagenode memory usage supposed to be, and is it stable or does it vary widely? I guess my question is… does it increase with node size and activity level… I suppose it would…

The system has 48 GB… so it’s not really a concern that it uses a bit of extra RAM now and then. And though netdata says 90% used, it’s more like 85% if you ask Proxmox, and the ARC is 21 GB of that, which the system will drop immediately if something requests more memory than is free…
So it’s not like the storagenode can chew through it quickly… especially without any noticeable ingress.
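And for anyone wanting to check or cap the ARC on a Proxmox/ZFS-on-Linux box, the gist of it is roughly this (a sketch; the 16 GiB value is just an example, not a recommendation):

# current ARC size and maximum, in bytes
grep -E '^(size|c_max)' /proc/spl/kstat/zfs/arcstats

# cap the ARC at 16 GiB (16 * 1024^3 bytes) until reboot; to make it permanent,
# set "options zfs zfs_arc_max=17179869184" in /etc/modprobe.d/zfs.conf
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max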

But I’ll wait and see if it’s an artifact / delayed reaction to the boot IOwait peak. I suppose that it saying 1.4 GB doesn’t mean it’s using 1.4 GB, just that it’s allocated for the node.
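A quick way to check that allocated-vs-used distinction (a sketch, assuming a single process actually named storagenode on the box):

# VmRSS = memory the node actually holds resident; VmSize = the full allocated address space
grep -E 'VmRSS|VmSize' /proc/$(pgrep -x -o storagenode)/status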