How to read docker stats? Node reboots when CPU and MEM are full

Hi,

these are my docker stats:

CONTAINER ID   NAME          CPU %    MEM USAGE / LIMIT     MEM %    NET I/O           BLOCK I/O         PIDS
236830d29c6f   grafana       0.71%    14.43MiB / 7.789GiB   0.18%    3.67MB / 5.06MB   0B / 0B           20
b0bb946dda59   telegraf      0.05%    13.2MiB / 7.789GiB    0.17%    635kB / 2.12MB    0B / 0B           17
17b1386b6f0a   influxdb      0.07%    72.34MiB / 7.789GiB   0.91%    8.02MB / 3.48MB   0B / 0B           14
12363a6c71fa   storagenode   43.89%   7.201GiB / 7.789GiB   92.45%   1.52GB / 1.09GB   7.24GB / 2.57GB   41
c65a89802209   watchtower    0.00%    1.645MiB / 7.789GiB   0.02%    54.7kB / 0B       0B / 0B           11

Is it normal to have such high values for MEM USAGE, NET I/O and BLOCK I/O?

Looking at the CPU and MEM usage in Grafana, I see that periodically, for about 15 minutes every hour, all CPUs, MEM and SWAP are fully used; sometimes during this period my node restarts.


and when checking docker stats I get this error (the docker CLI itself runs out of memory while starting up):

fabrizio@storenod:~$ docker stats
fatal error: runtime: out of memory
runtime stack:
runtime.throw(0x55990ed4d8fc, 0x16)
/usr/local/go/src/runtime/panic.go:617 +0x74 fp=0x7ffdcb7b55a0 sp=0x7ffdcb7b5570 pc=0x55990d76aaa4
runtime.sysMap(0xc000000000, 0x4000000, 0x559910d00ff8)
/usr/local/go/src/runtime/mem_linux.go:170 +0xc9 fp=0x7ffdcb7b55e0 sp=0x7ffdcb7b55a0 pc=0x55990d755db9
runtime.(*mheap).sysAlloc(0x559910ce7ae0, 0x2000, 0x559910ce7af0, 0x1)
/usr/local/go/src/runtime/malloc.go:633 +0x1cf fp=0x7ffdcb7b5688 sp=0x7ffdcb7b55e0 pc=0x55990d748bcf
runtime.(*mheap).grow(0x559910ce7ae0, 0x1, 0x0)
/usr/local/go/src/runtime/mheap.go:1222 +0x44 fp=0x7ffdcb7b56e0 sp=0x7ffdcb7b5688 pc=0x55990d7631c4
runtime.(*mheap).allocSpanLocked(0x559910ce7ae0, 0x1, 0x559910d01008, 0x0)
/usr/local/go/src/runtime/mheap.go:1150 +0x381 fp=0x7ffdcb7b5718 sp=0x7ffdcb7b56e0 pc=0x55990d7630b1
runtime.(*mheap).alloc_m(0x559910ce7ae0, 0x1, 0x2a, 0x6e43a318)
/usr/local/go/src/runtime/mheap.go:977 +0xc6 fp=0x7ffdcb7b5768 sp=0x7ffdcb7b5718 pc=0x55990d762706
runtime.(*mheap).alloc.func1()
/usr/local/go/src/runtime/mheap.go:1048 +0x4e fp=0x7ffdcb7b57a0 sp=0x7ffdcb7b5768 pc=0x55990d79382e
runtime.(*mheap).alloc(0x559910ce7ae0, 0x1, 0x55990d01002a, 0x7ffdcb7b5840)
/usr/local/go/src/runtime/mheap.go:1047 +0x8c fp=0x7ffdcb7b57f0 sp=0x7ffdcb7b57a0 pc=0x55990d7629dc
runtime.(*mcentral).grow(0x559910ce88e0, 0x0)
/usr/local/go/src/runtime/mcentral.go:256 +0x97 fp=0x7ffdcb7b5838 sp=0x7ffdcb7b57f0 pc=0x55990d755837
runtime.(*mcentral).cacheSpan(0x559910ce88e0, 0x7f0eb5ac4000)
/usr/local/go/src/runtime/mcentral.go:106 +0x301 fp=0x7ffdcb7b5898 sp=0x7ffdcb7b5838 pc=0x55990d755341
runtime.(*mcache).refill(0x7f0eb5ac4008, 0x2a)
/usr/local/go/src/runtime/mcache.go:135 +0x88 fp=0x7ffdcb7b58b8 sp=0x7ffdcb7b5898 pc=0x55990d754dd8
runtime.(*mcache).nextFree(0x7f0eb5ac4008, 0x559910cdd92a, 0x7f0eb5ac4008, 0x7f0eb5ac4000, 0x8)
/usr/local/go/src/runtime/malloc.go:786 +0x8a fp=0x7ffdcb7b58f0 sp=0x7ffdcb7b58b8 pc=0x55990d74940a
runtime.mallocgc(0x180, 0x55990fa45a40, 0x1, 0x559910d01060)
/usr/local/go/src/runtime/malloc.go:939 +0x780 fp=0x7ffdcb7b5990 sp=0x7ffdcb7b58f0 pc=0x55990d749d40
runtime.newobject(0x55990fa45a40, 0x4000)
/usr/local/go/src/runtime/malloc.go:1068 +0x3a fp=0x7ffdcb7b59c0 sp=0x7ffdcb7b5990 pc=0x55990d74a14a
runtime.malg(0x644dc00008000, 0x559910cea150)
/usr/local/go/src/runtime/proc.go:3220 +0x33 fp=0x7ffdcb7b5a00 sp=0x7ffdcb7b59c0 pc=0x55990d773f53
runtime.mpreinit(…)
/usr/local/go/src/runtime/os_linux.go:311
runtime.mcommoninit(0x559910ce1da0)
/usr/local/go/src/runtime/proc.go:618 +0xc6 fp=0x7ffdcb7b5a38 sp=0x7ffdcb7b5a00 pc=0x55990d76d8c6
runtime.schedinit()
/usr/local/go/src/runtime/proc.go:540 +0x78 fp=0x7ffdcb7b5a90 sp=0x7ffdcb7b5a38 pc=0x55990d76d558
runtime.rt0_go(0x7ffdcb7b5b98, 0x2, 0x7ffdcb7b5b98, 0x0, 0x7f0eb50f6b97, 0x2, 0x7ffdcb7b5b98, 0x200008000, 0x55990d7958a0, 0x0, …)
/usr/local/go/src/runtime/asm_amd64.s:195 +0x11e fp=0x7ffdcb7b5a98 sp=0x7ffdcb7b5a90 pc=0x55990d7959ce

just checked my stats, i knew they were low… but it never fails to impress me.

i get 5% cpu peaks on my 16 threads at 2.5GHz, the avg is like 1.4% or less. the ram usage is 150MB and slowly climbing over the last couple of days… but can’t say i’ve seen it much higher… maybe 250MB
i’ve run my node for 14 days straight without a reboot and without issues, but if usage continues to climb then i guess one would run into problems eventually… though it climbs so slowly, and updates (which restart the node) come so often, that it shouldn’t be possible to run into issues with that…

are you sure the problem isn’t in the way you run your VMs / containers / docker or whatnot?
anyways… old Intel xeon on linux and no issues here…

sorry didn’t look closely at the images you posted…

it’s not cpu or memory… it’s iowait… the cpu spike is an artifact of the cpu waiting for the HDDs

your drives cannot keep up… or that would be my best guess… it’s a common issue…
you might be using SMR drives, or something like that…
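
you can watch it per disk with iostat from the sysstat package… a rough sketch, the interval is just an example:

iostat -x 5
# watch %iowait in the avg-cpu line and %util / await per device;
# a drive pinned near 100% util with a high await can't keep up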


I’m using iSCSI to connect my storagenode to a QNAP over a 1 Gbit network connection.
I never had this problem before (I’ve been on Storj since Test Group B); it all began about a month ago.

Why is the memory saturated if it is a storage problem?

that i cannot answer, but the iowait stat means how much it’s waiting for the hard drives…
i guess if it is waiting for the hard drives to respond, the ram would buffer the incoming data…

do some benchmarks, check that everything seems to be working within normal parameters… you could have a bad drive messing up the array…
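
a quick-and-dirty sequential write test straight to the storage could look like this… /mnt/storj and /dev/sda are placeholders for your own mount point and device, and the test file is a throwaway:

dd if=/dev/zero of=/mnt/storj/ddtest bs=1M count=1024 oflag=direct   # oflag=direct bypasses the page cache
rm /mnt/storj/ddtest
sudo smartctl -H /dev/sda   # overall drive health check while you're at it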

it’s not always a bad thing that arrays start acting up… it might mean that your redundancy kicked in and you are now basically running without redundancy, or with limited redundancy… not sure if the developers purposely put that into the design of the arrays… it’s like broadcasting that it’s broken by showing bad bandwidth or high latency… that way everybody knows it needs tending before any data is actually lost.

As far as I can see, you have had a lot of problems with iSCSI-connected drives: https://forum.storj.io/search?expanded=true&q=%40brizio71%20%23sno-category%3Asno-troubleshoot%20%20

Have you considered moving your node directly to your QNAP and mounting the data directly into the container, without iSCSI? I believe that would solve 99.9% of the current problems.
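
Something along these lines, as a sketch only; the wallet, email, address, size and paths are placeholders, and you should take the exact command and image tag from the official documentation:

docker run -d --restart unless-stopped --name storagenode \
  -p 28967:28967 \
  -e WALLET="0x..." \
  -e EMAIL="you@example.com" \
  -e ADDRESS="your.external.address:28967" \
  -e STORAGE="2TB" \
  --mount type=bind,source=/share/storj/identity,destination=/app/identity \
  --mount type=bind,source=/share/storj/data,destination=/app/config \
  storjlabs/storagenode:latest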


Hi,

I have rsynced all my data from iSCSI to directly connected disks and the situation is about the same. The switch from iSCSI to the directly connected disks was at 13:00, and the RAM usage is getting worse:
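
For reference, the copy was done the usual way, several rsync passes and a final one with the node stopped; roughly this, with my paths as placeholders:

rsync -aP /mnt/iscsi/storagenode/ /mnt/local/storagenode/
# repeated until the delta was small, then:
docker stop -t 300 storagenode
rsync -aP --delete /mnt/iscsi/storagenode/ /mnt/local/storagenode/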

I didn’t see any improvement in CPU usage either:

and this is the iowait:

What can I do to solve this problem?

thx

Note that iowait can also indicate that the system is swapping memory, so it can point to a memory issue as well.
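
A quick way to tell the two apart is vmstat; if the si/so (swap-in/swap-out) columns are non-zero while the wa column is high, the box is swapping:

vmstat 5
# si/so = KiB of memory swapped in/out per second, wa = percentage of CPU time spent waiting for I/O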


I have tried to move the VM to different hardware, with the same result.

Could it be possible that you have an SMR drive?
The iowait is extremely high.
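
You can read the exact model from the drive itself and then look the model number up; /dev/sda is a placeholder for your device:

sudo smartctl -i /dev/sda
# the "Device Model" line shows the model number to check against SMR lists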

No, I have checked: on the iSCSI QNAP I have 4 WD RED PRO WD4001FFSX-68JNUN0 disks in RAID0,
and on the other QNAP that I’m using now with direct connection I have 3 WD RED WD60EFRX-68MYMN1 disks in RAID0. None of them are SMR drives.

That’s a bad idea. With one disk failure the whole array (and node) is lost.
However, it should not have such an impact on performance.
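
If you want to rule out a degraded or resyncing array, the md status on the NAS shell should show it (assuming QNAP’s usual md-based RAID):

cat /proc/mdstat
# a degraded or rebuilding array is visible here and would explain bad latency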

I saw you mentioned the VM. Please clarify: do you run the VM on your QNAP instead of using docker directly on your QNAP?

Yes, I run an Ubuntu VM because storagenode in a QNAP container doesn’t work correctly.

Could you elaborate?