My node was causing a memory leak and ate up all available memory, thus terminating services

litori · July 20, 2020, 10:38pm

Not sure what happened but my node: 1MhTUzg1fzdoZBAU1MNveQq1WpL7FyfQ3X5MDKBxsuXHUaJrba was causing a memory leak and ate up all available memory, thus terminating services. Had to stop it and restart the node.

Gradual memory leak over the span of 3 hours.

Alexey · July 20, 2020, 11:23pm

Please, check your databases:

Also, make sure that you do not use the network connected drive as your storage.

litori · July 20, 2020, 11:36pm

It is not a network drive. I changed into the database directory and ran the docker command from there. I also tried replacing ${PWD} with the direct path; the error is below.

ash-4.3# docker run --rm -it --mount type=bind,source=${PWD},destination=/data sstc/sqlite3 find . -iname *.db -maxdepth 1 -print0 -exec sqlite3 '{}' 'PRAGMA INTEGRITY_CHECK;' ';'
find: unrecognized: heldamount.db
BusyBox v1.31.1 () multi-call binary.

Usage: find [-HL] [PATH]... [OPTIONS] [ACTIONS]

Search for files and perform actions on them.
First failed action stops processing of current file.
Defaults: PATH is current directory, action is '-print'

        -L,-follow      Follow symlinks
        -H              ...on command line only
        -xdev           Don't descend directories on other filesystems
        -maxdepth N     Descend at most N levels. -maxdepth 0 applies
                        actions to command line arguments only
        -mindepth N     Don't act on first N levels
        -depth          Act on directory *after* traversing it

Actions:
        ( ACTIONS )     Group actions for -o / -a
        ! ACT           Invert ACT's success/failure
        ACT1 [-a] ACT2  If ACT1 fails, stop, else do ACT2
        ACT1 -o ACT2    If ACT1 succeeds, stop, else do ACT2
                        Note: -a has higher priority than -o
        -name PATTERN   Match file name (w/o directory name) to PATTERN
        -iname PATTERN  Case insensitive -name
        -path PATTERN   Match path to PATTERN
        -ipath PATTERN  Case insensitive -path
        -regex PATTERN  Match path to regex PATTERN
        -type X         File type is X (one of: f,d,l,b,c,s,p)
        -executable     File is executable
        -perm MASK      At least one mask bit (+MASK), all bits (-MASK),
                        or exactly MASK bits are set in file's mode
        -mtime DAYS     mtime is greater than (+N), less than (-N),
                        or exactly N days in the past
        -mmin MINS      mtime is greater than (+N), less than (-N),
                        or exactly N minutes in the past
        -newer FILE     mtime is more recent than FILE's
        -inum N         File has inode number N
        -user NAME/ID   File is owned by given user
        -group NAME/ID  File is owned by given group
        -size N[bck]    File size is N (c:bytes,k:kbytes,b:512 bytes(def.))
                        +/-N: file size is bigger/smaller than N
        -links N        Number of links is greater than (+N), less than (-N),
                        or exactly N
        -prune          If current file is directory, don't descend into it
If none of the following actions is specified, -print is assumed
        -print          Print file name
        -print0         Print file name, NUL terminated
        -exec CMD ARG ; Run CMD with all instances of {} replaced by
                        file name. Fails if CMD exits with nonzero
        -exec CMD ARG + Run CMD with {} replaced by list of file names
        -delete         Delete current file/directory. Turns on -depth option
        -quit           Exit
failed to resize tty, using default size

Had to run the checks separately.

ash-4.3# sqlite3 bandwidth.db "PRAGMA integrity_check;"
ok
ash-4.3# sqlite3 heldamount.db "PRAGMA integrity_check;"
ok
ash-4.3# sqlite3 notifications.db "PRAGMA integrity_check;"
ok
ash-4.3# sqlite3 orders.db "PRAGMA integrity_check;"
ok
ash-4.3# sqlite3 piece_expiration.db "PRAGMA integrity_check;"
ok
ash-4.3# sqlite3 pieceinfo.db "PRAGMA integrity_check;"
ok
ash-4.3# sqlite3 piece_spaced_used.db "PRAGMA integrity_check;"
ok
ash-4.3# sqlite3 pricing.db "PRAGMA integrity_check;"
ok
ash-4.3# sqlite3 reputation.db "PRAGMA integrity_check;"
ok
ash-4.3# sqlite3 satellites.db "PRAGMA integrity_check;"
ok
ash-4.3# sqlite3 storage_usage.db "PRAGMA integrity_check;"
ok
ash-4.3# sqlite3 used_serial.db "PRAGMA integrity_check;"
ok
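
A note on the find error earlier in this post: "find: unrecognized: heldamount.db" is consistent with the shell expanding the unquoted *.db into a list of file names before find ever runs, so find sees a stray file name where it expects an option. Below is a minimal sketch of the quoted form. The scratch directory and the two empty files are stand-ins for illustration, not the real node databases.

```shell
# Hypothetical scratch directory standing in for the node's database folder.
demo_dir="$(mktemp -d)"
touch "$demo_dir/bandwidth.db" "$demo_dir/heldamount.db"
cd "$demo_dir" || exit 1

# Unquoted, the shell would turn *.db into "bandwidth.db heldamount.db",
# and find would reject the second name as an unknown argument.
# Quoted, the pattern reaches find intact. -maxdepth goes before the tests.
matches="$(find . -maxdepth 1 -iname '*.db' | sort)"
printf '%s\n' "$matches"
```

Assuming the rest of the original command is fine, the find part run inside the container would only need the pattern quoted: find . -maxdepth 1 -iname '*.db' -print0 -exec sqlite3 '{}' 'PRAGMA integrity_check;' ';' (not verified against the sstc/sqlite3 image).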

anon27637763 · July 20, 2020, 11:50pm

sudo free -h

Most of that used memory is probably cached/buffers. This is likely due to the large amount of disk I/O… I noticed it too.

# free -h
              total        used        free      shared  buff/cache   available 
Mem:           62Gi       2.5Gi       324Mi        46Mi        60Gi        59Gi

There’s probably a bug somewhere that forgets to free up used buffers after writing to disk.
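
To make the row above easier to read, here is a small sketch that pulls out the three fields that matter for this question. It assumes the column order from the header shown (total, used, free, shared, buff/cache, available); mem_summary is a made-up helper name.

```shell
# Print the fields of the "Mem:" row that separate process memory from page cache.
# Column order assumed: total used free shared buff/cache available.
mem_summary() {
  awk '$1 == "Mem:" { printf "used=%s buff/cache=%s available=%s\n", $3, $6, $7 }'
}

# Sample row copied from the output above; on a live system, pipe `free -h` in instead.
summary="$(printf '%s\n' 'Mem:           62Gi       2.5Gi       324Mi        46Mi        60Gi        59Gi' | mem_summary)"
echo "$summary"
```

A high buff/cache value together with a high available value means the kernel can hand most of that memory back when processes ask for it; a high used value with low available is the case that points at real process memory.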

litori · July 20, 2020, 11:52pm

It was not buffer/cache. It was actual memory used. Processes started killing themselves and I had to stop the storagenode and restart it.
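
One way to back this up with numbers next time is to log the resident memory of the process itself. This is only a sketch: rss_kb is a made-up helper name, and the demo samples the current shell because the storagenode PID is not known here.

```shell
# Resident set size in KiB of one process, as reported by ps.
rss_kb() {
  ps -o rss= -p "$1" | tr -d ' '
}

# Demo on the current shell; for the node, pass the storagenode PID instead.
sample="$(rss_kb "$$")"
echo "$(date '+%H:%M:%S') rss_kb=$sample"
```

Running the last two lines every few minutes and keeping the output would show whether the growth is in the process or elsewhere.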

SGC · July 21, 2020, 8:20am

i’m running the docker version and i’ve never seen a hint of a memory leak, in fact i’ve been very impressed by its ability to keep the memory usage of the storagenode at about 100mb.
and the node has been up continuously for two weeks now.

from what i understand, though i haven’t really had problems with it myself, the memory usage can be related to write operations to disk… say your hdd cannot keep up with the iops… it doesn’t have to be by much… say it can handle 200 iops and the storagenode needs 202
so in about 2 minutes it’s roughly 1 second behind… and if we get 1.5mb of ingress a second, that is about 1.5mb waiting in memory… no big deal at first…

so let’s just call it 1mb per minute… there are 1440 minutes in a day… and let’s say your machine has at least 8gb… which gives you maybe 6 days before it would completely saturate your RAM
of course most systems will then try to swap, which could easily give you another big chunk…
i would suppose it could also damage the node if the system crashed…
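
The estimate above can be checked with a few lines of shell arithmetic. This is only a sketch of the back-of-the-envelope numbers, treating 8gb as 8192mb (an assumption) and ignoring everything else that uses RAM.

```shell
# Figures taken from the rough estimate above; 8 GB is treated as 8192 MB (an assumption).
backlog_mb_per_min=1
ram_mb=8192
minutes_per_day=1440

# Shell arithmetic is integer-only, so work in tenths of a day.
tenths=$(( ram_mb * 10 / (backlog_mb_per_min * minutes_per_day) ))
days="$(( tenths / 10 )).$(( tenths % 10 ))"
echo "RAM would fill after roughly $days days"
```

That lands a little under the 6 days mentioned, and in practice the time would be shorter since the operating system and other processes need memory too.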

anyway, my point is that even though your disk seems to be able to keep up, it might not be… and that can cause what looks like a memory leak…

one easy patch would be to set the node to restart once in a while…

excuse the hole in the middle… no clue what this is about… netdata is kinda rough around the edges, maybe it crashed or something…
and i really should get it configured correctly so i can use the graphs better…
anyway, that’s my memory usage… seems like there is basically no memory activity in the storagenode and the main part of the memory is used for rss

been checking it from time to time over the last few months… never seen it above 250mb… and that was only once i think when we had 5mb/s peak ingress