While developing my updater, I found that sometimes the node just does not stop.
The service is in the Stopping state, but the TCP listener keeps receiving data, and even after 10 minutes there is a bunch of connections still open.
The HDD does literally nothing during this time.
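If you want to confirm that state from a script rather than by eye, here is a minimal sketch using the psutil library (the `storagenode` process-name filter is my assumption; adjust it to whatever your node executable is called):

```python
import psutil

# Sketch: confirm the "listener still busy, disk idle" state.
# Assumption: the node process name contains "storagenode".
for proc in psutil.process_iter(["name"]):
    if "storagenode" in (proc.info["name"] or "").lower():
        try:
            conns = proc.connections(kind="tcp")
            io = proc.io_counters()
        except (psutil.AccessDenied, psutil.NoSuchProcess):
            continue
        print(f"PID {proc.pid}: {len(conns)} TCP connections, "
              f"read {io.read_bytes} B, written {io.write_bytes} B")
```

Running it twice a few seconds apart should show the connection count staying high while the read/write counters barely move, if it really is the state you describe.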
When I've had a node have "troubles" and die, which happens often on my slow node (bad) with 1 GB RAM (bad) and NFS-mounted shares (very bad), it also won't be killable via a docker stop or kill command (this is using Linux/Docker). It seems to happen when the node gets backed up and can't satisfy all the file write requests, and the node maxes out the available RAM.
Some quick thoughts:
Have the processes closed the shm/wal files for their *.db's? Are any stalled reporting/tally scripts running?
Is there an open file handle somewhere on the .log or .db files?
Using badger cache database files? If anything other than the respective service process has an open handle, that'd be a problem, since the service can't close them to exit.
Also, maybe double-check the current command line parameters in Task Manager for those processes: is a lazy filewalker stuck? (A scriptable way to check the handles and command lines is sketched after this list.)
I take it the logs are showing "slow shutdown"?
…and of course, what do the logs say?
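For the handle and command-line checks above, a scriptable alternative to clicking through Task Manager could look roughly like this (a sketch with the psutil library; the `storagenode` name filter and the list of file suffixes are assumptions, adjust them to your setup):

```python
import psutil

# Sketch: for each node process, print its command line (to spot a stuck
# lazy filewalker) and any open handles on .db/.db-wal/.db-shm/.log files.
SUFFIXES = (".db", ".db-wal", ".db-shm", ".log")

for proc in psutil.process_iter(["name", "cmdline"]):
    if "storagenode" not in (proc.info["name"] or "").lower():
        continue
    print(f"\nPID {proc.pid}: {' '.join(proc.info['cmdline'] or [])}")
    try:
        for f in proc.open_files():
            if f.path.lower().endswith(SUFFIXES):
                print("  open:", f.path)
    except (psutil.AccessDenied, psutil.NoSuchProcess):
        print("  (cannot list open files without elevation)")
```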
You also need to check for hardware issues; hanging on stop is usually an indication of a hardware problem in the first place, such as a dying HDD/SSD/OS drive or problems with RAM modules. Sometimes it can be caused by a bad PSU (some CPU cores may even hang).
So I would also recommend taking a look at that.
Can you terminate it from the Task Manager?
If not, then it's definitely a hardware issue. In the case of Windows it could also be a misbehaving driver, but I guess you have updated them all.
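If you'd rather do that kill attempt from a script than from Task Manager, a rough sketch with psutil (the PID here is hypothetical; take the real one from Task Manager or Resource Monitor):

```python
import psutil

# Sketch: ask the process to stop, then force-kill it if it does not exit.
pid = 12345  # hypothetical PID of the hanging node process

proc = psutil.Process(pid)
proc.terminate()           # SIGTERM on POSIX; on Windows this is already a hard kill
try:
    proc.wait(timeout=30)  # give it 30 seconds to shut down on its own
except psutil.TimeoutExpired:
    proc.kill()            # equivalent of "End task"
```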
I only have the Warn log level enabled, but no errors were found in the logs.
Task Manager killed it immediately.
Interesting that the problem affects more than one node, while other nodes on the same server are working OK without problems. The node also has a UPS. The PSU is 850 W, which is enough power, and the server is dedicated to Storj only. Drives are connected through an HBA (SAS 9300). DBs are on a Lexar 2 TB Gen4 NVMe. The HDDs are different: some WD Purple, and even Toshiba 18 TB Enterprise series. RAM is 4x16 ECC Registered memory that is supported by the motherboard. The CPU is a Xeon E5-2680.
So it is a server-based platform.
I have observed this problem before on different servers and different nodes, maybe since 1.10 or so. And I have a feeling that it happens with nodes that have not been restarted for a long time; I have not restarted this server since the last update.
Interesting. In that case it should affect more nodes than only yours.
But I haven't seen other reports about this so far. So it's something either new or specific to the local setup.
No, it consumes RAM as usual, the same as when the node is working normally. My RAM consumption depends more on the number of open TCP channels, as every channel takes 1 MB for its buffer.
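To sanity-check that rule of thumb against what the process actually uses, a rough sketch with psutil (the 1 MB-per-channel figure is the estimate from above, not a measured value, and the `storagenode` name filter is my assumption):

```python
import psutil

# Sketch: compare "1 MB per open TCP channel" against the actual RSS.
BUF_PER_CONN_MB = 1  # rule of thumb from the post above

for proc in psutil.process_iter(["name"]):
    if "storagenode" not in (proc.info["name"] or "").lower():
        continue
    try:
        established = [c for c in proc.connections(kind="tcp")
                       if c.status == psutil.CONN_ESTABLISHED]
        rss_mb = proc.memory_info().rss / (1024 * 1024)
    except (psutil.AccessDenied, psutil.NoSuchProcess):
        continue
    print(f"PID {proc.pid}: {len(established)} established connections, "
          f"~{len(established) * BUF_PER_CONN_MB} MB expected in buffers, "
          f"{rss_mb:.0f} MB RSS in use")
```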
You need to open Resource Monitor, find the hanging process there and take a look at what files it is using right now. The other way is to use Process Monitor from Sysinternals: