While developing my updater, I found that sometimes the node just does not stop.
The service is in the Stopping state, but the TCP listener keeps receiving data, and even after 10 minutes there is a bunch of connections still open.
The HDD does literally nothing during this time.
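If you want to confirm that state from a script rather than by eye, here is a minimal sketch using the psutil library (the `storagenode` process-name filter is my assumption; adjust it to whatever your node executable is called):

```python
import psutil

# Sketch: confirm the "listener still busy, disk idle" state.
# Assumption: the node process name contains "storagenode".
for proc in psutil.process_iter(["name"]):
    if "storagenode" in (proc.info["name"] or "").lower():
        try:
            conns = proc.connections(kind="tcp")
            io = proc.io_counters()
        except (psutil.AccessDenied, psutil.NoSuchProcess):
            continue
        print(f"PID {proc.pid}: {len(conns)} TCP connections, "
              f"read {io.read_bytes} B, written {io.write_bytes} B")
```

Running it twice a few seconds apart should show the connection count staying high while the read/write counters barely move, if it really is the state you describe.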
When I've had a node have "troubles" and die, which happens often on my slow node (bad) with 1 GB RAM (bad) and NFS-mounted shares (very bad), it also won't be killable via a docker stop or kill command (this is using Linux/Docker). It seems to happen when the node gets backed up and can't satisfy all the file write requests, and the node maxes out the available RAM.
Some quick thoughts:
Have the processes closed the shm/wal files for their *.db's? Are any stalled reporting/tally scripts running?
Is there an open file handle somewhere on the .log or .db files?
Using badger cache database files? If anything other than the respective service process has an open handle, that'd be a problem, since the service can't close them to exit.
Also, maybe double-check the current command line parameters in Task Manager for those processes: is a lazy filewalker stuck? (A scriptable way to check the handles and command lines is sketched after this list.)
I take it the logs are showing "slow shutdown"?
…and of course, what do the logs say?
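For the handle and command-line checks above, a scriptable alternative to clicking through Task Manager could look roughly like this (a sketch with the psutil library; the `storagenode` name filter and the list of file suffixes are assumptions, adjust them to your setup):

```python
import psutil

# Sketch: for each node process, print its command line (to spot a stuck
# lazy filewalker) and any open handles on .db/.db-wal/.db-shm/.log files.
SUFFIXES = (".db", ".db-wal", ".db-shm", ".log")

for proc in psutil.process_iter(["name", "cmdline"]):
    if "storagenode" not in (proc.info["name"] or "").lower():
        continue
    print(f"\nPID {proc.pid}: {' '.join(proc.info['cmdline'] or [])}")
    try:
        for f in proc.open_files():
            if f.path.lower().endswith(SUFFIXES):
                print("  open:", f.path)
    except (psutil.AccessDenied, psutil.NoSuchProcess):
        print("  (cannot list open files without elevation)")
```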
You also need to check for hardware issues; hanging on stop is usually an indication of a hardware problem in the first place, such as a dying HDD/SSD/OS drive or problems with RAM modules. Sometimes it can be caused by a bad PSU (some CPU cores may even hang).
So I would also recommend taking a look at that.
Can you terminate it from the Task Manager?
If not, then it's definitely a hardware issue. In the case of Windows it could also be a misbehaving driver, but I guess you have updated them all.
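If you'd rather do that kill attempt from a script than from Task Manager, a rough sketch with psutil (the PID here is hypothetical; take the real one from Task Manager or Resource Monitor):

```python
import psutil

# Sketch: ask the process to stop, then force-kill it if it does not exit.
pid = 12345  # hypothetical PID of the hanging node process

proc = psutil.Process(pid)
proc.terminate()           # SIGTERM on POSIX; on Windows this is already a hard kill
try:
    proc.wait(timeout=30)  # give it 30 seconds to shut down on its own
except psutil.TimeoutExpired:
    proc.kill()            # equivalent of "End task"
```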
I only have the Warn log level enabled, but no errors were found in the logs.
Task Manager killed it immediately.
Interesting that the problem affects more than one node, while other nodes on the same server are working OK without problems. The node also has a UPS. The PSU is 850 W, which is enough power, and the server is dedicated to Storj only. Drives are connected through an HBA (SAS 9300). DBs are on a Lexar 2 TB Gen4 NVMe. The HDDs are different: some WD Purple, and even Toshiba 18 TB Enterprise series. RAM is 4x16 ECC Registered memory that is supported by the motherboard. The CPU is a Xeon E5-2680.
So it is a server-based platform.
I have observed this problem before on different servers and different nodes, maybe since 1.10 or so. And I have a feeling that it happens with nodes that have not been restarted for a long time; I have not restarted this server since the last update.
Interesting. In that case it should affect more nodes than only yours.
But I haven't seen other reports about this so far. So it's something either new or specific to the local setup.
No, it consumes RAM as usual, the same as when the node is working normally. My RAM consumption depends more on the number of open TCP channels, as every channel takes 1 MB for its buffer.
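To sanity-check that rule of thumb against what the process actually uses, a rough sketch with psutil (the 1 MB-per-channel figure is the estimate from above, not a measured value, and the `storagenode` name filter is my assumption):

```python
import psutil

# Sketch: compare "1 MB per open TCP channel" against the actual RSS.
BUF_PER_CONN_MB = 1  # rule of thumb from the post above

for proc in psutil.process_iter(["name"]):
    if "storagenode" not in (proc.info["name"] or "").lower():
        continue
    try:
        established = [c for c in proc.connections(kind="tcp")
                       if c.status == psutil.CONN_ESTABLISHED]
        rss_mb = proc.memory_info().rss / (1024 * 1024)
    except (psutil.AccessDenied, psutil.NoSuchProcess):
        continue
    print(f"PID {proc.pid}: {len(established)} established connections, "
          f"~{len(established) * BUF_PER_CONN_MB} MB expected in buffers, "
          f"{rss_mb:.0f} MB RSS in use")
```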
You need to open Resource Monitor, find the hanging process there and take a look at what files it is using right now. The other way is to use Process Monitor from Sysinternals: