Node does not stop and spams gigabytes into the logs

When I stop two of my nodes (version 1.35.3, not the latest, but still within the supported range between 1.24.0 and 1.37.1),
they do not stop, but start writing to the logs at high speed, something like this

The logs reach the following sizes quickly enough:

-rw-r--r-- 1 root root 4.2G Sep 17 02:05 node02.log
-rw-r--r-- 1 root root 4.3G Sep 17 02:04 node03.log

I had reason to believe that the problem could be corrupted databases.

I ran PRAGMA integrity_check and VACUUM on them. Everything went without problems: I launched the node, the log was perfect, I waited, I stopped it, and the problem reproduced.
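
For reference, the checks can be scripted roughly like this (a minimal sketch in Go, assuming a hypothetical /mnt/ssd/db directory and the mattn/go-sqlite3 driver; the sqlite3 CLI does the same job):

// check_dbs.go: run PRAGMA integrity_check and VACUUM on every
// storagenode *.db file in a directory (paths are examples only).
package main

import (
	"database/sql"
	"fmt"
	"log"
	"path/filepath"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	paths, err := filepath.Glob("/mnt/ssd/db/*.db") // adjust to your database directory
	if err != nil {
		log.Fatal(err)
	}
	for _, path := range paths {
		db, err := sql.Open("sqlite3", path)
		if err != nil {
			log.Fatalf("%s: %v", path, err)
		}
		// integrity_check returns a single row "ok" on a healthy database
		var result string
		if err := db.QueryRow("PRAGMA integrity_check;").Scan(&result); err != nil {
			log.Fatalf("%s: %v", path, err)
		}
		fmt.Printf("%s: %s\n", path, result)
		// VACUUM rebuilds the file and reclaims free pages
		if _, err := db.Exec("VACUUM;"); err != nil {
			log.Fatalf("%s: %v", path, err)
		}
		db.Close()
	}
}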

I erased all the DBs and started the nodes without a database; the databases were created successfully, the log was perfect, I stopped the node, and the problem reproduced.

I am writing this text during a break, while the remote machine is being soft rebooted and trying to stop the node processes.

…Next I updated the storagenode binary, erased the DBs, and the problem reproduced.

I am writing this post because it seems to me that the problem may be potentially serious for the entire network. I do not see any mistake of my own that could lead to such serious consequences.

Just pay attention to it.
If the problem is mine and I lose these nodes (my very first and largest ones), no problem, I do not ask to restore them.

Affected nodes:
node02 - 1M1zL3YVVAsdLtvcTbpxhTahmp9jrhiw3GsQwAmukSsnLmWv7U
node03 - 1boQ78NbzEmNq59UwABfYJH1TzjBG7TvnVqjz5ho95FcdchAww

This is the short version; the long version with the detailed timings is in the collapsed section below.

Summary


I have never seen such behavior on any of the versions. This problem looks very suspicious:

It looks like an underlying OS issue. Can you replace/reinstall the underlying OS?
The lighter option is to try docker.
If that is not possible, please create a bug here: Issues · storj/storj · GitHub

The process receives SIGTERM but spams the logs in an infinite loop until I send SIGABRT.
And yet my operating system is to blame? Where is the logic here?
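
What I would expect on SIGTERM is something like the usual shutdown-with-deadline pattern, roughly sketched below (not the actual storagenode code, just an illustration with an assumed 30-second deadline):

// Sketch of a graceful-shutdown pattern: cancel on SIGTERM and
// force-exit if cleanup takes longer than a fixed deadline.
package main

import (
	"context"
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, os.Interrupt)
	defer stop()

	<-ctx.Done() // SIGTERM (or Ctrl-C) received

	done := make(chan struct{})
	go func() {
		cleanup() // flush logs, close databases, stop workers
		close(done)
	}()

	select {
	case <-done:
		log.Println("clean shutdown")
	case <-time.After(30 * time.Second):
		log.Println("shutdown deadline exceeded, exiting anyway")
		os.Exit(1)
	}
}

func cleanup() {
	time.Sleep(time.Second) // placeholder for real cleanup work
}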

Please, can you at least try the docker version?

I use the Linux binary. There is no docker.

I understood that from your first post. I am just trying to troubleshoot. Docker is a good way to isolate a process.

This is the end of the GiB+ log line, if you wait long enough.

I think it’s worth dividing the problem into several parts

  1. The root of the problem is an error in the storagenode process. It is most likely related to the slowdown of the disks holding the database, and to the general disregard this software shows for system resources during moments of high traffic load or repair.

  2. As a result of this error, another error generates gigabytes of error messages.

  3. This whole line is first written to memory, if there is enough of it (I have a lot).

  4. A system process then writes this long line from memory to the log file on disk.
    That is why the writing continues for a long time after the storagenode process is killed, and why it slows down the reboot.

Please solve a simple problem first: limit a log entry to a couple of hundred characters. I think many SNOs, and the entire network, would feel better for it.
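
A rough sketch of the kind of guard I mean (a hypothetical helper, not storagenode code; the 300-character limit is just an example):

// Hypothetical helper: truncate every message before it reaches the
// logger, so one failing subsystem cannot produce megabyte-long lines.
package main

import (
	"fmt"
	"log"
)

const maxLogLine = 300 // "a couple of hundred characters"

func logTruncated(format string, args ...interface{}) {
	msg := fmt.Sprintf(format, args...)
	if len(msg) > maxLogLine {
		msg = fmt.Sprintf("%s... [truncated, %d bytes total]", msg[:maxLogLine], len(msg))
	}
	log.Println(msg)
}

func main() {
	huge := make([]byte, 1<<20) // pretend this is a giant error payload
	logTruncated("order verification failed: %x", huge)
}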

@Alexey Sorry, I have no extra time to write up the bug on GitHub.


I can report a bug on your behalf, but since no one else has this problem, our developers will probably ask questions and I won't have answers.
I cannot even reproduce it. Would you mind sharing some information: how is your disk connected, that it is so slow?


zfs raidz1 (3+1) for the data + an SSD mirror for the special device.
The DBs and orders were moved to another partition on the SSD.
This worked for two years or more.

Now one of the SSDs has failed. I can't say whether it broke because of this error, or whether the error started to appear because it broke. At the moment the databases have been returned to the main array (actually they were deleted and the storagenode recreated them) and the bad SSD has been removed. The problem still reproduces a few hours after the node starts.

I hope you have not lost anything important.

Everything is still the same for me.
The time it takes for these nodes to get stuck like this varies greatly.
I can give you a shell, would you like to dig deeper yourself?

Why don't you just limit the log line length, as I advised?


All this memory is occupied by the log from one node.

Please create an issue on GitHub or make a pull request with the needed changes. All contributions are welcome!

We have had no such reports so far from thousands of operators, so it doesn't look like a code issue; I suspect a local (specifically, hardware/OS) issue.