Of course it is stuck because the node is not running. Just check the timestamps.
These are the last log lines these nodes are currently showing.
I have the same log entries and the node is running. They tell you that: it downloads the version number (that formulation is somewhat misleading); your versions of storagenode and storagenode-updater; and that the new version is rolled out for both services but the rollout pointer hasn't reached your node ID yet. It does not tell you that the update started for your node.
And it seems that your nodes are checking for new versions only once per day. I remember the default was 1h; I set mine to 6h.
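If you want to double-check how often your own node checks, one hedged way is to grep the docker logs for the rollout message quoted later in this thread and compare the timestamps (the container name storagenode is an assumption, adjust it to yours):

# Assumption: the container is named "storagenode"; change the name to match your setup.
docker logs storagenode 2>&1 | grep "rolled out" | tail -n 5
# The gaps between the timestamps of these lines show how often the version check actually runs.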
That's great for you. These nodes are not running.
Something else is wrong then.
What does the dashboard say?
What log options do you use?
Even with log.level on fatal, these entries are still logged.
Yes and I am telling you again that these nodes are not running.
What is so hard about that to understand?
These are the ends of the current logs. No activity after that, no more log lines.
If you compare, these are basically the same messages and the same situation that OP has reported.
Next time that happens, can you see if the inbound port is still open?
Most log entries are from inbound connections asking for something: and if inbound port-forwarding stops working… my guess is you may be left with those sparse log entries (from the few automated tasks a node decides to do for itself, like looking for upgrades). So maybe when you're restarting your nodes you're also restarting a VPN connection?
I'm guessing…
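One rough way to check, run from a machine outside your network (hedged: 28967 is only the default node port and the address is a placeholder, use your real external address and port):

# Succeeds only if the port is reachable from the internet.
nc -vz your.external.address 28967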
I am sure a restart will help.
But first I must fix something else, then I will try it.
I can't suggest anything based on those log entries, because they don't show that anything is wrong. And you don't provide anything else. I don't know your setup or config parameters. I can't do the debugging for you.
Reading above I see that someone found a database lock entry. So there was a database problem.
Here, I just see normal log entries and the fact that you say the nodes don't work.
If Windows doesn't log things the way Linux does, maybe the service really gets stuck without any info to work with.
Try what others suggest: restart the machine, services, VPN, etc., and modify the parameters of the storagenode service so it retries more than once if it doesn't start. I know I had problems with the service starting when I ran a Windows node 3 years ago, and I modified the start/restart options to something like: try 5 restarts, 1 min apart.
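For reference, a sketch of how that can be set from an elevated prompt with sc.exe; the service name storagenode and the exact timings are assumptions, adjust them to your install:

# Restart the service automatically on failure: three attempts, 60 s apart,
# and reset the failure counter after one day (86400 s).
sc.exe failure storagenode reset= 86400 actions= restart/60000/restart/60000/restart/60000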
For database locks, and in general, moving the databases to an SSD or a USB 3 SSD stick helps with everything.
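A hedged sketch of the config.yaml entry I mean (storage2.database-dir is the option for relocating the databases; the path is only an example):

# config.yaml: keep the SQLite databases on a faster disk.
# Stop the node, move the existing *.db files to this folder, then start it again.
storage2.database-dir: "D:\\storagenode-dbs"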
Also, Windows has a very complex logging system. See if you get some warnings or errors there, in Computer Management (Event Viewer) I believe.
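If you prefer PowerShell over clicking through Computer Management, something like this should surface relevant entries (hedged: the filter string is just a guess at what the service writes):

# Show recent Application-log events that mention the storagenode service.
Get-WinEvent -LogName Application -MaxEvents 200 |
    Where-Object { $_.Message -match "storagenode" } |
    Format-Table TimeCreated, LevelDisplayName, Message -AutoSize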
It is exactly the situation that the OP has described: the node gets stuck, and the last things it logs are the log lines that he and I have posted.
There is nothing else after that. Docker says the container is still running, but it does not do anything.
It is not the first time this has happened and usually a restart brings the container back to a working state.
Since this problem seems to be related to auto-update, it might help to set AUTO_UPDATE to false.
How is this related to auto-update? No line says "update started" or something.
What log level do you run? Any custom level settings? I'm trying to figure out whether your log level filters out the important entries.
The last line in the log comes from the storagenode-updater and the next (missing) line would also be from the storagenode-updater, so how could this not be related to auto-update?
I don't think this is a bug in the updater. This is quite surely a stopped node. Probably somewhere earlier in the log you got a timeout error. But one way or another, the storagenode doesn't always restart. The only thing working now is the updater, which is being restarted every now and then, as set in the config.
I have had them on Linux as well. It had to do with a slow file system.
That is a great analysis of the situation: it is not the updater that causes the problem, it's rather the only thing left that is still running and still logging stuff.
Ok then, the question would be why the storagenode did not restart. But that is another story.
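One way to confirm that picture the next time it happens: list which processes are actually alive inside the container while only the updater keeps logging (the container name is an assumption):

# If only the supervisor and storagenode-updater show up here,
# the storagenode process itself has died and was not restarted.
docker top storagenode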
I stopped asking, because I got unreadable logs with a timeout error some hours before this happened. See one of my latest topics.
It probably has to do with a slow filesystem. Therefore I'm migrating to ZFS, in order to split metadata from the actual data and speed a few things up.
The question whether the node doesn't restart after a timeout… Well… I don't know. Actually I don't care either. Because if they did restart, they would be restarted at least once a day, which would also break every filewalker. So this problem needs to be solved on my side instead of Storj's side. Maybe the choice-of-best-n is going to solve some things.
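For the metadata split, the ZFS feature I mean is the special allocation class; a rough sketch with pool, dataset and device names as placeholders:

# Add a mirrored special vdev that will hold the pool's metadata (new writes only).
zpool add tank special mirror /dev/disk/by-id/ssd-1 /dev/disk/by-id/ssd-2
# Optionally also push small blocks (here: up to 64K) of the node's dataset to the SSDs.
zfs set special_small_blocks=64K tank/storagenode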
These logs don't have entries from the storagenode process at all; this usually means that you redirected the logs to a file. In that case the docker logs command will show only logs from the supervisor and the storagenode-updater, but the storagenode logs will be in the file.
So, please check your redirected log file for errors explaining why your node crashed.
However, I would also prefer to know why it's not restarted after…
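For example, if the log was redirected via log.output in config.yaml, something like this should surface the crash (the path /app/config/node.log is an assumption, use whatever you configured):

# Last lines of the redirected storagenode log, then a filter for fatal entries.
docker exec storagenode tail -n 100 /app/config/node.log
docker exec storagenode grep -E "ERROR|FATAL|Unrecoverable" /app/config/node.log | tail -n 20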
storagenode2-storagenode-1 | 2024-06-25T07:45:32Z INFO New version is being rolled out but hasn't made it to this node yet {"Process": "storagenode-updater", "Service": "storagenode-updater"}
I keep getting this error on one of my two nodes. This makes the dashboard go down and the entire service go down. I don't run watchtower as it didn't work for some reason. How can I fix this?
Do you have something in the storagenode logs? If you redirected the logs to a file, you need to check that file instead of docker logs storagenode.
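If you are not sure whether (or where) the logs were redirected, checking the log.output setting should tell you (the path assumes the standard docker setup where config.yaml is mounted at /app/config):

# "stderr" means docker logs has everything; a file path means you must read that file instead.
docker exec storagenode grep "log.output" /app/config/config.yaml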