Please see my previous post; I marked the non-critical errors with red arrows.
Ah, I see that now, interesting. I don’t have color set in my config.yaml and I’ve set log.level: info. Upon upgrading to 1.19.6, I also saw the Invalid configuration value for key error for log.level, but that has gone away with the upgrade to 1.20.2. What is log.level set to for you?
I have log.level: info too.
I migrated 8 nodes to run without Docker. They have worked without problems and have been updated since version 1.16.1.
Tonight, when updating to version 1.21, all files were downloaded without problems, but the node did not start afterwards. There are no errors in the logs; the node simply did not start.
Do you have any information from the updater log?
@littleskunk I can confirm that the storagenode service is not started after the update:
Here is the updater log. As you can see, the updater restarted the storagenode service. We also have some new “Invalid configuration options” messages.
Here is storagenode log:
So there is no restart here, only a stop. Please make sure that the updater restarts the storage node instead of just stopping it.
I looked into storj/restart_linux.go at 89e682b4d73dc2b1c4623e174095d3d441ceb1b6 · storj/storj · GitHub
and I see only a “stop” in the restart function:
ADD:
Based on the latest restart issue, I would like to propose changes to the service units:
Now we have:
[Unit]
Description = Storage Node service
After = syslog.target network.target
[Service]
Type = simple
User = storj-storagenode
Group = storj-storagenode
ExecStart = /opt/storagenode/bin/storagenode run --config-dir "/etc/storagenode/config"
Restart = on-failure
NotifyAccess = main
[Install]
Alias = storagenode
WantedBy = multi-user.target
After change:
[Unit]
Description = Storage Node service
After = syslog.target network.target
[Service]
Type = simple
User = storj-storagenode
Group = storj-storagenode
ExecStart = /opt/storagenode/bin/storagenode run --config-dir "/etc/storagenode/config"
# Give a reasonable amount of time for the server to start up/shut down
TimeoutSec = 300
RestartSec = 30
#Restart = on-failure
Restart = always
NotifyAccess = main
[Install]
Alias = storagenode
WantedBy = multi-user.target
Description for new parameters is here
TimeoutSec = 300
RestartSec = 30
Restart = always
The same parameters can be applied to the storagenode-updater service.
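Instead of editing the unit file itself, the three new parameters could also go into a systemd drop-in, which survives a reinstall of the unit file. This is only a sketch: it assumes the unit is installed as storagenode.service, and the helper name write_restart_dropin and the file name restart.conf are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: writes a drop-in with the proposed restart settings
# into the directory given as $1 (the directory is created if missing).
write_restart_dropin() {
    dropin_dir="$1"
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/restart.conf" <<'EOF'
[Service]
# Give a reasonable amount of time for the server to start up/shut down
TimeoutSec = 300
RestartSec = 30
Restart = always
EOF
}

# Example (as root), assuming the unit name storagenode.service:
#   write_restart_dropin /etc/systemd/system/storagenode.service.d
#   systemctl daemon-reload
```

A later Restart = line in a drop-in replaces the Restart = on-failure from the main unit file, so the original file can stay untouched.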
When updating to the new version 1.21.2, the nodes again failed to start. I’ll try changing the service settings.
@littleskunk I’d like to draw your attention to this again: the storagenode updater is not starting the storagenode service after an update.
Here is an example:
As you can see, the service starts after 30 seconds because I applied the workaround to the service.
The root cause is simple: the updater service kills itself and exits with exit code 1 (failure). After the exit, systemd restarts the updater service because it failed (exit code 1).
The storagenode service is stopped by the updater service with exit code 0 (success). After the exit, systemd does nothing (the service stays stopped) because we have Restart = on-failure.
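The difference between the two services can be modelled in a few lines of shell. This is a simplified sketch of how Restart = reacts to the exit code of a main process that exits on its own. It ignores signals, timeouts and the watchdog, and would_restart is a hypothetical helper, not part of systemd or the storagenode.

```shell
#!/bin/sh
# Simplified model of systemd's Restart= decision, based on exit codes only.
# would_restart is a hypothetical helper used for illustration.
would_restart() {
    policy="$1"      # value of Restart= (no, on-failure, always)
    exit_code="$2"   # exit code of the service's main process
    case "$policy" in
        always)
            echo yes ;;
        on-failure)
            # Only a non-zero exit code counts as a failure in this model.
            if [ "$exit_code" -ne 0 ]; then echo yes; else echo no; fi ;;
        *)
            echo no ;;
    esac
}

would_restart on-failure 1   # updater exits with 1: prints "yes"
would_restart on-failure 0   # storagenode exits with 0: prints "no"
would_restart always 0       # storagenode with the workaround: prints "yes"
```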
Solution: please add a start of the storagenode service to the restart function in the updater service.
Could you please confirm this issue?
Thank you very much for the heads-up. I’ll be sure to keep an eye out when my storage nodes update.
I can confirm this issue. I had it today, and it caused about 8 hours of downtime on one of my nodes.
The updater service started downloading, seems to have stopped the main node service, then failed and got restarted by systemd, but the main node service was not restarted because the updater service failed.
No idea what caused it, though.
I changed the unit files of both services to Restart = always as a workaround.
I can confirm that one of my nodes failed to restart after updating to 1.21.2
Same here, yes. It failed to start after the upgrade.
Thank you for reporting this issue. We are going to update the documentation to instruct users to set their configuration to Restart=always.
What’s the link to the latest documentation, please?
I would pay more attention to a potentially critical bug: abnormal memory consumption by “systemd-journal”:
Yesterday:
Today:
I’m still working on determining the root cause and preventing it, and will post a solution soon ™
P.S. You can check memory consumption on your side with:
ps -A --sort -rss -o comm,pmem,rss | head -n 20
PSA: the v1.22.2 ARM binary zip file appears to have two files in it, causing the following error during the update.
2021-02-17T14:25:53.336-0700 ERROR Error updating service. {"Service": "storagenode", "error": "archive should contain only one file", "errorVerbose": "archive should contain only one file\n\tmain.unpackBinary:94\n\tmain.downloadBinary:61\n\tmain.update:44\n\tmain.loopFunc:26\n\tstorj.io/common/sync2.(*Cycle).Run:152\n\tmain.cmdRun:126\n\tstorj.io/private/process.cleanup.func1.4:363\n\tstorj.io/private/process.cleanup.func1:381\n\tgithub.com/spf13/cobra.(*Command).execute:842\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:950\n\tgithub.com/spf13/cobra.(*Command).Execute:887\n\tstorj.io/private/process.ExecWithCustomConfig:88\n\tstorj.io/private/process.Exec:65\n\tmain.main:11\n\truntime.main:204"}
The zip file for the AMD64 binary does not have this issue:
The __MACOSX folder is likely tripping up the updater service.
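Before updating manually, it may help to check whether the __MACOSX entries are the only extras in an archive. A rough sketch follows: count_payload_files is a hypothetical helper, it assumes a listing with one entry per line (for example from Info-ZIP’s unzip -Z1, if that tool is installed), and the entry names in the example are illustrative rather than taken from the real archive.

```shell
#!/bin/sh
# Hypothetical helper: reads an archive listing (one entry per line) on stdin
# and prints how many entries remain after ignoring everything under
# __MACOSX/ and plain directory entries (names ending in "/").
count_payload_files() {
    awk '!/^__MACOSX\// && !/\/$/ && NF { n++ } END { print n + 0 }'
}

# Illustrative listing: one binary plus macOS metadata.
printf '%s\n' 'storagenode' '__MACOSX/' '__MACOSX/._storagenode' | count_payload_files   # prints 1

# With a real archive (the file name is an assumption):
#   unzip -Z1 storagenode_linux_arm.zip | count_payload_files
```

A result of 1 would suggest the archive holds a single real file and only the metadata folder is in the way.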
I’ve manually updated my ARM node for this version.
@littleskunk Does anyone know if there’s an installer yet? Sorry to sound like a broken record, but I’ve finally got my internet connection at the new house and I’m ready to set up all my nodes again after running GE last year.
Also, if there’s no installer, should I set up my new production nodes using this? Is this production ready?
Any issues with playing it safe and setting my new nodes up with Docker? Will the new Multi-node Dashboard work with a Docker setup?
Cheers
@will.topping, for what it’s worth, I’ve been using this since v1.15.3 and it’s been just as stable as the Docker nodes. I know a .deb or something is preferable, but the installation isn’t too onerous. As linked by @ACarneiro last month, I’ll pat myself on the back for the following instructions. A couple of binary downloads, a dedicated user, and a couple of systemd service files, and it’ll be up and running. Perhaps something to hold you over until an installer is released.