v1.50.3 is ready for the storage node rollout. This time we have the storage node updater inside docker to test. The risks are high: worst case, docker nodes will not start up, or they will get stuck when updating to the following release. We need your help to mitigate these risks on as many test machines as possible.
The new docker image is already published but not tagged as latest, so your existing docker node will not install it yet. We would like to run two test rounds with your help.
For the first test round, please set up a test node and join our public test network.
Later this week we are going to deploy v1.51.0-rc on the test network. Hopefully, your test node will update just fine.
After the first test round, there is still the risk that we crash production. So in a second test round, we would like to update as many mainnet nodes as possible before the docker image gets tagged as the latest.
The last step, after we have finished both test rounds, will be a cleanup. For both test rounds you will have to specify the specific docker image tag 5f2777af9-v1.50.4-go1.17.5. At the end, please reconfigure your system to switch back to the latest docker image.
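For reference, pinning the test tag and reverting afterwards might look like this (assuming the usual storjlabs/storagenode image name; all other flags from your existing, working run command are omitted here and should stay unchanged):

```shell
# Test rounds: pin the node to the specific test image tag.
docker run -d --name storagenode \
    storjlabs/storagenode:5f2777af9-v1.50.4-go1.17.5

# Cleanup after both test rounds: switch back to the rolling tag.
docker run -d --name storagenode \
    storjlabs/storagenode:latest
```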
Not a great start on my test node. I swapped out the image and it got stuck restarting with the following repeated logs.
I also noticed that it’s connecting to the production version server instead of the test one for the initial download of the updater. Should we be setting VERSION_SERVER_URL in the run command? It seems to default to production in the Dockerfile atm.
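If the entrypoint does honor that environment variable, the workaround would presumably look like this (assumption: VERSION_SERVER_URL is read during the initial download; the QA URL is the one that shows up in the logs further down this thread; other run flags omitted):

```shell
# Point the initial updater download at the test (QA) version server
# instead of the production default:
docker run -d --name storagenode \
    -e VERSION_SERVER_URL=https://version.qa.storj.io \
    storjlabs/storagenode:5f2777af9-v1.50.4-go1.17.5
```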
I’ll revert for now.
Let me provide some context. This is on a Synology DS3617xs, which is amd64 architecture. The only thing I changed in my (working) run command was the image tag.
Getting the same result. Should I be using a new image?
Also, this seems like a pretty big flaw in the system. If for some reason the URL isn’t reachable or the file doesn’t exist, recreating the container would simply make it impossible for the node to start again. Could a fallback be built in that keeps a copy of the last working binaries in the storage location and reverts to them if something like that happens? Otherwise I’m really not feeling good about this approach.
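The fallback being suggested could be sketched like this (a sketch only, not the actual entrypoint: the paths are illustrative, and the download step is whatever command the real image uses):

```shell
#!/bin/sh
# Sketch: keep a copy of the last working binary in the storage location,
# and revert to it if a fresh download fails, so recreating the container
# never leaves the node unable to start.
TARGET="${TARGET:-/app/storagenode-updater}"                    # illustrative path
BACKUP="${BACKUP:-/app/config/storagenode-updater.last-working}" # kept in storage dir

run_update() {
    # "$@" is whatever command actually downloads the binary to $TARGET.
    if "$@"; then
        cp "$TARGET" "$BACKUP"   # download worked: remember this binary
        echo updated
    elif [ -f "$BACKUP" ]; then
        cp "$BACKUP" "$TARGET"   # download failed: revert to last good copy
        echo reverted
    else
        echo "download failed and no fallback available" >&2
        return 1
    fi
}
```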
I also agree with @CutieePie that this isn’t really how docker should work. You already use a custom version of watchtower; wouldn’t it be possible to make watchtower aware of the latest version and have it use the rollout system to determine when to trigger automated updates of the node? I’m aware that this wouldn’t block SNOs from updating earlier, but how many would really do that manually? And if they do, how likely is it that they will do it very quickly?
I just really don’t like that I could end up with a container that is unable to even download the binary.
Ok great. I made a second mistake as well and forgot to publish the v1.50.3 release. That is now also fixed, and I see no more reason why it shouldn’t find the binaries.
There is another option for that, which I already set. But the entrypoint script uses the environment variable before the binaries ever run, so for now you have to set it twice. I would suggest using the environment variable for both. As you can see from the log I posted, setting the environment variable made it use the QA server, but it still didn’t work.
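The "use the environment variable for both" suggestion would amount to something like the following entrypoint fragment (a sketch; the flag name and the elided download URL are assumptions, not the actual image contents):

```shell
#!/bin/sh
# Hypothetical entrypoint fragment: one environment variable drives both the
# initial download and the updater binary itself.
VERSION_SERVER_URL="${VERSION_SERVER_URL:-https://version.storj.io}"

# Initial download of the updater (real URL path elided):
wget -O /tmp/storagenode-updater.zip "$VERSION_SERVER_URL/..."

# Pass the same value on to the binary (flag name is an assumption):
exec /app/storagenode-updater run --version.server-address "$VERSION_SERVER_URL"
```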
The issue linked above implies that this fixes a problem of being unable to push a fix for bad code to docker containers. Does that mean this updater will only be used in rare cases of known bad code as opposed to a general updater? I would consider this a much better idea than having a self-updating docker container. That would at a minimum increase the difficulty of downgrading.
I’ve been trying for hours but am still unable to reproduce the issue, at least on Docker Desktop for Mac.
I’m still unable to reproduce this with both the test and production version server URLs:
downloading storagenode-updater
Connecting to version.qa.storj.io (35.188.169.133:443)
writing to stdout
- 100% |********************************| 92 0:00:00 ETA
written to stdout
Connecting to github.com (140.82.121.3:443)
Connecting to objects.githubusercontent.com (185.199.111.133:443)
saving to '/tmp/storagenode-updater.zip'
storagenode-updater. 100% |********************************| 7910k 0:00:00 ETA
...