[Tech Preview] Storagenode Updater inside Docker

v1.50.3 is ready for the storage node rollout. This time we have the storage node updater inside docker to test. The risks are high: in the worst case, docker nodes will not start up or will get stuck when updating to the following release. We need your help to mitigate these risks on as many test machines as possible. Please help us.

The new docker image is already published but not tagged as the latest. So your existing docker node will not install it yet. We would like to run 2 tests with your help.

For the first test round, please set up a test node and join the test network (instructions here: Please join our public test network).
Later this week we are going to deploy v1.51.0-rc on the test network. Hopefully, your test node will update just fine.

After the first test round, there is still the risk that we crash production. So in a second test round, we would like to update as many mainnet nodes as possible before the docker image gets tagged as the latest.

The last step, after we finish both test rounds, will be a cleanup. For both test rounds, you will have to specify a specific docker image tag: 5f2777af9-v1.50.4-go1.17.5. At the end, please reconfigure your system to switch back to the latest docker image.
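For illustration, the two reconfigurations might look like this. This is only a sketch: "storagenode" and the "..." are placeholders for your actual container name and run options; the test tag is the one named above.

```shell
# Test rounds: recreate the container on the pinned test tag.
docker pull storjlabs/storagenode:5f2777af9-v1.50.4-go1.17.5
docker run -d --name storagenode ... storjlabs/storagenode:5f2777af9-v1.50.4-go1.17.5

# Cleanup: switch back to the latest tag once both rounds are done.
docker stop storagenode && docker rm storagenode
docker pull storjlabs/storagenode:latest
docker run -d --name storagenode ... storjlabs/storagenode:latest
```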

Hmm, I’m not sure whose idea this was at Storj, but this isn’t very dockerish - having applications update themselves beyond the tagged version of the docker image…

The whole beauty of docker is being confident that you can pull an exact image and version of software and know what it contains. I don’t like this behaviour; I like to be in control of my node and when it updates…

Please make sure you provide a switch to disable this behaviour for SNOs who don’t want it - although it’s easy to prevent in docker if you don’t :smiley:

Thanks

CP


This issue might give context on why we decided to do this: Support slow storage node version rollout for all existing Linux/Docker installations · Issue #4489 · storj/storj · GitHub


Not a great start on my test node. I swapped out the image, and it got stuck restarting with the following repeated logs.

[screenshot: repeated startup error logs]
I also noticed that it’s connecting to the production version server instead of the test one for the initial download of the updater. Should we be setting VERSION_SERVER_URL in the run command? It seems to default to production in the Dockerfile at the moment.

I’ll revert for now.

Let me provide some context. This is on a Synology DS3617xs, which is amd64 architecture. The only thing I changed in my (working) run command was the image tag.

docker run -d --restart unless-stopped --stop-timeout 300 -p 28968:28967/tcp -p 28968:28967/udp -p 14003:14002 \
    -e WALLET="redacted" \
    -e EMAIL="redacted" \
    -e ADDRESS="redacted:28968" \
    -e STORAGE="1TB" \
    --mount type=bind,source="/volume1/storj/test/identity",destination=/app/identity \
    --mount type=bind,source="/volume1/storj/test/data",destination=/app/config \
    --name storagenodetest storjlabs/storagenode:92ab23202-v1.50.3-go1.17.5 \
    --operator.wallet-features="zksync" \
    --log.level="debug" \
    --log.output="/app/config/node.log" \
    --server.use-peer-ca-whitelist="false" \
    --storage2.trust.sources="1GGZktUwmMKTwTWNcmGnFJ3n7rjE58QnNcRp98Y23MmbDnVoiU@satellite.qa.storj.io:7777,12ZQbQ8WWFEfKNE9dP78B1frhJ8PmyYmr8occLEf1mQ1ovgVWy@testnet.satellite.stefan-benten.de:7777" \
    --version.server-address="https://version.qa.storj.io"

Hi clement,

Thanks for the background - don’t get me wrong, I completely get it. I’m just trying to provide some constructive help, to avoid a lot of dev headache.

  • Again, I will say that your approach is not what happens at the Enterprise level - there’s a good reason for that, and for why you won’t find many open-source examples of this. Docker can be very naughty, and it’s an absolute pain to try to do what the team might be aiming at - before you know it, you have written a complete system-monitoring package to try to cover all the scenarios that occur when the container’s main process breaks.

Some other thoughts for you - again, just trying to offer some options :slight_smile:

  1. Think about adopting a plugin approach for the Storagenode codebase - that way the docker image can be a major release, but it downloads plugins to address these enhancements. This has the benefit of reducing the element of risk, and it also keeps you from trying to restart the main init process (a nightmare, honestly).

  2. Think about decoupling from the satellite release schedule - looking at GitHub, the number of major changes to Storagenode over the past 18 months hardly justifies a rolling 4-5 week upgrade cycle. It creates work for the dev team, and Enterprise customers like to see stability in the network.

  3. There are other docker repos available - you could very easily host the Storagenode latest in another repo. You would then be able to rate-limit pull requests, geo-target them, and ask SNOs to add extra tags to control the upgrade stream they are in - and you would be able to back out changes very quickly… This is much more in tune with how docker intends these things to work.

  4. You could just ask the SNOs with docker to run a version behind :slight_smile: - as probably only a few percent of them frequent this forum, it might give you a good split. I’m not sure about all the other SNOs, but I manually upgrade and always run a version behind - I surely can’t be the only one out of 13k nodes! … If I am, can I have a special forum badge please :slight_smile:

Anyhow, it’s good that you are testing this approach and communicating with us.

CP


Oh no, I made a mistake: version control was pointing to a non-existent version. It should be fixed now. Please retry.


Getting the same result. Should I be using a new image?

[screenshot: same repeated error logs as before]

Also, this seems like a pretty big flaw in the system. If for some reason the URL isn’t reachable or the file doesn’t exist, recreating the container would make it impossible for the node to start again. Could a fallback be built in that keeps a copy of the last working binaries in the storage location and reverts to them if something like that happens? Otherwise I’m really not feeling good about this approach.
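A fallback like that could live in the entrypoint. Here is a minimal sh sketch of the idea - all paths and the download_binary stand-in are hypothetical, not Storj’s actual entrypoint code:

```shell
#!/bin/sh
# Hypothetical sketch: keep a last-known-good copy of the binary and
# fall back to it when the download fails. Paths are assumptions.
DIR=/tmp/storagenode-demo
BIN="$DIR/storagenode"
BACKUP="$DIR/storagenode.last-good"
mkdir -p "$DIR"

# Pretend an earlier, working run saved a backup of the binary.
printf 'v1.50.3 binary' > "$BACKUP"

download_binary() {
    # Stand-in for fetching the binary from the version server;
    # returns failure here to exercise the fallback path.
    return 1
}

if download_binary; then
    cp "$BIN" "$BACKUP"   # refresh the last-known-good copy
elif [ -f "$BACKUP" ]; then
    echo "download failed, falling back to last working binary"
    cp "$BACKUP" "$BIN"   # revert instead of crash-looping
else
    echo "download failed and no backup exists" >&2
    exit 1
fi
```

A real script would download to a temporary file first and only refresh the backup once the new binary has actually started successfully.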

I also agree with @CutieePie that this isn’t really how docker should work. You already use a custom version of watchtower - wouldn’t it be possible to make watchtower aware of the latest version and have it use the rollout system to determine when to trigger automated updates of the node? I’m aware that this wouldn’t block SNOs from updating earlier, but how many would really do that manually? And if they do, how likely is it that they will do it very quickly?

I just really don’t like that I could end up with a container that is unable to even download the binary.

Ok great. I made a second mistake as well and forgot to publish the v1.50.3 release. That is now also fixed, and I see no more reason why it shouldn’t find the binaries.

Yes you should set VERSION_SERVER_URL in the run command to point to https://version.qa.storj.io

Any way to opt out of this? It looks bad to me.

Will do! This should probably be added here as well then: Please join our public test network

Tried again and also set the VERSION_SERVER_URL in the run command, but same result.
[screenshot: same error output]

Also tried without the VERSION_SERVER_URL set and get the same error (with different url of course).

Or maybe:

--version.server-url="https://version.qa.storj.io"

There is another option for that, which I already set. But the entrypoint script uses the environment variable before the binaries ever run, so for now you have to set it twice. I would suggest using the environment variable for both. As you can see from the log I posted, setting the environment variable made it use the QA server - but it still didn’t work.
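Until that changes, a run command for the test network would carry the setting in both places. A sketch only - the "..." elides the usual mounts and options, and the tag is the one from earlier in the thread:

```shell
# VERSION_SERVER_URL is read by the entrypoint before any binary runs;
# --version.server-address is read by storagenode/storagenode-updater themselves.
docker run -d ... \
    -e VERSION_SERVER_URL="https://version.qa.storj.io" \
    storjlabs/storagenode:92ab23202-v1.50.3-go1.17.5 \
    --version.server-address="https://version.qa.storj.io"
```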

The issue linked above implies that this fixes a problem of being unable to push a fix for bad code to docker containers. Does that mean this updater will only be used in rare cases of known bad code as opposed to a general updater? I would consider this a much better idea than having a self-updating docker container. That would at a minimum increase the difficulty of downgrading.

Downgrading is a bad idea. If there are database migrations, they will not be reverted, and your downgraded version would not work.

I’ve been trying to reproduce the issue for hours, but so far I’m unable to - at least on Docker Desktop for Mac - with both the test and production version server URLs:

docker run -d --restart unless-stopped --stop-timeout 300 -p 14003:14002 -p 28968:28967/tcp -p 28968:28967/udp \
 -e WALLET="0xXX" \
 -e EMAIL="email" \
 -e ADDRESS="dns:28968" \
 -e STORAGE="1TB" \
 -e VERSION_SERVER_URL="https://version.qa.storj.io" \
 --mount type=bind,source="...",destination=/app/identity \
 --mount type=bind,source="...",destination=/app/config \
 --name nodetestnet \
 storjlabs/storagenode:35efb6462-go1.17.5

The output is as expected:

downloading storagenode-updater
Connecting to version.qa.storj.io (35.188.169.133:443)
writing to stdout
-                    100% |********************************|    92  0:00:00 ETA
written to stdout
Connecting to github.com (140.82.121.3:443)
Connecting to objects.githubusercontent.com (185.199.111.133:443)
saving to '/tmp/storagenode-updater.zip'
storagenode-updater. 100% |********************************| 7910k  0:00:00 ETA
...

Anything I can do on my end to help figure it out?

What happens when you change the entrypoint to sh and connect to the QA version server with wget?

This should open the terminal inside the container:

docker run -i -t --name testnetwork --rm --entrypoint sh storjlabs/storagenode:35efb6462-go1.17.5

Connect to the QA version server (quote the URL, otherwise the shell treats the `&` as a background operator):

wget -O- "version.storj.io/processes/storagenode/minimum/url?os=linux&arch=amd64"

If wget fails to connect, install GNU wget and try again:

apk add wget

wget -O- "version.storj.io/processes/storagenode/minimum/url?os=linux&arch=amd64"
$ docker run -i -t --name testnetwork --rm --entrypoint sh storjlabs/storagenode:35efb6462-go1.17.5
Unable to find image 'storjlabs/storagenode:35efb6462-go1.17.5' locally
35efb6462-go1.17.5: Pulling from storjlabs/storagenode
Digest: sha256:57afdc03a5e31106bc929bd24c201150dcfffecaab68be65c9f45a04e5776940
Status: Downloaded newer image for storjlabs/storagenode:35efb6462-go1.17.5
/app # wget -O- version.storj.io/processes/storagenode/minimum/url?os=linux&arch=amd64
/app # Connecting to version.storj.io (35.224.88.204:80)
Connecting to version.storj.io (35.224.88.204:443)
wget: error getting response: Connection reset by peer
apk add wget
fetch https://dl-cdn.alpinelinux.org/alpine/v3.15/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.15/community/x86_64/APKINDEX.tar.gz
(1/3) Installing libunistring (0.9.10-r1)
(2/3) Installing libidn2 (2.3.2-r0)
(3/3) Installing wget (1.21.2-r2)
Executing busybox-1.34.1-r3.trigger
OK: 67 MiB in 38 packages
[1]+  Done(1)                    wget -O- version.storj.io/processes/storagenode/minimum/url?os=linux
/app # wget -O- version.storj.io/processes/storagenode/minimum/url?os=linux&arch=amd64
/app #
Redirecting output to 'wget-log'.
/app #

I did also notice you were using a different image than in the top post.

Uh, so my suspicion has been confirmed. It is an issue with BusyBox wget, so we have to switch to GNU wget in the storagenode-base image.
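In the Alpine-based image, that switch could be as small as a one-line change - a sketch only, assuming the base image’s Dockerfile; the GNU package simply shadows the BusyBox applet:

```dockerfile
# Install GNU wget so the TLS download no longer fails with the
# "connection reset by peer" seen with BusyBox wget above.
RUN apk add --no-cache wget
```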