v1.50.3 is ready for the storage node rollout. This time we have the storage node updater inside docker to test. The risks are high: worst case, docker nodes will not start up, or they will get stuck when updating to the following release. We need your help to mitigate these risks on as many test machines as possible.
The new docker image is already published but not tagged as latest, so your existing docker node will not install it yet. We would like to run two test rounds with your help.
For the first test round, please set up a test node and join our public test network.
Later this week we are going to deploy v1.51.0-rc on the test network. Hopefully, your test node will update just fine.
After the first test round, there is still the risk that we crash production. So in a second test round, we would like to update as many mainnet nodes as possible before the docker image gets tagged as the latest.
The last step, after we have finished both test rounds, will be a cleanup. For both test rounds you will have to specify the specific docker image tag 5f2777af9-v1.50.4-go1.17.5. At the end, please reconfigure your system to switch back to the latest docker image.
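For reference, pinning the test tag and reverting afterwards might look like this (assuming the usual storjlabs/storagenode image name; all other flags from your existing, working run command are omitted here and should stay unchanged):

```shell
# Test rounds: pin the node to the specific test image tag.
docker run -d --name storagenode \
    storjlabs/storagenode:5f2777af9-v1.50.4-go1.17.5

# Cleanup after both test rounds: switch back to the rolling tag.
docker run -d --name storagenode \
    storjlabs/storagenode:latest
```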
Not a great start on my test node. I swapped out the image and it got stuck restarting with the following repeated logs.
I also noticed that it’s connecting to the production version server instead of the test one for the initial download of the updater. Should we be setting VERSION_SERVER_URL in the run command? It seems to default to production in the Dockerfile atm.
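If the entrypoint does honor that environment variable, the workaround would presumably look like this (assumption: VERSION_SERVER_URL is read during the initial download; the QA URL is the one that shows up in the logs further down this thread; other run flags omitted):

```shell
# Point the initial updater download at the test (QA) version server
# instead of the production default:
docker run -d --name storagenode \
    -e VERSION_SERVER_URL=https://version.qa.storj.io \
    storjlabs/storagenode:5f2777af9-v1.50.4-go1.17.5
```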
I’ll revert for now.
Let me provide some context. This is on a Synology DS3617xs, which is amd64 architecture. The only thing I changed in my (working) run command was the image tag.
Getting the same result. Should I be using a new image?
Also, this seems like a pretty big flaw in the system. If for some reason the URL isn’t reachable or the file doesn’t exist, recreating the container would simply make it impossible for the node to start again. Could a fallback be built in that keeps a copy of the last working binaries in the storage location and reverts to them if something like that happens? Otherwise I’m really not feeling good about this approach.
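The fallback being suggested could be sketched like this (a sketch only, not the actual entrypoint: the paths are illustrative, and the download step is whatever command the real image uses):

```shell
#!/bin/sh
# Sketch: keep a copy of the last working binary in the storage location,
# and revert to it if a fresh download fails, so recreating the container
# never leaves the node unable to start.
TARGET="${TARGET:-/app/storagenode-updater}"                    # illustrative path
BACKUP="${BACKUP:-/app/config/storagenode-updater.last-working}" # kept in storage dir

run_update() {
    # "$@" is whatever command actually downloads the binary to $TARGET.
    if "$@"; then
        cp "$TARGET" "$BACKUP"   # download worked: remember this binary
        echo updated
    elif [ -f "$BACKUP" ]; then
        cp "$BACKUP" "$TARGET"   # download failed: revert to last good copy
        echo reverted
    else
        echo "download failed and no fallback available" >&2
        return 1
    fi
}
```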
I also agree with @CutieePie that this isn’t really how docker should work. You already use a custom version of watchtower; wouldn’t it be possible to make watchtower aware of the latest version and have it use the rollout system to determine when to trigger automated updates of the node? I’m aware that this wouldn’t block SNOs from updating earlier, but how many would really do that manually? And if they do, how likely is it that they will do it very quickly?
I just really don’t like that I could end up with a container that is unable to even download the binary.
Ok great. I made a second mistake as well and forgot to publish the v1.50.3 release. That is now also fixed, and I see no more reason why it shouldn’t find the binaries.
There is another option for that, which I already set. But the entrypoint script uses the environment variable before the binaries ever run, so for now you have to set it twice. I would suggest using the environment variable for both. As you can see from the log I posted, setting the environment variable made it use the QA server, but it still didn’t work.
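The "use the environment variable for both" suggestion would amount to something like the following entrypoint fragment (a sketch; the flag name and the elided download URL are assumptions, not the actual image contents):

```shell
#!/bin/sh
# Hypothetical entrypoint fragment: one environment variable drives both the
# initial download and the updater binary itself.
VERSION_SERVER_URL="${VERSION_SERVER_URL:-https://version.storj.io}"

# Initial download of the updater (real URL path elided):
wget -O /tmp/storagenode-updater.zip "$VERSION_SERVER_URL/..."

# Pass the same value on to the binary (flag name is an assumption):
exec /app/storagenode-updater run --version.server-address "$VERSION_SERVER_URL"
```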
The issue linked above implies that this fixes a problem of being unable to push a fix for bad code to docker containers. Does that mean this updater will only be used in rare cases of known bad code as opposed to a general updater? I would consider this a much better idea than having a self-updating docker container. That would at a minimum increase the difficulty of downgrading.
I’ve been trying for hours but am still unable to reproduce the issue, at least on Docker Desktop for Mac.
I’m still unable to reproduce this with both the test and production version server URLs:
downloading storagenode-updater
Connecting to version.qa.storj.io (35.188.169.133:443)
writing to stdout
- 100% |********************************| 92 0:00:00 ETA
written to stdout
Connecting to github.com (140.82.121.3:443)
Connecting to objects.githubusercontent.com (185.199.111.133:443)
saving to '/tmp/storagenode-updater.zip'
storagenode-updater. 100% |********************************| 7910k 0:00:00 ETA
...