[Tech Preview] Storagenode Updater inside Docker

@clement

I get the same error message with VERSION_SERVER_URL="https://version.qa.storj.io" but it works fine with VERSION_SERVER_URL=https://version.qa.storj.io
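In case it helps anyone else: if the variable comes from a compose environment: list or an env file, there is no shell involved, so the quotes end up as a literal part of the value and the URL becomes invalid; on a plain docker run command line the shell strips them first. A rough sketch (image name is just a placeholder):

# on a shell command line, both forms pass the same value,
# because the shell removes the quotes before docker sees them:
docker run -e VERSION_SERVER_URL=https://version.qa.storj.io <image>
docker run -e VERSION_SERVER_URL="https://version.qa.storj.io" <image>

# in a compose environment: list or an env file there is no shell, so:
#   - VERSION_SERVER_URL="https://version.qa.storj.io"   <- quotes become part of the value
#   - VERSION_SERVER_URL=https://version.qa.storj.io     <- works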

But at the end the process just hangs:

testnode_1   | 2022-03-17T22:57:34.734Z INFO    Running on version      {"Service": "storagenode-updater", "Version": "v1.50.3"}
testnode_1   | 2022-03-17T22:57:34.734Z INFO    Downloading versions.   {"Server Address": "https://version.qa.storj.io"}
testnode_1   | 2022-03-17T22:57:35.212Z INFO    Current binary version  {"Service": "storagenode", "Version": "v1.48.2"}
testnode_1   | 2022-03-17T22:57:35.212Z INFO    Download started.       {"From": "https://github.com/storj/storj/releases/download/v1.50.3/storagenode_linux_amd64.zip", "To": "/tmp/storagenode_linux_amd64.1411460718.zip"}
testnode_1   | 2022-03-17T22:57:38.131Z INFO    Download finished.      {"From": "https://github.com/storj/storj/releases/download/v1.50.3/storagenode_linux_amd64.zip", "To": "/tmp/storagenode_linux_amd64.1411460718.zip"}
testnode_1   | 2022-03-17 22:57:38,131 INFO waiting for processes, storagenode, storagenode-updater to die
testnode_1   | 2022-03-17T22:57:38.155Z INFO    Restarting service.     {"Service": "storagenode"}
testnode_1   | 2022-03-17T22:57:38.540Z ERROR   Error updating service. {"Service": "storagenode", "error": "error stopping storagenode service: strconv.Atoi: parsing \"unix:///run/supervisord.sock no such file\": invalid syntax", "errorVerbose": "error stopping storagenode service: strconv.Atoi: parsing \"unix:///run/supervisord.sock no such file\": invalid syntax\n\tmain.restartService:37\n\tmain.update:68\n\tmain.loopFunc:27\n\tstorj.io/common/sync2.(*Cycle).Run:92\n\tmain.cmdRun:128\n\tstorj.io/private/process.cleanup.func1.4:363\n\tstorj.io/private/process.cleanup.func1:381\n\tgithub.com/spf13/cobra.(*Command).execute:852\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:960\n\tgithub.com/spf13/cobra.(*Command).Execute:897\n\tstorj.io/private/process.ExecWithCustomConfig:88\n\tstorj.io/private/process.Exec:65\n\tmain.main:12\n\truntime.main:255"}
testnode_1   | 2022-03-17T22:57:38.552Z INFO    Current binary version  {"Service": "storagenode-updater", "Version": "v1.50.3"}
testnode_1   | 2022-03-17T22:57:38.553Z INFO    Version is up to date   {"Service": "storagenode-updater"}
testnode_1   | 2022-03-17T22:57:38.771Z ERROR   contact:service contact/service.go:103  ping satellite failed   {"Satellite ID": "12ZQbQ8WWFEfKNE9dP78B1frhJ8PmyYmr8occLEf1mQ1ovgVWy", "attempts": 4, "error": "ping satellite: check-in ratelimit: node rate limited by id", "errorVerbose": "ping satellite: check-in ratelimit: node rate limited by id\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:136\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:98\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:92\n\tstorj.io/common/sync2.(*Cycle).Start.func1:71\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
testnode_1   | storj.io/storj/storagenode/contact.(*Service).pingSatellite
testnode_1   |  /go/src/storj.io/storj/storagenode/contact/service.go:103
testnode_1   | storj.io/storj/storagenode/contact.(*Chore).updateCycles.func1
testnode_1   |  /go/src/storj.io/storj/storagenode/contact/chore.go:87
testnode_1   | storj.io/common/sync2.(*Cycle).Run
testnode_1   |  /go/pkg/mod/storj.io/common@v0.0.0-20220131120956-e74f624a3d55/sync2/cycle.go:92
testnode_1   | storj.io/common/sync2.(*Cycle).Start.func1
testnode_1   |  /go/pkg/mod/storj.io/common@v0.0.0-20220131120956-e74f624a3d55/sync2/cycle.go:71
testnode_1   | golang.org/x/sync/errgroup.(*Group).Go.func1
testnode_1   |  /go/pkg/mod/golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57
testnode_1   | 2022-03-17 22:57:41,773 INFO waiting for processes, storagenode, storagenode-updater to die
testnode_1   | 2022-03-17 22:57:44,776 WARN killing 'storagenode-updater' (104) with SIGKILL
testnode_1   | 2022-03-17 22:57:44,777 INFO waiting for processes, storagenode, storagenode-updater to die
testnode_1   | 2022-03-17 22:57:44,780 INFO stopped: storagenode-updater (terminated by SIGKILL)
testnode_1   | 2022-03-17T22:57:44.780Z INFO    process/exec_conf.go:116        Got a signal from the OS: "terminated"
testnode_1   | 2022-03-17T22:57:44.781Z INFO    contact:service contact/service.go:107  context cancelled       {"Satellite ID": "12ZQbQ8WWFEfKNE9dP78B1frhJ8PmyYmr8occLEf1mQ1ovgVWy"}
testnode_1   | 2022-03-17 22:57:44,805 INFO stopped: storagenode (exit status 0)
testnode_1   | 2022-03-17 22:57:44,806 INFO stopped: processes (terminated by SIGTERM)

Now the docker container is still running, so Docker doesn’t restart it. Not a great ending.
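For anyone who wants to dig into that strconv.Atoi error: it looks like the updater tried to parse a PID out of supervisorctl’s output and got the error text instead, because the control socket wasn’t there. A quick way to check from the host (container name taken from the log prefix above, adjust to whatever docker ps shows; the rest are supervisorctl defaults):

docker exec testnode_1 ls -l /run/supervisord.sock
docker exec testnode_1 supervisorctl -s unix:///run/supervisord.sock status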

Neither of those works on my end, unfortunately.

I’m a little confused: if we’re all using the same Docker image, we should all be using the same wget version. Why am I the only one getting this error?

The Alpine Docker image uses the built-in BusyBox wget by default, which, unlike GNU wget, cannot handle SSL traffic through a proxy. It seems like you’re behind a firewall/proxy?
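You can check which wget an image actually ships; on Alpine the BusyBox applets are just symlinks to /bin/busybox (replace <container> with your container name):

docker exec <container> sh -c 'which wget; realpath "$(which wget)"'
# BusyBox wget resolves to /bin/busybox; GNU wget is its own binary

# if proxy support is needed, GNU wget can be installed from the Alpine repos:
docker exec <container> apk add --no-cache wget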

Not that I’m aware of, for a proxy that is. Isn’t everyone behind a firewall? I do use double NAT because of ISP complications, and I guess there are firewalls on both routers and on the Synology NAS.

I guess it doesn’t matter that much; I can’t be the only one running into this, so I think it will need to be fixed before taking this to production, or quite a few nodes will go dead.

A new version is available: 5f2777af9-v1.50.4-go1.17.5

It seems to work for me. I noticed one minor issue when running as a non-root user. I will have to compare some log outputs to rule out a mistake on my end.

That seems to have done the trick.

I am getting an error though, maybe the same one you had?

2022-03-18T20:08:33.067Z FATAL Unrecoverable error {"error": "Error during preflight check for storagenode databases: preflight: database: \"storage_usage\": failed inserting test value: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).preflight:428\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).Preflight:353\n\tmain.cmdRun:241\n\tstorj.io/private/process.cleanup.func1.4:363\n\tstorj.io/private/process.cleanup.func1:381\n\tgithub.com/spf13/cobra.(*Command).execute:852\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:960\n\tgithub.com/spf13/cobra.(*Command).Execute:897\n\tstorj.io/private/process.ExecWithCustomConfig:88\n\tstorj.io/private/process.ExecCustomDebug:70\n\tmain.main:474\n\truntime.main:255", "errorVerbose": "Error during preflight check for storagenode databases: preflight: database: \"storage_usage\": failed inserting test value: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).preflight:428\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).Preflight:353\n\tmain.cmdRun:241\n\tstorj.io/private/process.cleanup.func1.4:363\n\tstorj.io/private/process.cleanup.func1:381\n\tgithub.com/spf13/cobra.(*Command).execute:852\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:960\n\tgithub.com/spf13/cobra.(*Command).Execute:897\n\tstorj.io/private/process.ExecWithCustomConfig:88\n\tstorj.io/private/process.ExecCustomDebug:70\n\tmain.main:474\n\truntime.main:255\n\tmain.cmdRun:243\n\tstorj.io/private/process.cleanup.func1.4:363\n\tstorj.io/private/process.cleanup.func1:381\n\tgithub.com/spf13/cobra.(*Command).execute:852\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:960\n\tgithub.com/spf13/cobra.(*Command).Execute:897\n\tstorj.io/private/process.ExecWithCustomConfig:88\n\tstorj.io/private/process.ExecCustomDebug:70\n\tmain.main:474\n\truntime.main:255"}

Despite being logged as a FATAL error, the node seems to run just fine afterwards.

I am running my storage node as a restricted user now. That also works. For reference, here is my docker-compose file:

version: "3.3"
services:
  multinode:
    restart: unless-stopped
    user: 1000:1000
    image: storjlabs/multinode:latest
    ports:
      - 15002:15002/tcp
    volumes:
      - /home/storagenode/.local/share/storj/identity/multinode:/app/identity
      - /home/storagenode/.local/share/storj/multinode:/app/config
  testnode:
    restart: unless-stopped
    user: 1000:1000
    image: storjlabs/storagenode:5f2777af9-v1.50.4-go1.17.5
    ports:
      - 12012:14002/tcp
      - 12019:12019/tcp
      - 12010:12010/tcp
      - 12010:12010/udp
    volumes:
      - /mnt/testnode/storagenode/identity/storagenode:/app/identity
      - /mnt/testnode/storagenode/storagenode:/app/config
    environment:
      - STORJ_SERVER_ADDRESS=:12010
      - STORJ_DEBUG_ADDR=:12019
      - STORJ_CONTACT_EXTERNAL_ADDRESS=<my ddns name>:12010
      - STORJ_OPERATOR_WALLET=0x0000000000000000000000000000000000000000
      - STORJ_OPERATOR_EMAIL=<my email>
      - STORJ_LOG_LEVEL=debug
      - STORJ_LOG_CALLER=true
      - STORJ_LOG_STACK=true
      - STORJ_SERVER_USE_PEER_CA_WHITELIST=false
      - STORJ_STORAGE2_TRUST_SOURCES=1GGZktUwmMKTwTWNcmGnFJ3n7rjE58QnNcRp98Y23MmbDnVoiU@satellite.qa.storj.io:7777,12ZQbQ8WWFEfKNE9dP78B1frhJ8PmyYmr8occLEf1mQ1ovgVWy@testnet.satellite.stefan-benten.de:7777
      - STORJ_VERSION_SERVER_ADDRESS=https://version.qa.storj.io
      - VERSION_SERVER_URL=https://version.qa.storj.io
      - STORJ_ALLOCATED_DISK_SPACE=100GB
      - STORJ_STORAGE2_MONITOR_MINIMUM_DISK_SPACE=0B
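One note for anyone copying this: with user: 1000:1000 the bind-mounted identity and config directories have to be readable and writable by that uid, so something along these lines is needed on the host first (paths from the file above):

sudo chown -R 1000:1000 /mnt/testnode/storagenode
sudo chown -R 1000:1000 /home/storagenode/.local/share/storj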

I did notice these two lines in the docker logs.

2022-03-18 20:08:27,212 CRIT Supervisor is running as root.  Privileges were not dropped because no user is specified in the config file.  If you intend to run as root, you can set user=root in the config file to avoid this message.
2022-03-18 20:08:27,218 CRIT Server 'inet_http_server' running without any HTTP authentication checking

That kind of sounds like it might be a security issue, but I guess it depends on how it’s used.
Edit: it looks like it’s only listening on localhost, so it shouldn’t matter: storj/supervisord.conf at 5f2777af9714de21e4c68a74d152f117c7d09588 · storj/storj · GitHub
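For anyone who wants to verify that themselves, you can look at what the HTTP server is bound to inside the container (BusyBox netstat should be available in the Alpine image; <container> is a placeholder):

docker exec <container> netstat -tln
# the inet_http_server should show up bound to 127.0.0.1:<port>, not 0.0.0.0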

Additionally, I hope some others jump in on testing this. This topic hasn’t gotten much response from people actually testing it, and I think you would need a somewhat broader testing base, especially since this can’t really be rolled out in production to just a small controlled sample of nodes (or at least it couldn’t last time, I guess).

Supervisord logs this at the critical level because running without authentication checks is assumed to be a critical security concern. It is safe to ignore as long as we do not allow HTTP access from outside the container; since it is configured to listen only on localhost, it’s reasonable to run without HTTP authentication checks.

We intentionally unset user=root in supervisord.conf to allow running the container as a non-root user. When the container runs as root, supervisord automatically runs as root as well, but logs this warning. If the container is run as a restricted user, supervisord runs with that user’s privileges.

Mkay, both of those make sense to me. It’s a bit unfortunate that it’s logged as CRIT, since that may lead to questions, but I don’t see a simple way around that either. I thought user=root might have been a possible fix, but there is a good reason not to use it, so that’s indeed not an option.

There is a suggested workaround of setting a dummy user/password in the [unix_http_server] and [supervisorctl] sections of the config file to remove the CRIT log entries: CRIT Server 'unix_http_server' running without any HTTP authentication checking · Issue #717 · Supervisor/supervisor · GitHub
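For reference, the workaround from that issue boils down to something like this (the username/password values are arbitrary placeholders, not anything the image ships; the socket path matches the one from the error above):

[unix_http_server]
file=/run/supervisord.sock
username=dummy
password=dummy

[supervisorctl]
serverurl=unix:///run/supervisord.sock
username=dummy
password=dummy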

I don’t really like this suggestion.

That’s missing the most important word, haha. Well, it’s up to you, of course. But with many people looking at logs, I think there will be questions. If it doesn’t impact anything else, I would probably use this workaround. But as it’s not actually anything critical, it isn’t that important.

Hmm, I haven’t found a concrete answer to that question. I got this “FATAL” message once today as well. Anything to be concerned about? @BrightSilence @littleskunk

It hasn’t caused any problems for me; it just happened once and the node kept working. Still, FATAL errors seem like a legitimate concern to me. For me it was happening on a test node. Is it happening on production nodes for you? Maybe @littleskunk or @clement can look into that one. As you can see, I didn’t really get a response to that error here; either it didn’t worry them or it got overlooked.

I think it’s probably more of an annoyance than a concern unless you get stuck in a loop of these errors. It sounds like the check is a bit flaky.
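If someone does get stuck in such a loop, it should be possible to turn the preflight database check off while investigating. A sketch, assuming the preflight.database-check config option maps to an env var the same way as the other STORJ_* settings in the compose file above (the exact variable name is my guess, so verify it against config.yaml):

      # assumed mapping of preflight.database-check
      - STORJ_PREFLIGHT_DATABASE_CHECK=false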

Yes, it seems it was thrown (once) right after (or before) the latest update was done. Maybe it’s linked to the update process?