2nd node not working after updating to version 1.16.1

Second node on same server no longer working after updating to 1.16.1:

Nothing has been changed. 1st node runs on ports 7778/28967, 2nd node on 7779/28968.

First node is still running without problem.

The 2nd node now runs into an error:
Unrecoverable error {"error": "rpc: dial tcp 127.0.0.1:7779: connect: connection refused", "errorVerbose": "rpc: dial tcp 127.0.0.1:7779: connect: connection refused\n\tstorj.io/common/rpc.TCPConnector.DialContextUnencrypted:107\n\tstorj.io/common/rpc.Dialer.dialTransportUnencrypted:178\n\tstorj.io/common/rpc.Dialer.dialUnencrypted.func1:161\n\tstorj.io/common/rpc/rpcpool.(*Conn).newConn:129\n\tstorj.io/common/rpc/rpcpool.(*Conn).getConn:158\n\tstorj.io/common/rpc/rpcpool.(*Conn).Invoke:194\n\tstorj.io/common/rpc/rpctracing.(*TracingWrapper).Invoke:31\n\tstorj.io/common/pb.(*drpcPieceStoreInspectorClient).Dashboard:1087\n\tmain.(*dashboardClient).dashboard:42\n\tmain.cmdDashboard:70\n\tstorj.io/private/process.cleanup.func1.4:362\n\tstorj.io/private/process.cleanup.func1:380\n\tgithub.com/spf13/cobra.(*Command).execute:842\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:950\n\tgithub.com/spf13/cobra.(*Command).Execute:887\n\tstorj.io/private/process.ExecWithCustomConfig:88\n\tstorj.io/private/process.ExecCustomDebug:70\n\tmain.main:336\n\truntime.main:204"}
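For reference: the "connection refused" comes from the dashboard binary dialing 127.0.0.1:7779, which just means nothing is listening there, i.e. the storagenode process itself is not up. A minimal check, assuming the port numbers from above (adjust to your own setup):

# list listening TCP sockets and filter for the second node's ports
sudo ss -tlnp | grep -E '7779|28968'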

Try basic troubleshooting with the checklist

Make sure you didn't add or remove a space in your command (if it's Docker).

It is not Docker. And as I have said, I did not change anything except replacing the storagenode and dashboard binaries with the new versions.
Node 1 is running. Node 2 always enters a failed state after the following:

storagenode[9952]: /usr/local/go/src/net/fd_unix.go:172 +0x45
storagenode[9952]: net.(*TCPListener).accept(0xc0001e31e0, 0xc00077c758, 0xc0002d2bb0, 0xc00067c470)
storagenode[9952]: /usr/local/go/src/net/tcpsock_posix.go:139 +0x32
storagenode[9952]: net.(*TCPListener).AcceptTCP(0xc0001e31e0, 0xc00077c7a8, 0xaa1b50, 0xc0002d2b80)
storagenode[9952]: /usr/local/go/src/net/tcpsock.go:248 +0x65
storagenode[9952]: storj.io/storj/pkg/server.(*userTimeoutListener).Accept(0xc0000b6088, 0x0, 0x0, 0x0, 0x0)
storagenode[9952]: /go/src/storj.io/storj/pkg/server/listener.go:40 +0x32
storagenode[9952]: storj.io/storj/pkg/listenmux.(*Mux).monitorBase(0xc0001e00e0)
storagenode[9952]: /go/src/storj.io/storj/pkg/listenmux/mux.go:112 +0x65
storagenode[9952]: created by storj.io/storj/pkg/listenmux.(*Mux).Run
storagenode[9952]: /go/src/storj.io/storj/pkg/listenmux/mux.go:88 +0xe5
storagenode[9952]: goroutine 1286 [select]:
storagenode[9952]: storj.io/common/sync2.Sleep(0x107c5c0, 0xc0003d3350, 0x24852b99d, 0xc000648000)
storagenode[9952]: /go/pkg/mod/storj.io/common@v0.0.0-20201014090530-c4af8e54d5c4/sync2/sleep.go:16 +0x115
storagenode[9952]: storj.io/storj/storagenode/nodestats.(*Cache).sleep(0xc0002883f0, 0x107c5c0, 0xc0003d3350, 0xc000ad2ca0, 0xc000ad2cf0)
storagenode[9952]: /go/src/storj.io/storj/storagenode/nodestats/cache.go:241 +0x5d
storagenode[9952]: storj.io/storj/storagenode/nodestats.(*Cache).Run.func2(0x107c5c0, 0xc0003d3350, 0x107c5c0, 0xc0003d3350)
storagenode[9952]: /go/src/storj.io/storj/storagenode/nodestats/cache.go:115 +0x5e
storagenode[9952]: storj.io/common/sync2.(*Cycle).Run(0xc0001889c0, 0x107c2c0, 0xc00034f7c0, 0xc0003d3320, 0x0, 0x0)
storagenode[9952]: /go/pkg/mod/storj.io/common@v0.0.0-20201014090530-c4af8e54d5c4/sync2/cycle.go:92 +0x168
storagenode[9952]: storj.io/common/sync2.(*Cycle).Start.func1(0xc000072aa4, 0xf9a808)
storagenode[9952]: /go/pkg/mod/storj.io/common@v0.0.0-20201014090530-c4af8e54d5c4/sync2/cycle.go:71 +0x45
storagenode[9952]: golang.org/x/sync/errgroup.(*Group).Go.func1(0xc00071e1b0, 0xc0008521e0)
storagenode[9952]: /go/pkg/mod/golang.org/x/sync@v0.0.0-20200625203802-6e8e738ad208/errgroup/errgroup.go:57 +0x59
storagenode[9952]: created by golang.org/x/sync/errgroup.(*Group).Go
storagenode[9952]: /go/pkg/mod/golang.org/x/sync@v0.0.0-20200625203802-6e8e738ad208/errgroup/errgroup.go:54 +0x66
systemd[1]: storj01.service: main process exited, code=exited, status=2/INVALIDARGUMENT
systemd[1]: Unit storj01.service entered failed state.

These are new log lines that I had not seen before v1.16.1.
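If it helps, the full crash output is in the systemd journal. A minimal sketch for pulling it, using the unit name shown in the log above (adjust it if your second node uses a different unit):

# current state of the unit plus the last chunk of its log
sudo systemctl status storj01.service
sudo journalctl -u storj01.service -n 500 --no-pager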

How could this only affect node 2 and not also node 1?

Is it possible to switch back to v1.15.3?

Does anyone have an idea? The node keeps erroring and is basically offline.

Yes, you can go back to 1.15.3; I tested it with my node and saw no issues. But it's probably not recommended, since 1.16.1 does add a new database.

Thanks. I have just tested.
Now with v1.15.3 the 2nd node runs normally as before.

@Alexey, @nerdatwork
So this clearly looks to me like an issue with v1.16.1.

I haven't tested running 2 nodes with the binaries yet, so I can't confirm.

Just a question for the future: how do I go back to a certain version?

IIRC 1.16.1 adds a secret DB, and reverting to a previous version might negatively impact your node. I am not sure what the secret DB does, but I wanted you to know.

Thanks. But what else can I do? 1.16.1 obviously does not work for this node.

The second node is running on the wrong internal port. The docker run command should use -p 7778:28967 and -p 7779:28967.

Neither node is running on Docker; both run from the binaries only.

Don't go backwards in versions… it's really dangerous for your databases.
It might not kill your node this time around, but you really cannot know what kind of damage it could cause.
Though it is probably unlikely to fully kill your node, it might ruin all the databases if you are really unlucky.

Common practice is to never roll back software that manages databases unless the software is certified for, or built with, that capability in mind.

Of course, sometimes one doesn't follow the rules to make stuff work :smiley: but then one had better understand the risks involved.

I wouldn't advise ever rolling back versions, especially if your node isn't new. If you're just testing, who cares, but if you're running this node long term, don't go back once you update. With the binary install you can easily just delete the binary and replace it with a different version.
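For reference, a rough sketch of what replacing the binary looks like on a systemd setup. The unit name, binary path, and the location of the downloaded binary are assumptions; adjust them to your install, and keep in mind the warnings above about rolling back across a database change:

# stop the node and keep a copy of the current binary
sudo systemctl stop storj02.service
sudo cp /usr/local/bin/storagenode /usr/local/bin/storagenode.bak
# /tmp/storagenode is assumed to be the binary you unpacked for the version you want
sudo cp /tmp/storagenode /usr/local/bin/storagenode
sudo chmod +x /usr/local/bin/storagenode
sudo systemctl start storj02.service
storagenode version   # confirm which version is actually running now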

So it seems to be fixed now. The issue was that the node ran out of memory processing thousands upon thousands of unsent orders.
After manually separating the good ones from the corrupted ones, v1.16.1 is working without errors or restarts.
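In case someone wants to check whether they are in the same situation, a quick way to see how many unsent orders have piled up (the path is an assumption; use your own node's orders folder):

# count and total size of the unsent orders
ls /mnt/storagenode2/orders/unsent | wc -l
du -sh /mnt/storagenode2/orders/unsent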

How can I check the orders? And then sort out the good ones from the bad ones?

You need to locate your orders/unsent folder. Where it is depends on your OS and settings; normally it is next to the 'storage' folder that contains the blobs, i.e. where your config.yaml is.
Then you move all orders out of the unsent folder into a new folder; I believe you have to stop the node first.
Then you move them back in one by one or in batches and start the node up again.
Upon start, the node tries to send the orders and fails if it encounters a corrupted order, which you can then dispose of. Repeat this until you have identified all the corrupted ones.
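A minimal sketch of that procedure on a systemd/binary setup. The unit name, orders path, and batch size are assumptions; adjust them to your node:

# stop the node before touching the orders
sudo systemctl stop storj02.service

ORDERS=/mnt/storagenode2/orders   # assumption: the orders folder next to your config.yaml
mkdir -p "$ORDERS/holding"
mv "$ORDERS/unsent/"* "$ORDERS/holding/"

# move a batch back into unsent and start the node again
ls "$ORDERS/holding" | head -n 100 | xargs -I{} mv "$ORDERS/holding/{}" "$ORDERS/unsent/"
sudo systemctl start storj02.service

# watch the log; a batch that keeps failing to send contains a corrupted order file
sudo journalctl -u storj02.service -f | grep -i order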

OK, thanks. I'm running it on Docker… Do you know how I can accomplish that?

It is the same procedure with Docker.
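A rough Docker version of the same sketch; the container name and host path are assumptions, and the orders folder is the one on the host that you bind-mount into the container:

docker stop -t 300 storagenode2   # give the node time to shut down cleanly
ORDERS=/mnt/storagenode2/orders   # assumption: host path of your orders folder
mkdir -p "$ORDERS/holding"
mv "$ORDERS/unsent/"* "$ORDERS/holding/"

# move a batch back, restart the container, and watch the log
ls "$ORDERS/holding" | head -n 100 | xargs -I{} mv "$ORDERS/holding/{}" "$ORDERS/unsent/"
docker start storagenode2
docker logs -f storagenode2 2>&1 | grep -i order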