2nd node not working after updating to version 1.16.1

Second node on same server no longer working after updating to 1.16.1:

Nothing has been changed. 1st node runs on ports 7778/28967, 2nd node on 7779/28968.

First node is still running without problem.

The 2nd node now runs into an error:
Unrecoverable error {"error": "rpc: dial tcp 127.0.0.1:7779: connect: connection refused", "errorVerbose": "rpc: dial tcp 127.0.0.1:7779: connect: connection refused\n\tstorj.io/common/rpc.TCPConnector.DialContextUnencrypted:107\n\tstorj.io/common/rpc.Dialer.dialTransportUnencrypted:178\n\tstorj.io/common/rpc.Dialer.dialUnencrypted.func1:161\n\tstorj.io/common/rpc/rpcpool.(*Conn).newConn:129\n\tstorj.io/common/rpc/rpcpool.(*Conn).getConn:158\n\tstorj.io/common/rpc/rpcpool.(*Conn).Invoke:194\n\tstorj.io/common/rpc/rpctracing.(*TracingWrapper).Invoke:31\n\tstorj.io/common/pb.(*drpcPieceStoreInspectorClient).Dashboard:1087\n\tmain.(*dashboardClient).dashboard:42\n\tmain.cmdDashboard:70\n\tstorj.io/private/process.cleanup.func1.4:362\n\tstorj.io/private/process.cleanup.func1:380\n\tgithub.com/spf13/cobra.(*Command).execute:842\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:950\n\tgithub.com/spf13/cobra.(*Command).Execute:887\n\tstorj.io/private/process.ExecWithCustomConfig:88\n\tstorj.io/private/process.ExecCustomDebug:70\n\tmain.main:336\n\truntime.main:204"}
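For reference: the "connection refused" comes from the dashboard binary dialing 127.0.0.1:7779, which just means nothing is listening there, i.e. the storagenode process itself is not up. A minimal check, assuming the port numbers from above (adjust to your own setup):

# list listening TCP sockets and filter for the second node's ports
sudo ss -tlnp | grep -E '7779|28968'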

Try basic troubleshooting with the checklist

Make sure you didn't add or remove a space in your command (if it's Docker).

It is not Docker. And as I have said, I did not change anything except replacing the storagenode and dashboard binaries with the new versions.
Node 1 is running. Node 2 always enters a failed state after the following:

storagenode[9952]: /usr/local/go/src/net/fd_unix.go:172 +0x45
storagenode[9952]: net.(*TCPListener).accept(0xc0001e31e0, 0xc00077c758, 0xc0002d2bb0, 0xc00067c470)
storagenode[9952]: /usr/local/go/src/net/tcpsock_posix.go:139 +0x32
storagenode[9952]: net.(*TCPListener).AcceptTCP(0xc0001e31e0, 0xc00077c7a8, 0xaa1b50, 0xc0002d2b80)
storagenode[9952]: /usr/local/go/src/net/tcpsock.go:248 +0x65
storagenode[9952]: storj.io/storj/pkg/server.(*userTimeoutListener).Accept(0xc0000b6088, 0x0, 0x0, 0x0, 0x0)
storagenode[9952]: /go/src/storj.io/storj/pkg/server/listener.go:40 +0x32
storagenode[9952]: storj.io/storj/pkg/listenmux.(*Mux).monitorBase(0xc0001e00e0)
storagenode[9952]: /go/src/storj.io/storj/pkg/listenmux/mux.go:112 +0x65
storagenode[9952]: created by storj.io/storj/pkg/listenmux.(*Mux).Run
storagenode[9952]: /go/src/storj.io/storj/pkg/listenmux/mux.go:88 +0xe5
storagenode[9952]: goroutine 1286 [select]:
storagenode[9952]: storj.io/common/sync2.Sleep(0x107c5c0, 0xc0003d3350, 0x24852b99d, 0xc000648000)
storagenode[9952]: /go/pkg/mod/storj.io/common@v0.0.0-20201014090530-c4af8e54d5c4/sync2/sleep.go:16 +0x115
storagenode[9952]: storj.io/storj/storagenode/nodestats.(*Cache).sleep(0xc0002883f0, 0x107c5c0, 0xc0003d3350, 0xc000ad2ca0, 0xc000ad2cf0)
storagenode[9952]: /go/src/storj.io/storj/storagenode/nodestats/cache.go:241 +0x5d
storagenode[9952]: storj.io/storj/storagenode/nodestats.(*Cache).Run.func2(0x107c5c0, 0xc0003d3350, 0x107c5c0, 0xc0003d3350)
storagenode[9952]: /go/src/storj.io/storj/storagenode/nodestats/cache.go:115 +0x5e
storagenode[9952]: storj.io/common/sync2.(*Cycle).Run(0xc0001889c0, 0x107c2c0, 0xc00034f7c0, 0xc0003d3320, 0x0, 0x0)
storagenode[9952]: /go/pkg/mod/storj.io/common@v0.0.0-20201014090530-c4af8e54d5c4/sync2/cycle.go:92 +0x168
storagenode[9952]: storj.io/common/sync2.(*Cycle).Start.func1(0xc000072aa4, 0xf9a808)
storagenode[9952]: /go/pkg/mod/storj.io/common@v0.0.0-20201014090530-c4af8e54d5c4/sync2/cycle.go:71 +0x45
storagenode[9952]: golang.org/x/sync/errgroup.(*Group).Go.func1(0xc00071e1b0, 0xc0008521e0)
storagenode[9952]: /go/pkg/mod/golang.org/x/sync@v0.0.0-20200625203802-6e8e738ad208/errgroup/errgroup.go:57 +0x59
storagenode[9952]: created by golang.org/x/sync/errgroup.(*Group).Go
storagenode[9952]: /go/pkg/mod/golang.org/x/sync@v0.0.0-20200625203802-6e8e738ad208/errgroup/errgroup.go:54 +0x66
systemd[1]: storj01.service: main process exited, code=exited, status=2/INVALIDARGUMENT
systemd[1]: Unit storj01.service entered failed state.

These are new log lines that I had not seen before v1.16.1.
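If it helps, the full crash output is in the systemd journal. A minimal sketch for pulling it, using the unit name shown in the log above (adjust it if your second node uses a different unit):

# current state of the unit plus the last chunk of its log
sudo systemctl status storj01.service
sudo journalctl -u storj01.service -n 500 --no-pager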

How could this only affect node 2 and not also node 1?

Is it possible to switch back to v1.15.3?

Does anyone have an idea? The node keeps erroring and is basically offline.

Yes, you can go back to 1.15.3; I tested it with my node and saw no issues. But it's probably not recommended, since 1.16.1 does add a new database.

Thanks. I have just tested.
Now with v1.15.3 the 2nd node runs normally as before.

@Alexey, @nerdatwork
So this clearly looks to me like an issue with v1.16.1.

I haven't tested running 2 nodes with the binaries yet, so I can't confirm.

Just a question for the future: how do I go back to a certain version?

IIRC 1.16.1 adds a secret DB, and reverting to a previous version might negatively impact your node. I am not sure what the secret DB does, but I wanted you to know.

Thanks. But what else can I do? 1.16.1 obviously does not work for this node.

The second node is running on the wrong internal port. The docker run command should use -p 7778:28967 and -p 7779:28967.

Neither node is running on Docker; both run from the binaries only.

Don't go backwards in versions… it's really dangerous for your databases.
It might not kill your node this time around, but you really cannot know what kind of damage it could cause.
Though it is probably unlikely to fully kill your node, it might ruin all the databases if you are really unlucky.

Common practice is to never roll back software that manages databases unless the software is certified for, or built with, that capability in mind.

Of course, sometimes one doesn't follow the rules to make stuff work :smiley: but then one had better understand the risks involved.

I wouldn't advise ever rolling back versions, especially if your node isn't new. If you're just testing, who cares, but if you're running this node long term, don't go back once you update. With the binary install you can easily just delete the binary and replace it with a different version.
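For reference, a rough sketch of what replacing the binary looks like on a systemd setup. The unit name, binary path, and the location of the downloaded binary are assumptions; adjust them to your install, and keep in mind the warnings above about rolling back across a database change:

# stop the node and keep a copy of the current binary
sudo systemctl stop storj02.service
sudo cp /usr/local/bin/storagenode /usr/local/bin/storagenode.bak
# /tmp/storagenode is assumed to be the binary you unpacked for the version you want
sudo cp /tmp/storagenode /usr/local/bin/storagenode
sudo chmod +x /usr/local/bin/storagenode
sudo systemctl start storj02.service
storagenode version   # confirm which version is actually running now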

So it seems to be fixed now. The issue was that the node ran out of memory processing thousands upon thousands of unsent orders.
After manually separating the good ones from the corrupted ones, v1.16.1 is working without errors or restarts.
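In case someone wants to check whether they are in the same situation, a quick way to see how many unsent orders have piled up (the path is an assumption; use your own node's orders folder):

# count and total size of the unsent orders
ls /mnt/storagenode2/orders/unsent | wc -l
du -sh /mnt/storagenode2/orders/unsent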

How can I check the orders? And then sort out the good ones from the bad ones?

You need to locate your orders/unsent folder. Where it is depends on your OS and settings; normally it is next to the 'storage' folder that contains the blobs, i.e. where your config.yaml is.
Then you move all orders out of the unsent folder into a new folder; I believe you have to stop the node first.
Then you move them back in one by one or in batches and start the node up again.
Upon start, the node tries to send the orders and fails if it encounters a corrupted order, which you can then dispose of. Repeat this until you have identified all the corrupted ones.
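A minimal sketch of that procedure on a systemd/binary setup. The unit name, orders path, and batch size are assumptions; adjust them to your node:

# stop the node before touching the orders
sudo systemctl stop storj02.service

ORDERS=/mnt/storagenode2/orders   # assumption: the orders folder next to your config.yaml
mkdir -p "$ORDERS/holding"
mv "$ORDERS/unsent/"* "$ORDERS/holding/"

# move a batch back into unsent and start the node again
ls "$ORDERS/holding" | head -n 100 | xargs -I{} mv "$ORDERS/holding/{}" "$ORDERS/unsent/"
sudo systemctl start storj02.service

# watch the log; a batch that keeps failing to send contains a corrupted order file
sudo journalctl -u storj02.service -f | grep -i order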

OK, thanks. I'm running it on Docker… Do you know how I can accomplish that?

It is the same procedure with Docker.
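A rough Docker version of the same sketch; the container name and host path are assumptions, and the orders folder is the one on the host that you bind-mount into the container:

docker stop -t 300 storagenode2   # give the node time to shut down cleanly
ORDERS=/mnt/storagenode2/orders   # assumption: host path of your orders folder
mkdir -p "$ORDERS/holding"
mv "$ORDERS/unsent/"* "$ORDERS/holding/"

# move a batch back, restart the container, and watch the log
ls "$ORDERS/holding" | head -n 100 | xargs -I{} mv "$ORDERS/holding/{}" "$ORDERS/unsent/"
docker start storagenode2
docker logs -f storagenode2 2>&1 | grep -i order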