DEBUG Fatal error: infodb: UNIQUE constraint failed: bandwidth_usage_rollups.interval_start

Some users are reporting that their nodes restart often for seemingly no reason. When debug logging is turned on, the reason becomes visible.

To turn on debug logging, change the log level in config.yaml to debug and restart your node: docker restart -t 300 storagenode

2019-07-19T15:18:09.845Z DEBUG Fatal error: infodb: UNIQUE constraint failed: bandwidth_usage_rollups.interval_start, bandwidth_usage_rollups.satellite_id, bandwidth_usage_rollups.action
storj.io/storj/storagenode/storagenodedb.(*bandwidthdb).Rollup:205
storj.io/storj/internal/sync2.(*Cycle).Run:87
storj.io/storj/storagenode/storagenodedb.(*bandwidthdb).Run:40
storj.io/storj/storagenode.(*Peer).Run.func8:358
golang.org/x/sync/errgroup.(*Group).Go.func1:57

A post was split to a new topic: DEBUG Fatal error: infodb: no such table: bandwidth_usage_rollups

The app tries to write data that seems to be there already, or, as in the other post, perhaps the
bandwidth_usage_rollups table is missing here as well.
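
For context: the constraint named in the error appears to be the composite primary key of the rollups table. A rough sketch of the schema (my approximation, not copied verbatim from the storagenode source):

-- Approximate schema of bandwidth_usage_rollups (sketch, not verbatim from the source).
-- The hourly rollup job inserts one row per (hour, satellite, action); if a row with
-- that key already exists, the INSERT violates this primary key and the node exits
-- with the fatal error shown above.
CREATE TABLE bandwidth_usage_rollups (
    interval_start TIMESTAMP NOT NULL,
    satellite_id   BLOB      NOT NULL,
    action         INTEGER   NOT NULL,
    amount         BIGINT    NOT NULL,
    PRIMARY KEY (interval_start, satellite_id, action)
);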

  1. Stop the storagenode: docker stop storagenode
  2. Run the sqlite3 container (replace /mnt/storagenode with your path to the data):
docker run --rm -it -v /mnt/storagenode/storage/info.db:/data/info.db sstc/sqlite3 sqlite3 ./info.db

And execute this SQL script:

SELECT datetime(strftime('%Y-%m-%dT%H:00:00', created_at)) created, bu.satellite_id sat_id, bu.action act, SUM(bu.amount) amt
    FROM bandwidth_usage bu, bandwidth_usage_rollups bur
    WHERE datetime(created_at) < datetime(CURRENT_TIMESTAMP)
        AND bur.interval_start = created
        AND bur.satellite_id = sat_id
        AND bur.action = act
    GROUP BY created, sat_id, act;
  3. Exit from the sqlite3: .exit
  4. Post the result (if it is very long, the summary variant sketched after these steps may be easier to share).
  5. Also, it is better to have the exact date when your storagenode was updated.
    For the watchtower: docker logs watchtower 2>&1 | grep -A 3 -B 3 Found
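
If the full listing is too long to post, a shorter summary can be used instead. This is my own variant (a sketch using the same join conditions as the query above), which only counts how many hours per satellite already have a rollup row:

-- Sketch: count of already-rolled-up hours per satellite, same join logic as above.
SELECT bu.satellite_id sat_id,
       COUNT(DISTINCT datetime(strftime('%Y-%m-%dT%H:00:00', bu.created_at))) conflicting_hours
    FROM bandwidth_usage bu, bandwidth_usage_rollups bur
    WHERE datetime(bu.created_at) < datetime(CURRENT_TIMESTAMP)
        AND bur.interval_start = datetime(strftime('%Y-%m-%dT%H:00:00', bu.created_at))
        AND bur.satellite_id = bu.satellite_id
        AND bur.action = bu.action
    GROUP BY sat_id;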

The temporary solution is to delete the duplicated rows. You should take a date from the result of the query above and insert it into the query below.
Replace the path with yours.

  1. docker stop -t 300 storagenode
  2. docker run --rm -it -v /disk2/storj/storage/info.db:/data/info.db sstc/sqlite3 sqlite3 ./info.db
delete from bandwidth_usage_rollups where interval_start = 'put-the-date-with-time-here';

For example: delete from bandwidth_usage_rollups where interval_start = '2019-07-24 15:00:00'; replace the date with your result! (An optional sanity check is sketched after these steps.)
  3. .exit
  4. docker start storagenode
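
Optionally, before the .exit in step 3, a quick check can confirm the conflicting row is gone. This is my own addition, a sketch that assumes the same date placeholder as above:

-- Optional sanity check (sketch): should return 0 after the delete.
select count(*) from bandwidth_usage_rollups where interval_start = 'put-the-date-with-time-here';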


I’m seeing an issue similar to the ones reported for v0.15.2; however, I’m still seeing it with v0.15.3 as well.

The container comes up for only around 30 seconds, during which time the dashboard actually appears pretty normal, with the Last Contact time updating.

The grand total of the logs produced is below.

2019-07-19T02:46:56.066Z	INFO	Configuration loaded from: /app/config/config.yaml
2019-07-19T02:46:56.099Z	INFO	Operator email: REDACTED
2019-07-19T02:46:56.099Z	INFO	Operator wallet: REDACTED
2019-07-19T02:46:57.551Z	INFO	running on version v0.15.3
2019-07-19T02:46:57.554Z	INFO	db.migration	Latest Version	{"version": 13}
2019-07-19T02:46:57.556Z	INFO	Node 12txo9zvG5XkqYp3T7QgR1DPcf31zEWgM1R4yvMd3e99GRp17jh started
2019-07-19T02:46:57.556Z	INFO	Public server started on [::]:28967
2019-07-19T02:46:57.556Z	INFO	Private server started on 127.0.0.1:7778
2019-07-19T02:46:57.557Z	INFO	vouchers	Checking vouchers
2019-07-19T02:46:57.565Z	INFO	vouchers	Requesting voucher	{"satellite": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2019-07-19T02:46:57.565Z	INFO	vouchers	Requesting voucher	{"satellite": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2019-07-19T02:46:57.565Z	INFO	vouchers	Requesting voucher	{"satellite": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW"}
2019-07-19T02:46:57.754Z	INFO	running on version v0.15.3
2019-07-19T02:47:00.806Z	INFO	piecestore:monitor	Remaining Bandwidth	{"bytes": 4490899859456}
2019-07-19T02:47:11.202Z	ERROR	vouchers	Error requesting voucher	{"satellite": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "error": "voucher: unable to connect to the satellite: context canceled", "errorVerbose": "voucher: unable to connect to the satellite: context canceled\n\tstorj.io/storj/storagenode/vouchers.(*Service).request:132\n\tstorj.io/storj/storagenode/vouchers.(*Service).Request:116\n\tstorj.io/storj/storagenode/vouchers.(*Service).RunOnce.func1:103\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2019-07-19T02:47:11.202Z	ERROR	vouchers	Error requesting voucher	{"satellite": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "error": "voucher: unable to find satellite on the network: context canceled", "errorVerbose": "voucher: unable to find satellite on the network: context canceled\n\tstorj.io/storj/storagenode/vouchers.(*Service).request:127\n\tstorj.io/storj/storagenode/vouchers.(*Service).Request:116\n\tstorj.io/storj/storagenode/vouchers.(*Service).RunOnce.func1:103\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2019-07-19T02:47:11.203Z	ERROR	vouchers	Error requesting voucher	{"satellite": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "error": "voucher: unable to find satellite on the network: context canceled", "errorVerbose": "voucher: unable to find satellite on the network: context canceled\n\tstorj.io/storj/storagenode/vouchers.(*Service).request:127\n\tstorj.io/storj/storagenode/vouchers.(*Service).Request:116\n\tstorj.io/storj/storagenode/vouchers.(*Service).RunOnce.func1:103\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
ERROR: 2019/07/19 02:47:11 pickfirstBalancer: failed to NewSubConn: rpc error: code = Canceled desc = grpc: the client connection is closing
ERROR: 2019/07/19 02:47:11 pickfirstBalancer: failed to NewSubConn: rpc error: code = Canceled desc = grpc: the client connection is closing
ERROR: 2019/07/19 02:47:11 pickfirstBalancer: failed to NewSubConn: rpc error: code = Canceled desc = grpc: the client connection is closing
ERROR: 2019/07/19 02:47:11 pickfirstBalancer: failed to NewSubConn: rpc error: code = Canceled desc = grpc: the client connection is closing
ERROR: 2019/07/19 02:47:11 pickfirstBalancer: failed to NewSubConn: rpc error: code = Canceled desc = grpc: the client connection is closing
ERROR: 2019/07/19 02:47:11 pickfirstBalancer: failed to NewSubConn: rpc error: code = Canceled desc = grpc: the client connection is closing

An extra curiosity: on the first run after a reboot, the pattern is slightly different. The node will come up for nearly 5 minutes and report only the following.

2019-07-19T02:54:16.797Z	INFO	Configuration loaded from: /app/config/config.yaml
2019-07-19T02:54:16.829Z	INFO	Operator email: REDACTED
2019-07-19T02:54:16.829Z	INFO	Operator wallet: REDACTED
2019-07-19T02:54:19.714Z	INFO	running on version v0.15.3
2019-07-19T02:54:19.915Z	INFO	db.migration	Latest Version	{"version": 13}
2019-07-19T02:54:19.916Z	INFO	Node 12txo9zvG5XkqYp3T7QgR1DPcf31zEWgM1R4yvMd3e99GRp17jh started
2019-07-19T02:54:19.916Z	INFO	Public server started on [::]:28967
2019-07-19T02:54:19.916Z	INFO	Private server started on 127.0.0.1:7778
2019-07-19T02:54:19.922Z	INFO	vouchers	Checking vouchers
2019-07-19T02:54:19.998Z	INFO	vouchers	Requesting voucher	{"satellite": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2019-07-19T02:54:19.998Z	INFO	vouchers	Requesting voucher	{"satellite": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2019-07-19T02:54:19.998Z	INFO	vouchers	Requesting voucher	{"satellite": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW"}
2019-07-19T02:54:20.109Z	INFO	running on version v0.15.3
2019-07-19T02:54:36.999Z	INFO	vouchers	Voucher request denied. Vetting process not yet complete
2019-07-19T02:54:38.625Z	INFO	vouchers	Voucher request denied. Vetting process not yet complete
2019-07-19T02:54:41.689Z	INFO	vouchers	Voucher request denied. Vetting process not yet complete
ERROR: 2019/07/19 02:59:58 pickfirstBalancer: failed to NewSubConn: rpc error: code = Canceled desc = grpc: the client connection is closing
ERROR: 2019/07/19 02:59:58 pickfirstBalancer: failed to NewSubConn: rpc error: code = Canceled desc = grpc: the client connection is closing
ERROR: 2019/07/19 02:59:58 pickfirstBalancer: failed to NewSubConn: rpc error: code = Canceled desc = grpc: the client connection is closing
ERROR: 2019/07/19 02:59:58 pickfirstBalancer: failed to NewSubConn: rpc error: code = Canceled desc = grpc: the client connection is closing
ERROR: 2019/07/19 02:59:58 pickfirstBalancer: failed to NewSubConn: rpc error: code = Canceled desc = grpc: the client connection is closing
ERROR: 2019/07/19 02:59:58 pickfirstBalancer: failed to NewSubConn: rpc error: code = Canceled desc = grpc: the client connection is closing
ERROR: 2019/07/19 02:59:58 pickfirstBalancer: failed to NewSubConn: rpc error: code = Canceled desc = grpc: the client connection is closing

It then falls back into the pattern in the first post, flapping every 30 seconds.

I had the same issue… What worked for me was to stop and remove all storj containers (including watchtower), then remove all docker images for storagenode and watchtower and re-download the docker images.
After that it went back to normal and 0.15.3 started working fine.

@qf3l3k Thanks!

@billstanden please, can you try what @qf3l3k has suggested?

Same behaviour after a full docker stop, prune and re-download, unfortunately.

Hello @billstanden,
Welcome to the forum!

Have you removed the container?

Hi Alexey, yes I have.

Both an individual deletion and rebuild, and a complete docker system prune --all as well. The behaviour remains the same.

What’s useful to you for debugging?

Please, edit your config.yaml to enable the debug level of logging: replace info with debug, save the configuration file and restart the container:
docker restart -t 300 storagenode

Then search for the FATAL errors: docker logs storagenode 2>&1 | grep -i FATAL
For PowerShell: docker logs storagenode 2>&1 | sls fatal

Cheers Alexey.

Full debug log from a single run below.

2019-07-21T02:39:55.288Z	INFO	Configuration loaded from: /app/config/config.yaml
2019-07-21T02:39:55.289Z	DEBUG	debug server listening on 127.0.0.1:48372
2019-07-21T02:39:55.320Z	INFO	Operator email: REDACTED
2019-07-21T02:39:55.320Z	INFO	Operator wallet: REDACTED
2019-07-21T02:39:55.471Z	DEBUG	Binary Version: v0.15.3 with CommitHash 6217d9690fd7363d1ae35eefc12135ed6286a89f, built at 2019-07-17 18:42:01 +0000 UTC as Release true
2019-07-21T02:39:59.992Z	DEBUG	kademlia:endpoint	Successfully connected with 86.121.41.198:28967
2019-07-21T02:40:01.147Z	INFO	piecestore:monitor	Remaining Bandwidth	{"bytes": 4490899859456}
2019-07-21T02:40:07.597Z	DEBUG	kademlia	connecting to node failed	{"target": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "dial-node": "12MTRXmdASvUJJBWy2ELRSvRevDLxfP3rYaKgad7g33D5bhjkm", "dial-address": "codebits-tplink.ddns.net:28967", "error": "transport error: connection error: desc = \"transport: error while dialing: dial tcp: lookup codebits-tplink.ddns.net on 139.130.4.4:53: no such host\"", "errorVerbose": "transport error: connection error: desc = \"transport: error while dialing: dial tcp: lookup codebits-tplink.ddns.net on 139.130.4.4:53: no such host\"\n\tstorj.io/storj/pkg/transport.(*Transport).DialNode:108\n\tstorj.io/storj/pkg/kademlia/kademliaclient.(*Dialer).dialNode:175\n\tstorj.io/storj/pkg/kademlia/kademliaclient.(*Dialer).Lookup:71\n\tstorj.io/storj/pkg/kademlia.(*peerDiscovery).Run.func1:88"}
2019-07-21T02:40:10.701Z	DEBUG	kademlia	connecting to node failed	{"target": "12txo9zvG5XkqYp3T7QgR1DPcf31zEWgM1R4yvMd3e99GRp17jh", "dial-node": "12thPucpDYkExmzvJ4i8S2XMff2ptowhu5MbhCTpDoKsu1ivMhJ", "dial-address": "storj.minhng99.cloud:28967", "error": "transport error: connection error: desc = \"transport: error while dialing: dial tcp 163.172.143.1:28967: connect: connection refused\"", "errorVerbose": "transport error: connection error: desc = \"transport: error while dialing: dial tcp 163.172.143.1:28967: connect: connection refused\"\n\tstorj.io/storj/pkg/transport.(*Transport).DialNode:108\n\tstorj.io/storj/pkg/kademlia/kademliaclient.(*Dialer).dialNode:175\n\tstorj.io/storj/pkg/kademlia/kademliaclient.(*Dialer).Lookup:71\n\tstorj.io/storj/pkg/kademlia.(*peerDiscovery).Run.func1:88"}
2019-07-21T02:40:11.117Z	ERROR	vouchers	Error requesting voucher	{"satellite": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "error": "voucher: unable to connect to the satellite: context canceled", "errorVerbose": "voucher: unable to connect to the satellite: context canceled\n\tstorj.io/storj/storagenode/vouchers.(*Service).request:132\n\tstorj.io/storj/storagenode/vouchers.(*Service).Request:116\n\tstorj.io/storj/storagenode/vouchers.(*Service).RunOnce.func1:103\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2019-07-21T02:40:11.118Z	ERROR	vouchers	Error requesting voucher	{"satellite": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "error": "voucher: unable to find satellite on the network: context canceled", "errorVerbose": "voucher: unable to find satellite on the network: context canceled\n\tstorj.io/storj/storagenode/vouchers.(*Service).request:127\n\tstorj.io/storj/storagenode/vouchers.(*Service).Request:116\n\tstorj.io/storj/storagenode/vouchers.(*Service).RunOnce.func1:103\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2019-07-21T02:40:11.118Z	ERROR	vouchers	Error requesting voucher	{"satellite": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "error": "voucher: unable to find satellite on the network: context canceled", "errorVerbose": "voucher: unable to find satellite on the network: context canceled\n\tstorj.io/storj/storagenode/vouchers.(*Service).request:127\n\tstorj.io/storj/storagenode/vouchers.(*Service).Request:116\n\tstorj.io/storj/storagenode/vouchers.(*Service).RunOnce.func1:103\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
ERROR: 2019/07/21 02:40:11 pickfirstBalancer: failed to NewSubConn: rpc error: code = Canceled desc = grpc: the client connection is closing
ERROR: 2019/07/21 02:40:11 pickfirstBalancer: failed to NewSubConn: rpc error: code = Canceled desc = grpc: the client connection is closing
ERROR: 2019/07/21 02:40:11 pickfirstBalancer: failed to NewSubConn: rpc error: code = Canceled desc = grpc: the client connection is closing
ERROR: 2019/07/21 02:40:11 pickfirstBalancer: failed to NewSubConn: rpc error: code = Canceled desc = grpc: the client connection is closing
2019-07-21T02:40:12.120Z	DEBUG	Fatal error: infodb: UNIQUE constraint failed: bandwidth_usage_rollups.interval_start, bandwidth_usage_rollups.satellite_id, bandwidth_usage_rollups.action
	storj.io/storj/storagenode/storagenodedb.(*bandwidthdb).Rollup:205
	storj.io/storj/internal/sync2.(*Cycle).Run:87
	storj.io/storj/storagenode/storagenodedb.(*bandwidthdb).Run:40
	storj.io/storj/storagenode.(*Peer).Run.func8:358
	golang.org/x/sync/errgroup.(*Group).Go.func1:57

Found the other thread on this (DEBUG Fatal error: infodb: UNIQUE constraint failed: bandwidth_usage_rollups.interval_start) and some discussion in the Community Rocket.Chat about it.

Now, the Rocket.Chat discussion suggested I should probably only get a handful of results; I got over 10,000 stretching back to April.

Pastebin: https://pastebin.com/y4GEfNrJ

Then a workaround for this error could be to delete the duplicated row.
I need the result of the SQL query from this topic: DEBUG Fatal error: infodb: UNIQUE constraint failed: bandwidth_usage_rollups.interval_start

I already linked that output, Alexey; that’s what the pastebin link is.

https://pastebin.com/y4GEfNrJ

For you, the fix is (replace the path to the data with yours):

  1. docker stop -t 300 storagenode
  2. docker run --rm -it -v /mnt/storj/storage/info.db:/data/info.db sstc/sqlite3 sqlite3 ./info.db
delete from bandwidth_usage_rollups where interval_start = '2019-04-08 13:00:00';
  3. .exit
  4. docker start storagenode

No change in result, Alexey.

Aren’t I going to need to delete every timestamp in that long list?

Something like:

DELETE FROM bandwidth_usage_rollups WHERE interval_start IN (
    SELECT created FROM (
        SELECT datetime(strftime('%Y-%m-%dT%H:00:00', created_at)) created, bu.satellite_id sat_id, bu.action act, SUM(bu.amount) amt
            FROM bandwidth_usage bu, bandwidth_usage_rollups bur
            WHERE datetime(created_at) < datetime(CURRENT_TIMESTAMP)
                AND bur.interval_start = created
                AND bur.satellite_id = sat_id
                AND bur.action = act
            GROUP BY created, sat_id, act
    )
);

No!
Just this line. Please don't delete the others.
Make sure that the node is stopped.

Noted. That line alone wasn’t sufficient though. Exact same error after deleting it.