Node database corrupted on update

s.thompson · March 31, 2020, 8:51pm

I’ve been running a node on Linux since October, and have had no problems until today. The dashboard wouldn’t load, so I had a poke at docker. It appears that when attempting to load the container after an update there are missing records from the database.

I find the prospect of it being a result of bit-rot vanishlingly unlikely - the storage location is my fileserver - all the storage is redundant and regularly scrubbed. All the data is local (no NFS etc).

An integrity check of all the sqlite databases seems to show they’re ok - it seems unlikely that random corruption would have resulted in a database which is valid but missing records.

I’ve pasted the log from restart below. Anyone got any ideas? I’m reticent to ditch the whole thing and make a new node without some assurance that things won’t go south again in 4 months time (and my node was getting enough traffic to cover the cost of my 1000/1000 internet connection here in the sunny UK).

Many thanks

2020-03-31T20:43:13.712Z	INFO	storagenode/peer.go:452	Node <NODE_ID> started
2020-03-31T20:43:13.712Z	INFO	storagenode/peer.go:453	Public server started on [::]:28967
2020-03-31T20:43:13.712Z	INFO	storagenode/peer.go:454	Private server started on 127.0.0.1:7778
2020-03-31T20:43:13.713Z	ERROR	piecestore:cacheUpdate	pieces/cache.go:49	error getting current space used calculation: 	{"error": "context canceled"}
storj.io/storj/storagenode/pieces.(*CacheService).Run
	/go/src/storj.io/storj/storagenode/pieces/cache.go:49
storj.io/storj/storagenode.(*Peer).Run.func7
	/go/src/storj.io/storj/storagenode/peer.go:439
golang.org/x/sync/errgroup.(*Group).Go.func1
	/go/pkg/mod/golang.org/x/sync@v0.0.0-20190423024810-112230192c58/errgroup/errgroup.go:57
2020-03-31T20:43:13.713Z	ERROR	piecestore:cacheUpdate	pieces/cache.go:60	error during init space usage db: 	{"error": "piece space used error: no such table: piece_space_used", "errorVerbose": "piece space used error: no such table: piece_space_used\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceSpaceUsedDB).Init:49\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:59\n\tstorj.io/storj/storagenode.(*Peer).Run.func7:439\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
storj.io/storj/storagenode/pieces.(*CacheService).Run
	/go/src/storj.io/storj/storagenode/pieces/cache.go:60
storj.io/storj/storagenode.(*Peer).Run.func7
	/go/src/storj.io/storj/storagenode/peer.go:439
golang.org/x/sync/errgroup.(*Group).Go.func1
	/go/pkg/mod/golang.org/x/sync@v0.0.0-20190423024810-112230192c58/errgroup/errgroup.go:57
2020-03-31T20:43:13.713Z	ERROR	piecestore:cacheUpdate	pieces/cache.go:69	error persisting cache totals to the database: 	{"error": "piece space used error: context canceled", "errorVerbose": "piece space used error: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceSpaceUsedDB).UpdateTotal:121\n\tstorj.io/storj/storagenode/pieces.(*CacheService).PersistCacheTotals:82\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:68\n\tstorj.io/storj/internal/sync2.(*Cycle).Run:87\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:63\n\tstorj.io/storj/storagenode.(*Peer).Run.func7:439\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
storj.io/storj/storagenode/pieces.(*CacheService).Run.func1
	/go/src/storj.io/storj/storagenode/pieces/cache.go:69
storj.io/storj/internal/sync2.(*Cycle).Run
	/go/src/storj.io/storj/internal/sync2/cycle.go:87
storj.io/storj/storagenode/pieces.(*CacheService).Run
	/go/src/storj.io/storj/storagenode/pieces/cache.go:63
storj.io/storj/storagenode.(*Peer).Run.func7
	/go/src/storj.io/storj/storagenode/peer.go:439
golang.org/x/sync/errgroup.(*Group).Go.func1
	/go/pkg/mod/golang.org/x/sync@v0.0.0-20190423024810-112230192c58/errgroup/errgroup.go:57
2020-03-31T20:43:13.713Z	ERROR	orders	orders/service.go:156	listing orders	{"error": "ordersdb error: no such table: unsent_order", "errorVerbose": "ordersdb error: no such table: unsent_order\n\tstorj.io/storj/storagenode/storagenodedb.(*ordersDB).ListUnsentBySatellite:140\n\tstorj.io/storj/storagenode/orders.(*Service).sendOrders:153\n\tstorj.io/storj/internal/sync2.(*Cycle).Run:87\n\tstorj.io/storj/internal/sync2.(*Cycle).Start.func1:68\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
storj.io/storj/storagenode/orders.(*Service).sendOrders
	/go/src/storj.io/storj/storagenode/orders/service.go:156
storj.io/storj/internal/sync2.(*Cycle).Run
	/go/src/storj.io/storj/internal/sync2/cycle.go:87
storj.io/storj/internal/sync2.(*Cycle).Start.func1
	/go/src/storj.io/storj/internal/sync2/cycle.go:68
golang.org/x/sync/errgroup.(*Group).Go.func1
	/go/pkg/mod/golang.org/x/sync@v0.0.0-20190423024810-112230192c58/errgroup/errgroup.go:57
2020-03-31T20:43:13.715Z	FATAL	process/exec_conf.go:288	Unrecoverable error	{"error": "bandwidthdb error: no such table: bandwidth_usage_rollups", "errorVerbose": "bandwidthdb error: no such table: bandwidth_usage_rollups\n\tstorj.io/storj/storagenode/storagenodedb.(*bandwidthDB).Summary:112\n\tstorj.io/storj/storagenode/storagenodedb.(*bandwidthDB).MonthSummary:79\n\tstorj.io/storj/storagenode/monitor.(*Service).usedBandwidth:174\n\tstorj.io/storj/storagenode/monitor.(*Service).Run:83\n\tstorj.io/storj/storagenode.(*Peer).Run.func6:436\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
storj.io/storj/pkg/process.cleanup.func1
	/go/src/storj.io/storj/pkg/process/exec_conf.go:288
github.com/spf13/cobra.(*Command).execute
	/go/pkg/mod/github.com/spf13/cobra@v0.0.3/command.go:762
github.com/spf13/cobra.(*Command).ExecuteC
	/go/pkg/mod/github.com/spf13/cobra@v0.0.3/command.go:852
github.com/spf13/cobra.(*Command).Execute
	/go/pkg/mod/github.com/spf13/cobra@v0.0.3/command.go:800
storj.io/storj/pkg/process.Exec
	/go/src/storj.io/storj/pkg/process/exec_conf.go:73
main.main
	/go/src/storj.io/storj/cmd/storagenode/main.go:296
runtime.main
	/usr/local/go/src/runtime/proc.go:203

deathlessdd · March 31, 2020, 9:30pm

There’s not enough information to say what could cause this, Which OS are you running are you running it on unraid, Other VMs, What is Your docker start command look like? How is everything connected together hard drives network hard drives?

s.thompson · April 1, 2020, 6:48am

Debian 10, several other (completely unrelated) docker containers. I’d said all data is local - bunch of SATA hard drives. Never heard of unraid, using ZFS. Fileserver has been stable in this config with zero downtime since Debian 10 came out, and has ~10TB of other data on it which is regularly used and has never suffered corruption. Had a power outage in my area a couple of weeks ago, but I struggle to blame that…

Start command:

docker run -d --restart unless-stopped --stop-timeout 300 -p 28967:28967 -p 14002:14002 -e WALLET="0x487DAd8d791c04df805F7E22034bEC313f44e879" -e EMAIL="***@*********.com" -e ADDRESS="*******.com:28967" -e STORAGE="4.1TB" --mount type=bind,source=/storj,destination=/app/config --mount type=bind,source=/home/sam/.local/share/storj/identity/storagenode,destination=/app/identity --name storagenode storjlabs/storagenode:beta

Is this enough information?

Odmin · April 1, 2020, 7:02am

Does it happen after automatic update via watchtower? or what triggered this issue?

Alexey · April 1, 2020, 7:03am

Hello @s.thompson ,
Welcome to the forum!

Yes, it’s enough information. Could you please stop the storagenode

docker stop -t 300 storagenode 
docker rm storagenode

Then check your filesystem, it’s too slow to load databases, some of them are heavily corrupted (missed tables), or unreadable at all (context canceled).
When your check will be completed, fix all errors on filesystem before processing.
Then try to run the container with updated docker run command: https://documentation.storj.io/setup/cli/storage-node#running-the-storage-node

Then post a fresh errors from log.

s.thompson · April 1, 2020, 2:17pm

I’m not sure I understand what’s happened.

As you suggested, I stopped and removed the docker container, ran a full filesystem repair (which found no errors), and then spun up a new container.

The new container started up normally (which is strange, given that I’d tried spinning up a new container before posting, and this hadn’t worked). It then spat out several hundred near-identical errors in the first 5 minutes (single instance posted below), but now appears to be running normally, and has been seeing sustained ingress at ~30mbps for the last half hour or so.

2020-04-01T13:53:40.857Z	ERROR	piecestore	failed to add order	{"error": "ordersdb error: database is locked", "errorVerbose": "ordersdb error: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(*ordersDB).Enqueue:53\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).saveOrder:714\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).doUpload:443\n\tstorj.io/storj/storagenode/piecestore.(*drpcEndpoint).Upload:215\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:988\n\tstorj.io/drpc/drpcserver.(*Server).doHandle:199\n\tstorj.io/drpc/drpcserver.(*Server).HandleRPC:173\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:124\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:161\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}

There may be some variability in the speed of the filesystem, as it is hosted on the same zfs vdev as some VMs - when one of the VMs starts doing a lot of I/O it will increase the latency a little but it surprises me if this is enough to cause storj to throw errors? Even when the filesystem is under load I can push random writes at sustained speeds of ~300MB/s onto it.

tl;dr it looks like it’s working, but I’d really like to understand what actually went wrong…

Alexey · April 1, 2020, 8:20pm

This is enough to lose the race more often, so, consider to do not share the volume with high i/o applications/VMs/OSes.

BrightSilence · April 1, 2020, 9:09pm

You may also want to consider adding SSD’s for L2ARC and SLOG. This would likely relieve a lot of the issues you’re seeing. However, I think this is only really worth it if it also benefits your other applications as the cost is relatively high and in the case of SLOG you will need at least 2 SSDs mirrored to ensure no data loss.

s.thompson · April 2, 2020, 9:00am

I’m aware that I’ll see an increased miss rate when I have additional load on the filesystem, but really storj is a means of using superfluous storage capacity I already have - I’m unwilling to spend additional moolah on dedicated hardware.

I have an SSD-based L2ARC already, and did experiment with a SLOG but found the performance benefit minimal as most of my data operations (other than storj) are reads of static datasets - so I canned it to use the spare SSD capacity elsewhere.

BrightSilence · April 2, 2020, 7:55pm

That’s perfectly understandible. You didn’t specify whether you already have 2 or more SSD’s for L2ARC. If you do I think it would be worth it to say sacrifice 10-20% of the L2ARC size for SLOG. You don’t need nearly as much as for L2ARC in most cases. I agree that it’s probably not worth spending more money on.

s.thompson · April 2, 2020, 8:42pm

In my experiments, I gave over about 40% of the L2ARC for a mirrored SLOG, and didn’t see an appreciable difference in write performance. Given that most of my workload is read operations, I allocated the capacity back to L2ARC - I do indeed have 2 SSDs for L2ARC.

BrightSilence · April 2, 2020, 9:14pm

It’ll help mostly with lots of small random writes. Databases can benefit a lot from this. And so can your storagenode. I doubt you’ll need 40% though, this isn’t about a lot of data, just about writing it efficiently to the array.