Badger cache: are we ready?

I have already turned on the badger cache, but it lives on the same disk where the node is running. In the meantime, I found a spare SSD at home. If I stop the node, copy the contents of the cache to a folder on the SSD, and point the docker parameter there, will it continue from where it left off? Or should I start over with the new mount point? Can multiple node caches go on this SSD in separate folders? Thanks

I have 108 Windows GUI nodes, no docker

1 Like

As long as you mount the new folder correctly, it should resume just fine with the already generated cache.

This should also work fine. You cannot point multiple nodes to the same cache, but multiple caches work fine on the same drive (at least I have not encountered any issues yet).
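For a docker node, the move described above can be sketched roughly as follows (the paths `/mnt/storagenode` and `/mnt/ssd/node1` and the container name `storagenode` are placeholders for your own setup):

```shell
# Stop the node so the cache is not being written to during the copy
docker stop -t 300 storagenode

# Copy the existing cache into a per-node folder on the SSD
mkdir -p /mnt/ssd/node1/filestatcache
cp -a /mnt/storagenode/storage/filestatcache/. /mnt/ssd/node1/filestatcache/

# Re-create the container with an extra bind mount so the node finds
# the cache at its usual in-container path
docker rm storagenode
docker run -d --name storagenode \
  --mount type=bind,source=/mnt/ssd/node1/filestatcache,destination=/app/config/storage/filestatcache \
  ...   # all other parameters stay exactly as before
```

A second node would get its own folder (e.g. `/mnt/ssd/node2/filestatcache`) and its own bind mount; the cache folders must not be shared between nodes.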

1 Like

Thanks for the reply, I’ll try to set it up

Nice, I will test it, too.

Some questions:
→ If the cache is on an SSD and the SSD dies, will the node survive without the cache?
→ Are there any estimates of the cache size in relation to the node size?

Thanks in advance :slight_smile:

1 Like

@elek I have now hit my first problem with the badger cache; it looks like it broke during a Windows update/restart. But that is not the main problem. After it broke, the node would no longer start and did not write any log entries indicating that it was broken. After deleting the cache folder, the node started and is working again.

You can set your log to debug level to see which messages are logged.
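For reference, the log level is controlled by the `log.level` option in the node's `config.yaml` (restart the node after changing it):

```yaml
# config.yaml
log.level: debug
```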

Since I knew what the problem was, I fixed it right away; the node should be up as much as possible.
Next time I will try that.

2 Likes

Yes, it will. But you would likely need to delete what's left of the cache and restart the node to fill it again.

Does anybody know what to expect from the cache db in a low memory environment?

I have it enabled on around 80 nodes now from about 2 weeks ago, so far everything is running fine and the cache dir is growing nicely. For a large (10TB) node the size seems to end up being around 2GB.

Some of these nodes have only 1GB of RAM though. Since the DB is mostly random access, will I gain any performance even if most of the DB is not in RAM?

I have the option of placing the cache dir on an SSD. I presume in docker using

--mount type=bind,source="/mnt/ssd/filestatcache",destination=/app/config/storage/filestatcache

would be correct?

2-3 days after launching, my nodes started to crash.
With debugging turned on, there is either nothing in the log, or at most this:

2024-08-15T20:19:24+03:00 DEBUG db.filestatcache First key="{-\xe9\xd7,.\x93_\x19\x18\xc0Xʯ\x8e\xd0\x0f\x05\x81c\x90\bps\x17\xff\x1b\xd0\x00\x00\x00\x005W-Q\xd9_\xb0\x11I\x96\xf4\xfb\xd5%\x9cyX\xcf\x19\xc2\xe7\xe1̄or\xa8{n\xb1\x1c*\xff\xff\xff\xff\xfeJW\x19"

#2

2024-08-15T20:22:30+03:00	DEBUG	db.filestatcache	First key="\xa2\x8bO\x04\xe1\v\xae\x85\xd6\x7fLl\xb8+\xf8\xd4\xc0\xf0\xf4z\x8e\xa7&'RM\xebn\xc0\x00\x00\x00M\xab\xaa\xb2\x05\xad\x9a\xbb:\xf7'\xb9\xbf\x95\xa7L'Q\xbf\xaf9\xe5\xec\x1a\xeb\xa3\xc0\xc1\x18XB\xf9\xff\xff\xff\xff\xfe\xb5\x00\xf7"

2024-08-15T20:22:34+03:00	DEBUG	db.filestatcache	51 tables out of 187 opened in 3.017s

#3

2024-08-15T17:14:58+03:00	DEBUG	db.filestatcache	First key="\xa2\x8bO\x04\xe1\v\xae\x85\xd6\x7fLl\xb8+\xf8\xd4\xc0\xf0\xf4z\x8e\xa7&'RM\xebn\xc0\x00\x00\x00;\x19\xac\xeb[\x84\xad\xc1\b\xb6\xc3Ô\x10\xca\xdcV\x18\xb2\xe2\a2j\xb9\x1b\xda<\x16F\x16\xb7\xf6\xff\xff\xff\xff\xfe\xea\x1b\xbc"

2024-08-15T17:14:59+03:00	ERROR	db.filestatcache	Received err: Opening table: "D:\\filestatcache\\000963.sst" error: failed to initialize table error: failed to read index. error: failed to verify checksum for table: D:\filestatcache\000963.sst error: actual: 2662894890, expected: 431282182 error: checksum mismatch
github.com/dgraph-io/badger/v4/y.init
	/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.2.0/y/checksum.go:29
runtime.doInit
	/usr/local/go/src/runtime/proc.go:6527
runtime.doInit
	/usr/local/go/src/runtime/proc.go:6504
runtime.doInit
	/usr/local/go/src/runtime/proc.go:6504
runtime.doInit
	/usr/local/go/src/runtime/proc.go:6504
runtime.doInit
	/usr/local/go/src/runtime/proc.go:6504
runtime.doInit
	/usr/local/go/src/runtime/proc.go:6504
runtime.main
	/usr/local/go/src/runtime/proc.go:233
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1598. Cleaning up...
2024-08-15T17:14:59+03:00	ERROR	failure during run	{"error": "Error opening database on storagenode: Opening table: \"D:\\\\filestatcache\\\\000963.sst\" error: failed to initialize table error: failed to read index. error: failed to verify checksum for table: D:\\filestatcache\\000963.sst error: actual: 2662894890, expected: 431282182 error: checksum mismatch\ngithub.com/dgraph-io/badger/v4/y.init\n\t/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.2.0/y/checksum.go:29\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6527\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6504\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6504\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6504\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6504\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6504\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:233\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598\n\tstorj.io/storj/storagenode/storagenodedb.cachedBlobstore:231\n\tstorj.io/storj/storagenode/storagenodedb.OpenExisting:250\n\tmain.cmdRun:67\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tstorj.io/common/process.ExecWithCustomConfigAndLogger:77\n\tstorj.io/common/process.ExecWithCustomConfig:72\n\tstorj.io/common/process.Exec:62\n\tmain.(*service).Execute.func1:107\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "errorVerbose": "Error opening database on storagenode: Opening table: \"D:\\\\filestatcache\\\\000963.sst\" error: failed to initialize table error: failed to read index. 
error: failed to verify checksum for table: D:\\filestatcache\\000963.sst error: actual: 2662894890, expected: 431282182 error: checksum mismatch\ngithub.com/dgraph-io/badger/v4/y.init\n\t/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.2.0/y/checksum.go:29\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6527\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6504\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6504\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6504\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6504\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6504\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:233\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598\n\tstorj.io/storj/storagenode/storagenodedb.cachedBlobstore:231\n\tstorj.io/storj/storagenode/storagenodedb.OpenExisting:250\n\tmain.cmdRun:67\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tstorj.io/common/process.ExecWithCustomConfigAndLogger:77\n\tstorj.io/common/process.ExecWithCustomConfig:72\n\tstorj.io/common/process.Exec:62\n\tmain.(*service).Execute.func1:107\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n\tmain.cmdRun:69\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tstorj.io/common/process.ExecWithCustomConfigAndLogger:77\n\tstorj.io/common/process.ExecWithCustomConfig:72\n\tstorj.io/common/process.Exec:62\n\tmain.(*service).Execute.func1:107\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-08-15T17:14:59+03:00	FATAL	Unrecoverable error	{"error": "Error opening database on storagenode: Opening table: \"D:\\\\filestatcache\\\\000963.sst\" error: failed to initialize table error: failed to read index. error: failed to verify checksum for table: D:\\filestatcache\\000963.sst error: actual: 2662894890, expected: 431282182 error: checksum mismatch\ngithub.com/dgraph-io/badger/v4/y.init\n\t/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.2.0/y/checksum.go:29\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6527\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6504\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6504\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6504\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6504\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6504\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:233\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598\n\tstorj.io/storj/storagenode/storagenodedb.cachedBlobstore:231\n\tstorj.io/storj/storagenode/storagenodedb.OpenExisting:250\n\tmain.cmdRun:67\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tstorj.io/common/process.ExecWithCustomConfigAndLogger:77\n\tstorj.io/common/process.ExecWithCustomConfig:72\n\tstorj.io/common/process.Exec:62\n\tmain.(*service).Execute.func1:107\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "errorVerbose": "Error opening database on storagenode: Opening table: \"D:\\\\filestatcache\\\\000963.sst\" error: failed to initialize table error: failed to read index. 
error: failed to verify checksum for table: D:\\filestatcache\\000963.sst error: actual: 2662894890, expected: 431282182 error: checksum mismatch\ngithub.com/dgraph-io/badger/v4/y.init\n\t/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.2.0/y/checksum.go:29\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6527\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6504\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6504\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6504\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6504\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6504\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:233\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598\n\tstorj.io/storj/storagenode/storagenodedb.cachedBlobstore:231\n\tstorj.io/storj/storagenode/storagenodedb.OpenExisting:250\n\tmain.cmdRun:67\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tstorj.io/common/process.ExecWithCustomConfigAndLogger:77\n\tstorj.io/common/process.ExecWithCustomConfig:72\n\tstorj.io/common/process.Exec:62\n\tmain.(*service).Execute.func1:107\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n\tmain.cmdRun:69\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tstorj.io/common/process.ExecWithCustomConfigAndLogger:77\n\tstorj.io/common/process.ExecWithCustomConfig:72\n\tstorj.io/common/process.Exec:62\n\tmain.(*service).Execute.func1:107\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}

@elek The badger cache sometimes does not shut down cleanly. I turned off the server from Windows to move it into a new case; after powering it back on, the cache of 4 nodes was broken and those nodes would not start. After deleting the cache files, they are working again.

Seems like corruption. I would suggest stopping the node, checking and fixing the filesystem, then removing the cache and restarting the node.
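On a Windows GUI node those steps could look roughly like this (the `D:\filestatcache` path is taken from the logs above and the service name `storagenode` is the default; adjust both to your setup and run from an elevated prompt):

```powershell
# Stop the node service
Stop-Service storagenode

# Check and repair the filesystem that holds the cache
chkdsk D: /f

# Remove the corrupted cache; the node will rebuild it over time
Remove-Item -Recurse -Force D:\filestatcache

# Start the node again
Start-Service storagenode
```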

The file system is fine (there are no problems and there never were); deleting the cache or disabling it solved the problem.
BUT I posted the log here as feedback for the developers that there is a problem and it needs to be fixed.

2 Likes

For that we need to know what caused the corruption.
Did you have restarts or power cuts?

There were no power outages.
Some of the nodes stopped after a restart, and some stopped during operation and the service could not be restored.

How were they restarted? Due to some error, did you issue the command, did the server restart, or was it restarted by the updater?

There were no power outages.
There were no emergency situations.
All restarts were only initialized by me for maintenance purposes.

Interesting. I restarted my node with the badger cache enabled and it survived the restart.
Something weird.