Badger cache: are we ready?

As I suggested before, an autodelete of the badger cache if it’s corrupted would be very useful.

4 Likes

If you want, you can write a script. I am against any kind of auto-deletion.

I lack the scripting skills. :grin:
I can live without the badger cache anyway; I run the FW once or twice a year.
Auto-deletion is already incorporated in the storagenode software… see piece deletion. :grin:

I totally understand the hesitation around auto-deletion—no one wants things removed without a good reason. That said, in this case, it might actually be helpful. When a node can’t start due to a corrupted badger cache, it’s pretty much stuck.

Having a process in place to either fix the corruption (ideally) or, if that’s not possible, delete the cache and regenerate it, could really help. Since deleting the cache doesn’t have any real downsides for the software, aside from needing to recreate it, this seems like a reasonable solution to keep things running smoothly without needing manual intervention every time.
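To make that concrete, here is a rough sketch of the kind of recovery path I have in mind, reusing the badger v4 library the node already depends on. The function name, the path safeguard, and the hard-coded path are my own illustration, not anything from the actual storagenode code:

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"

	badger "github.com/dgraph-io/badger/v4"
)

// openOrRegenerateCache tries to open the filestat cache; if badger cannot
// open it (e.g. corrupted tables), the cache is wiped and recreated empty so
// the node can start and repopulate it over time.
func openOrRegenerateCache(cacheDir string) (*badger.DB, error) {
	db, err := badger.Open(badger.DefaultOptions(cacheDir))
	if err == nil {
		return db, nil
	}

	// Safeguard: refuse to delete anything that does not look like the cache
	// directory, so a misconfigured path (e.g. "/") can never be wiped.
	if filepath.Base(cacheDir) != "filestatcache" {
		return nil, fmt.Errorf("refusing to reset unexpected path %q: %w", cacheDir, err)
	}

	log.Printf("badger cache failed to open (%v); deleting %s and regenerating", err, cacheDir)
	if rmErr := os.RemoveAll(cacheDir); rmErr != nil {
		return nil, rmErr
	}
	// badger recreates the directory and starts with an empty cache.
	return badger.Open(badger.DefaultOptions(cacheDir))
}

func main() {
	db, err := openOrRegenerateCache("/app/config/storage/filestatcache")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```

The point of the base-name check is that even a badly misconfigured path should never let the reset touch anything outside the cache directory.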

So, while I agree auto-deletions shouldn’t be used lightly, this might be one of those scenarios where it’s actually beneficial. It would be a way to keep things running without manual intervention, ensuring everything stays on track with minimal fuss. Just a thought!

Just imagine the cache path gets reset to /, just plain /. And you run it as root. Are you still convinced?
Mistakes happen. I do not like any auto-deletion under any conditions.
If you are so sure, use a script.

I would vote against any auto-deletion in the upstream.

Good point! I totally hear you on the risks—nobody wants to accidentally wipe out something important! But couldn’t we add some safeguards to make sure disasters like that can’t happen?

Also, the software is already deleting customer data from our nodes, so how are the risks of deleting the wrong files here any different?

If it’s not auto-deletion, then we need another solution. I can’t imagine SNOs manually fixing their cache every time it gets corrupted. I foresee longer downtimes. Without a solution, they might not even want to use the cache at all, which would be a pity!

Hopefully we figure out the best way forward.

1 Like

Sure, you can check whether cache files exist before you delete/overwrite them.

But is this really a problem here?
I have been running the badger cache for weeks now on 35 nodes and haven’t seen any corruption so far. Looks pretty stable to me.

During power outages I randomly see some corrupted filesystems; there is no automatic repair mechanism for those either, so I have to repair them manually anyway xd

With that argument, you would never be able to delete anything.

But if that’s really a problem, why not rename the cache folder and re-create it on node startup? No deletion required.
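Something like this, purely illustrative (the package and function names are made up): rename the existing directory out of the way and let the node recreate an empty one.

```go
// Hypothetical helper: move a possibly corrupted filestatcache aside so the
// node recreates an empty cache on the next start; nothing gets deleted.
package cacherecovery

import (
	"os"
	"time"
)

func moveCacheAside(cacheDir string) (backupDir string, err error) {
	// Timestamped backup name keeps earlier backups from colliding.
	backupDir = cacheDir + ".bak-" + time.Now().Format("20060102-150405")
	if err = os.Rename(cacheDir, backupDir); err != nil {
		return "", err
	}
	return backupDir, nil
}
```

That keeps the old data around for debugging and avoids the whole “accidentally deleted the wrong thing” class of mistakes.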

1 Like

Glad to hear your Badger cache is running smoothly! Mine is too! :blush: It’s great when things just work. However, since multiple SNOs have reported corruption issues, I think it’s worth looking into.

3 Likes

I have removed the setting that enables the badger cache from my nodes.
Yes, it helps with the piece scan on startup, but I don’t need this and I don’t need the extra worry.

2 Likes

Yeah, what I’ve settled into is 3 nodes that are slow; they need all the help they can get. I have badger enabled there and it seems fine. It has helped with the used-space filewalker across a couple of reboots.

And I have a few fast nodes, which actually seem fine even without badger and with the lazy filewalker. So no reason to complicate those.

1 Like

We can delete, if it’s a controlled process. Deleting something during a crash is just asking for trouble; the process is in its final state. I think it must not delete in that case.
I would prefer a fix instead, and the team is aware of the issue, so hopefully they will fix it.

"ERROR failure during run {“Process”: “storagenode”, “error”: “Error opening database on storagenode: Cannot write pid file "/app/config/storage/filestatcache/LOCK" error: open /app/config/storage/filestatcache/LOCK: read-only file”

After a disk failure, I did fsck and remounted. What is the suggested fix in this case?

I guess you may try to remove that file before starting the node.
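For example, something along these lines as a one-off cleanup before restarting the container (the path is taken from the error above; treat it as a sketch, not an official fix):

```go
// One-off cleanup: remove a stale badger LOCK file left behind after the
// read-only remount, ignoring the case where it is already gone.
package main

import (
	"errors"
	"log"
	"os"
)

func main() {
	const lock = "/app/config/storage/filestatcache/LOCK"
	if err := os.Remove(lock); err != nil && !errors.Is(err, os.ErrNotExist) {
		log.Fatalf("could not remove %s: %v", lock, err)
	}
	log.Printf("%s removed or already absent", lock)
}
```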

Now I have an error with the badger…
The node just stops. No error visible at info level.

It’s impossible to see the error in the log at the default level; debug level is necessary.

2024-09-20T18:26:22+02:00 INFO Configuration loaded {"Location": "A:\Storj\config.yaml"}
2024-09-20T18:26:22+02:00 INFO Anonymized tracing enabled
2024-09-20T18:26:22+02:00 DEBUG tracing collector started
2024-09-20T18:26:22+02:00 DEBUG debug server listening on 127.0.0.1:55225
2024-09-20T18:26:22+02:00 INFO Operator email {"Address": "mail"}
2024-09-20T18:26:22+02:00 INFO Operator wallet {"Address": "0x7E"}

2024-09-20T18:26:22+02:00 DEBUG db.filestatcache Got error while calculating total size of directory: I:\90_Storj\filestatcache

2024-09-20T18:26:49+02:00 INFO Configuration loaded {"Location": "A:\Storj\config.yaml"}
2024-09-20T18:26:49+02:00 INFO Anonymized tracing enabled
2024-09-20T18:26:49+02:00 DEBUG tracing collector started
2024-09-20T18:26:49+02:00 DEBUG debug server listening on 127.0.0.1:55386

I switched it off.

You may also just delete it.

This morning I woke up and saw that I was suspended and my node was in a restart loop.

The docker log says: "fatal error: sync: unlock of unlocked mutex". I renamed the filestatcache folder and things seem OK for now. I am no longer suspended.
Any idea what might cause this problem? It does not seem like a common issue. Maybe a hardware problem? This is a Raspberry Pi 4. I’ve been thinking about getting a Pi 5 with more RAM; not sure if that would make a difference. I like badger, but it seems to make my node less reliable.

Edit: Full disclosure: I experiment with my nodes more than I should, so maybe I broke something, but I doubt I could cause a mutex error.

fatal error: sync: unlock of unlocked mutex

goroutine 1117 [running]:
sync.fatal({0x1f7f259?, 0x43c23c?})
	/usr/local/go/src/runtime/panic.go:1007 +0x20
sync.(*Mutex).unlockSlow(0x4000d3a240, 0xffffffff)
	/usr/local/go/src/sync/mutex.go:229 +0x38
sync.(*Mutex).Unlock(0x400646b880?)
	/usr/local/go/src/sync/mutex.go:223 +0x58
panic({0x1de0a00?, 0x400bbadfe0?})
	/usr/local/go/src/runtime/panic.go:770 +0x124
github.com/spacemonkeygo/monkit/v3.newSpan.func1(0x4006616200)
	/go/pkg/mod/github.com/spacemonkeygo/monkit/v3@v3.0.23/ctx.go:155 +0x2c8
panic({0x1de0a00?, 0x400bbadfe0?})
	/usr/local/go/src/runtime/panic.go:770 +0x124
github.com/spacemonkeygo/monkit/v3.newSpan.func1(0x4006616250)
	/go/pkg/mod/github.com/spacemonkeygo/monkit/v3@v3.0.23/ctx.go:155 +0x2c8
panic({0x1de0a00?, 0x400bbadfe0?})
	/usr/local/go/src/runtime/panic.go:770 +0x124
github.com/spacemonkeygo/monkit/v3.newSpan.func1(0x4006616280)
	/go/pkg/mod/github.com/spacemonkeygo/monkit/v3@v3.0.23/ctx.go:155 +0x2c8
panic({0x1de0a00?, 0x400bbadfe0?})
	/usr/local/go/src/runtime/panic.go:770 +0x124
github.com/spacemonkeygo/monkit/v3.newSpan.func1(0x4006616450)
	/go/pkg/mod/github.com/spacemonkeygo/monkit/v3@v3.0.23/ctx.go:155 +0x2c8
panic({0x1de0a00?, 0x400bbadfe0?})
	/usr/local/go/src/runtime/panic.go:770 +0x124
github.com/spacemonkeygo/monkit/v3.newSpan.func1(0x4006616470)
	/go/pkg/mod/github.com/spacemonkeygo/monkit/v3@v3.0.23/ctx.go:155 +0x2c8
panic({0x1de0a00?, 0x400bbadfe0?})
	/usr/local/go/src/runtime/panic.go:770 +0x124
github.com/spacemonkeygo/monkit/v3.newSpan.func1(0x4006616490)
	/go/pkg/mod/github.com/spacemonkeygo/monkit/v3@v3.0.23/ctx.go:155 +0x2c8
panic({0x1de0a00?, 0x400bbadfe0?})
	/usr/local/go/src/runtime/panic.go:770 +0x124
github.com/dgraph-io/badger/v4/table.(*blockIterator).setIdx(0x4006f21e88?, 0x400079e1c0?)
	/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.2.0/table/iterator.go:77 +0x50c
github.com/dgraph-io/badger/v4/table.(*blockIterator).seek.func1(0x400079e1c0?)
	/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.2.0/table/iterator.go:151 +0x60
sort.Search(0x4006f21ec8?, 0x4006f21eb8)
	/usr/local/go/src/sort/search.go:65 +0x48
github.com/dgraph-io/badger/v4/table.(*blockIterator).seek(0x4007ed36d0, {0x4007dbd360?, 0x4006f21f01?, 0x39e?}, 0x39d?)
	/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.2.0/table/iterator.go:146 +0xb0
github.com/dgraph-io/badger/v4/table.(*Iterator).seekHelper(0x4007ed36c0, 0x4006f21f88?, {0x4007dbd360, 0x48, 0x48})
	/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.2.0/table/iterator.go:262 +0xd0
github.com/dgraph-io/badger/v4/table.(*Iterator).seekFrom(0x4007ed36c0, {0x4007dbd360, 0x48, 0x48}, 0x4000aa0026?)
	/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.2.0/table/iterator.go:294 +0xfc
github.com/dgraph-io/badger/v4/table.(*Iterator).seek(...)
	/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.2.0/table/iterator.go:310
github.com/dgraph-io/badger/v4/table.(*Iterator).Seek(0x40009305a0?, {0x4007dbd360?, 0x2?, 0x1?})
	/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.2.0/table/iterator.go:424 +0x40
github.com/dgraph-io/badger/v4.(*levelHandler).get(0x4000dbc5a0, {0x4007dbd360, 0x48, 0x48})
	/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.2.0/level_handler.go:293 +0x220
github.com/dgraph-io/badger/v4.(*levelsController).get(0x40001f0070, {0x4007dbd360, 0x48, 0x48}, {0x0, 0x0, 0x0, {0x0, 0x0, 0x0}, ...}, ...)
	/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.2.0/levels.go:1610 +0x134
github.com/dgraph-io/badger/v4.(*DB).get(0x4000d5a908, {0x4007dbd360, 0x48, 0x48})
	/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.2.0/db.go:785 +0x364
github.com/dgraph-io/badger/v4.(*Txn).Get(0x4007fee100, {0x4007feae80, 0x40, 0x40})
	/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.2.0/txn.go:479 +0x234
storj.io/storj/storagenode/blobstore/statcache.(*BadgerCache).Get(0x4007fc32c0?, {0x2435fe0?, 0x400093c640?}, {0x40063f58e0, 0x20, 0x20}, {0x4007feadc0, 0x20, 0xede804ec2?})
	/go/src/storj.io/storj/storagenode/blobstore/statcache/badger.go:43 +0xe8
storj.io/storj/storagenode/blobstore/statcache.BlobInfo.Stat({{0x2436248?, 0x4007fc3560?}, {0x2435ec8?, 0x400011a250?}}, {0x2435fe0, 0x400093c960})
	/go/src/storj.io/storj/storagenode/blobstore/statcache/statcache.go:51 +0x11c
storj.io/storj/storagenode/pieces.storedPieceAccess.ModTime({{0x24362f0, 0x4007f8bd80}, {0x33, 0xdc, 0x2f, 0xe7, 0xab, 0x9c, 0x90, 0x96, ...}, ...}, ...)
	/go/src/storj.io/storj/storagenode/pieces/store.go:921 +0x3c
storj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePiecesToTrash.func2({0x2440d60, 0x4007fe8580})
	/go/src/storj.io/storj/storagenode/pieces/filewalker.go:214 +0x39c
storj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces.func1({0x24362f0, 0x4007f8bd80})
	/go/src/storj.io/storj/storagenode/pieces/filewalker.go:66 +0xf4
storj.io/storj/storagenode/blobstore/statcache.(*CachedStatBlobstore).WalkNamespace.func1({0x2436248?, 0x4007fc3560?})
	/go/src/storj.io/storj/storagenode/blobstore/statcache/statcache.go:87 +0x78
storj.io/storj/storagenode/blobstore/filestore.walkNamespaceWithPrefix({0x2435fe0, 0x400093d360}, {0x40063f58e0, 0x20, 0x20}, {0x400692a910?, 0x4007c66e20?}, {0x40074707a2, 0x2}, 0x4007142be8)
	/go/src/storj.io/storj/storagenode/blobstore/filestore/dir.go:1027 +0x2b8
storj.io/storj/storagenode/blobstore/filestore.(*Dir).walkNamespaceUnderPath(0x400012c060, {0x2435fe0, 0x400093d360}, {0x40063f58e0, 0x20, 0x20}, {0x400692a910, 0x49}, {0x0, 0x0}, ...)
	/go/src/storj.io/storj/storagenode/blobstore/filestore/dir.go:897 +0x344
storj.io/storj/storagenode/blobstore/filestore.(*Dir).walkNamespaceInPath(0x400012c060, {0x2435fe0, 0x400093d360}, {0x40063f58e0, 0x20, 0x20}, {0x4000bce018, 0x14}, {0x0, 0x0}, ...)
	/go/src/storj.io/storj/storagenode/blobstore/filestore/dir.go:862 +0x188
storj.io/storj/storagenode/blobstore/filestore.(*Dir).WalkNamespace(0x400012c060, {0x2435fe0, 0x400093d180}, {0x40063f58e0, 0x20, 0x20}, {0x0, 0x0}, 0x4007142be8)
	/go/src/storj.io/storj/storagenode/blobstore/filestore/dir.go:855 +0x134
storj.io/storj/storagenode/blobstore/filestore.(*blobStore).WalkNamespace(0x2435fe0?, {0x2435fe0?, 0x400093cfa0?}, {0x40063f58e0?, 0x0?, 0x0?}, {0x0?, 0x7eceddeea8?}, 0x18?)
	/go/src/storj.io/storj/storagenode/blobstore/filestore/store.go:329 +0x30
storj.io/storj/storagenode/blobstore/statcache.(*CachedStatBlobstore).WalkNamespace(0x40008476b0, {0x2435fe0, 0x400093cfa0}, {0x40063f58e0, 0x20, 0x20}, {0x0, 0x0}, 0x4007142bd0)
	/go/src/storj.io/storj/storagenode/blobstore/statcache/statcache.go:86 +0xbc
storj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces(0x40001574c0, {0x2435fe0, 0x400093cfa0}, {0xa2, 0x8b, 0x4f, 0x4, 0xe1, 0xb, 0xae, ...}, ...)
	/go/src/storj.io/storj/storagenode/pieces/filewalker.go:54 +0x198
storj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePiecesToTrash(0x40001574c0, {0x2435fe0, 0x400093c820}, {0xa2, 0x8b, 0x4f, 0x4, 0xe1, 0xb, 0xae, ...}, ...)
	/go/src/storj.io/storj/storagenode/pieces/filewalker.go:181 +0x7d8
storj.io/storj/storagenode/pieces.(*Store).WalkSatellitePiecesToTrash(0x4000985500, {0x2435fe0, 0x400093c820}, {0xa2, 0x8b, 0x4f, 0x4, 0xe1, 0xb, 0xae, ...}, ...)
	/go/src/storj.io/storj/storagenode/pieces/store.go:582 +0x36c
storj.io/storj/storagenode/retain.(*Service).retainPieces(0x40009e41e0, {0x2435cc8, 0x40068139a0}, {{0x0, 0x0}, {0xa2, 0x8b, 0x4f, 0x4, 0xe1, ...}, ...})
	/go/src/storj.io/storj/storagenode/retain/retain.go:380 +0x734
storj.io/storj/storagenode/retain.(*Service).Run.func2()
	/go/src/storj.io/storj/storagenode/retain/retain.go:265 +0x18c
golang.org/x/sync/errgroup.(*Group).Go.func1()
	/go/pkg/mod/golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78 +0x58
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 1100
	/go/pkg/mod/golang.org/x/sync@v0.7.0/errgroup/errgroup.go:75 +0x98

For the same money, just buy an x64 mini-PC like an N100 with 8-16 GB RAM; most of them also come with an SSD and at least a case.

It must have been going on quite a while if your node was suspended. Did you never get an earlier message about being offline?

No, it rather seems to be a software problem. But we will probably never know what it was, since you already removed the files (which is a good solution to start with, so no criticism).

1 Like

+1. An RPi is an excellent way to connect GPIO pins to an IP network. It’s the best option for many projects! But it’s a craptacular general-purpose computer: x64 delivers much better bang for your buck for running Storj nodes.

1 Like

I’m having trouble finding an N100 mini PC for under $100 USD. A Pi 5 (with 8 GB RAM) is about $80, maybe $92 after the power supply, but with no case. Most of the N100s I’m seeing go for about $150 or more. Let me know if I’m missing something. We might have to split the thread for this topic, though.

I think it all happened fairly quickly.
I also received an offline email, but it arrived 15 minutes after the suspension email, which arrived about 4 hours after my node first started crashing.

Well, technically, I renamed the badger filestatcache folder, which caused the node to make a new one, so I still have the old data, but I don’t have the skills to debug it. I’m thinking badger was never designed for or tested on little potato computers with ARM processors, but who knows… Badger has been working a lot better for me ever since I upgraded the Pi OS from 32-bit to 64-bit, but still, I don’t think I will leave it enabled knowing it will probably crash my node again at some random time in the future.