Badger cache: are we ready?

As I suggested before, an auto-delete of the badger cache if it’s corrupted would be very useful.

4 Likes

If you want, you can write a script. I am against any kind of auto-deletion.
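
For anyone who wants to go that route, here is a rough sketch of what such a wrapper could look like. This is nothing official, just an illustration: the binary path, the `run` argument, and the cache location are assumptions you would adapt to your own setup (and to Docker, if you use it).

```go
// A hypothetical wrapper, not official Storj tooling. It runs the node once;
// if the node fails and its output mentions the badger cache directory, it
// removes that cache (it is only a cache and gets rebuilt) and retries once.
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
	"os/exec"
	"strings"
)

const (
	nodeBinary = "/usr/local/bin/storagenode"        // assumption: adjust to your install
	cacheDir   = "/app/config/storage/filestatcache" // assumption: the path from the error later in this thread
)

// runNode starts the node, mirrors its output to the console, and keeps a
// copy of that output so it can be inspected on failure.
func runNode() (string, error) {
	var buf bytes.Buffer
	cmd := exec.Command(nodeBinary, "run")
	cmd.Stdout = io.MultiWriter(os.Stdout, &buf)
	cmd.Stderr = io.MultiWriter(os.Stderr, &buf)
	err := cmd.Run()
	return buf.String(), err
}

func main() {
	out, err := runNode()
	if err != nil && strings.Contains(out, "filestatcache") {
		fmt.Fprintln(os.Stderr, "badger cache looks corrupted, removing", cacheDir)
		if rmErr := os.RemoveAll(cacheDir); rmErr != nil {
			fmt.Fprintln(os.Stderr, "could not remove cache:", rmErr)
			os.Exit(1)
		}
		_, err = runNode()
	}
	if err != nil {
		os.Exit(1)
	}
}
```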

I lack the scripting skills. :grin:
I can live without the badger cache anyway; I run the FW once or twice a year.
The auto deletion is already incorporated in the storagenode software… see piece deletion. :grin:

I totally understand the hesitation around auto-deletion—no one wants things removed without a good reason. That said, in this case, it might actually be helpful. When a node can’t start due to a corrupted badger cache, it’s pretty much stuck.

Having a process in place to either fix the corruption (ideally) or, if that’s not possible, delete the cache and regenerate it, could really help. Since deleting the cache doesn’t have any real downsides for the software, aside from needing to recreate it, this seems like a reasonable solution to keep things running smoothly without needing manual intervention every time.
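
To make that concrete, here is a minimal sketch of what such startup logic could look like if it were written directly against the badger library. This is not the storagenode’s actual cache code; the package name, function, and `cachePath` handling are made up for illustration.

```go
// A sketch only: an illustration of "open, and if that fails, drop the cache
// and rebuild", not the storagenode's real cache wiring.
package cachesketch

import (
	"log"
	"os"

	badger "github.com/dgraph-io/badger/v4"
)

// openFileStatCache tries to open the badger-backed file-stat cache. If the
// open fails (for example after corruption), the directory is removed and the
// cache is recreated empty, since it only holds derivable metadata.
func openFileStatCache(cachePath string) (*badger.DB, error) {
	db, err := badger.Open(badger.DefaultOptions(cachePath))
	if err == nil {
		return db, nil
	}
	log.Printf("badger cache at %s failed to open (%v); recreating it", cachePath, err)
	if rmErr := os.RemoveAll(cachePath); rmErr != nil {
		return nil, rmErr
	}
	return badger.Open(badger.DefaultOptions(cachePath))
}
```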

So, while I agree auto-deletions shouldn’t be used lightly, this might be one of those scenarios where it’s actually beneficial. It would be a way to keep things running without manual intervention, ensuring everything stays on track with minimal fuss. Just a thought!

Just imagine the cache path is reset to /, just plain /, and you run the node as root. Are you still convinced?
Mistakes happen. I do not like any auto-deletion under any condition.
If you are so sure, use the script.

I would vote against any auto-deletion in the upstream.

Good point! I totally hear you on the risks—nobody wants to accidentally wipe out something important! But couldn’t we add some safeguards to make sure disasters like that can’t happen?

Also, the software is already deleting customer data from our nodes, so how are the risks of deleting the wrong files here any different?

If it’s not auto-deletion, then we need another solution. I can’t imagine SNOs manually fixing their cache every time it gets corrupted. I foresee longer downtimes. Without a solution, they might not even want to use the cache at all, which would be a pity!

Hopefully we figure out the best way forward.

1 Like

Sure, you could check whether cache files exist before you delete/overwrite them.

But is this really a problem here?
I have been running the badger cache for weeks now on 35 nodes and don’t see any corruption so far. Looks pretty stable to me.

During power outages I occasionally see corrupted filesystems; there is no automatic repair mechanism for those either, so I have to repair them manually anyway. xd

With that argument, you would never be able to delete anything.

But if that’s really a problem, why not rename the cache folder and re-create it on node startup? No deletion required.
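
As a rough sketch of that idea (again hypothetical, not anything from the actual codebase; the function name and timestamp suffix are made up):

```go
// A hypothetical sketch, not real storagenode code: move a suspect cache
// directory aside instead of deleting it, so the node recreates a fresh one
// and the old data stays around for inspection.
package cachesketch

import (
	"fmt"
	"os"
	"time"
)

func quarantineCache(cachePath string) error {
	if _, err := os.Stat(cachePath); os.IsNotExist(err) {
		return nil // nothing to move aside
	}
	backup := fmt.Sprintf("%s.corrupt-%s", cachePath, time.Now().Format("20060102-150405"))
	return os.Rename(cachePath, backup)
}
```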

1 Like

Glad to hear your Badger cache is running smoothly! Mine is too! :blush: It’s great when things just work. However, since multiple SNOs have reported corruption issues, I think it’s worth looking into.

3 Likes

I have removed the setting that enables the badger cache from my nodes.
Yes, it helps with the pieces scan on startup, but I don’t need this and I don’t need the extra worry.

2 Likes

Yeah, what I’ve settled into is 3 nodes that are slow. They need all the help they can get. I have badger enabled there and it seems fine. It has helped with a couple of used-space filewalker runs after reboots.

And I have a few fast nodes, which actually seem fine even without badger and with the lazy filewalker. So there’s no reason to complicate those.

1 Like

We can delete if it’s a controlled process. Deleting something on a crash is just asking for trouble, because the process is already in its final state. I think it must not delete in that case.
I would prefer a fix instead, and the team is aware of the issue, so hopefully it will be fixed.

"ERROR failure during run {“Process”: “storagenode”, “error”: “Error opening database on storagenode: Cannot write pid file "/app/config/storage/filestatcache/LOCK" error: open /app/config/storage/filestatcache/LOCK: read-only file”

This is after a disk failure. I did fsck and remounted. What is the suggested fix in this case?

I guess you may try to remove that file before starting the node.