My node has been auto-suspended

Last night my node was suspended by 3 satellites for no apparent reason.
The node is fully functional: no power loss, connected to a UPS, no failures at all.

When I checked the logs, there were many “database is locked” errors.
I restarted the node and everything went fine. Now it’s recovering fast, with many PUT and GET requests coming in without errors. I hope I will be unsuspended very soon.
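For anyone who wants to check their own logs for the same symptom, here is a minimal sketch that pulls out every line containing the error text. The function name `find_locked_errors` and the idea of passing the log path on the command line are illustrative assumptions; the only thing taken from this thread is the message “database is locked”.

```python
import sys


def find_locked_errors(lines, needle="database is locked"):
    """Return every log line that contains the given error text."""
    return [line.rstrip("\n") for line in lines if needle in line]


if __name__ == "__main__":
    # Hypothetical usage: python find_locked.py /path/to/node.log
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
            hits = find_locked_errors(log)
        print(f"{len(hits)} matching line(s)")
        for hit in hits[-10:]:  # show only the most recent few
            print(hit)
```

The matching lines usually name the affected database, which is the detail worth reporting.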

This is clearly a design flaw: SQLite is not suitable for multi-threaded applications out of the box; it must be configured properly.
This is serious because nodes cannot be left unattended; I’m forced to schedule a stop/start in the crontab every day.
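To illustrate what “configured properly” can mean in general, here is a small sketch using Python’s standard `sqlite3` module. It is not how the storagenode is implemented (that code is not shown in this thread); it only demonstrates two common SQLite settings for concurrent access: write-ahead logging and a busy timeout, so that a writer waits for a lock instead of failing immediately. The helper name `open_configured` and the 5000 ms default are assumptions.

```python
import sqlite3


def open_configured(path, busy_timeout_ms=5000):
    """Open an SQLite database with settings that tolerate concurrent access."""
    conn = sqlite3.connect(path)
    # WAL lets readers keep working while a single writer is active.
    conn.execute("PRAGMA journal_mode=WAL")
    # Wait up to busy_timeout_ms for a lock instead of raising
    # "database is locked" right away.
    conn.execute(f"PRAGMA busy_timeout={int(busy_timeout_ms)}")
    return conn
```

Even with these settings SQLite allows only one writer at a time, so a long enough write can still exhaust the timeout.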

I think the Storj team cannot talk about node reputation as long as a node can get auto-suspended because of design flaws.

It’s a known issue that’s being worked on. In the meantime you can try to vacuum and defrag the databases. This has helped some users, though it won’t eliminate the issue entirely. Hopefully an update will be ready soon to get rid of some of these issues. May I ask, is it the used_serial.db file for you as well? Or orders.db? Those are the two we’ve seen issues with so far.
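As a rough sketch of the vacuum step, the following uses Python’s standard `sqlite3` module to run an integrity check and then `VACUUM` on a single database file. The function name `check_and_vacuum` is an assumption, and so is the workflow around it: stop the node first and keep a backup copy of the file, since the database should not be in use while this runs.

```python
import os
import sqlite3


def check_and_vacuum(path):
    """Run an integrity check, then VACUUM; return (ok, size_before, size_after)."""
    before = os.path.getsize(path)
    # isolation_level=None means autocommit; VACUUM cannot run in a transaction.
    conn = sqlite3.connect(path, isolation_level=None)
    try:
        result = conn.execute("PRAGMA integrity_check").fetchone()[0]
        if result != "ok":
            # Do not vacuum a database that fails the check.
            return False, before, before
        conn.execute("VACUUM")
    finally:
        conn.close()
    return True, before, os.path.getsize(path)
```

Calling it on a copy of used_serial.db or orders.db and comparing the two sizes shows how much free space the vacuum reclaimed.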

The locked file was used_serial.db.
Can you confirm these issues won’t affect reputation, until they are fixed?
Thanks

Well, that means you don’t expect a response from our side. OK, that’s fine with me :smiley:

Currently it can lead to suspension, but not disqualification; that has been paused for now while the issues are being resolved. So it won’t affect anything permanently. Reputation recovers quickly, so I wouldn’t worry about that too much. Storjlabs is actually pretty responsive with this stuff; everything I’m telling you I heard from them.

I doubt that was @niccotnt’s intention. :wink: Feel free to add anything I haven’t mentioned or correct anything I’ve been wrong about.

Sorry, English is not my native language. I meant: “I think the Storj team cannot trust the node reputation algorithm until these issues are fixed.”

I understand that suspension mode is not working great. We are working on a solution. There is a different thread that outlines some of the solutions. In this thread I want to explain why we need suspension mode in the meantime, even if it is not great. Disabling it would make things worse.

Your node is returning an error on audits. The error might not be correct, but we can’t simply ignore it. If we ignored the audit error, we would risk file durability. A repair download or a customer download will hit the same error message. Thanks to suspension mode we are aware of that error rate and can trigger repair before we are in danger of losing any files. Many storage node operators also managed to decrease the error rate by running vacuum. From that perspective suspension mode is working as intended and we need to keep it enabled.

Disqualification after 7 days in suspension is still disabled.

Yes, I agree with you, and I’m confident you will keep disqualification for suspension disabled until the “database locked” issue is resolved. Of course, the main goal for operators is to leave the nodes unattended with just some basic monitoring once all the hardware/configuration requirements are met.
I would like to be notified only if a real problem occurs (e.g. disk failures, inode corruption), and I’d like to run maintenance weekly. Checking node logs every few hours can be frustrating, especially if I have a daytime job.

Use uptimerobot.com to get notified when your node is offline. You would get an email notification and could then investigate the issue.
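For those who prefer a self-hosted check, the core of such a monitor is just a TCP connection attempt against the node’s public address. Below is a minimal sketch with the standard library; the function name `port_is_open` is an assumption, and the host and port are whatever your own node is reachable on.

```python
import socket


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Note that this only tells you whether the node is reachable. A node that is online but failing audits with “database is locked” would still pass this check, so it does not replace suspension emails or log checks.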

It technically is a real problem, it’s just not one that you can fix. Would you rather have your node go into suspension mode quietly without being notified?

I want notifications for suspension mode, but I do not want suspension mode to be triggered by internal issues.

Did you use a real email address in your node setup? Suspension emails are sent to that address.

If we could all predict which issues were going to pop up, that distinction could be made. But I don’t think Storjlabs has a crystal ball. Whether an issue is caused by a problem on the node system or might be related to a software shortcoming can only be found out if it pops up and is subsequently reported.

When it did, they immediately took steps to prevent nodes from getting disqualified and started to work on a fix. It was already explained why suspension couldn’t just be disabled in the meantime. I understand that it’s a little annoying to be caught in this, but you can’t catch everything in internal testing.

Or as Apple would say it: “We have determined that a small percentage of storagenodes may exhibit problems with database locks”

Difference is, when Apple introduced a bad keyboard design they stuck with it for years across several models. Storjlabs is working on an actual fix right away. It just takes some time to build that, test it internally and push the release.

Same here:
“usedserialsdb error: disk I/O error: The device is not ready.”

@littleskunk

Can you please link this thread you are referring to? Thank you.

We all have to use the same search feature…
But this time I did it for you.

This script is for Linux + Docker. I’m not sure what your setup is.

Thank you very much. I didn’t know what to search for and wouldn’t have known if I had found the thread littleskunk was referring to. I am on Windows 10, but I can adapt the workflow. I already performed an integrity check about a week ago due to suspected database corruption (which I tried to fix, hopefully successfully, for pricing.db), but I can perform it again and can also include a vacuum beforehand.

Thanks again!

My node was un-suspended yesterday, but last night more email alerts came in…
This is the situation now:

I have no errors in the logs; here are some screenshots:


(uptime = 2m because I restarted it)
Any ideas? Can you provide a link explaining the “suspension” algorithm? Maybe then I can understand what’s happening.

There’s a blueprint here that goes into how it works, but it won’t contain specifics about this database locked issue.

Keep in mind that disqualification after suspension isn’t active atm.