Disqualified without reason

jammerdan · May 25, 2024, 8:18am

Yes, but then you would have logs and maybe then it would help both sides to understand what was happening.

It is likely that it will happen again, so why not unban him again, tell him to set the logs to debug and not delete them, and then see what happens?

Alexey · May 25, 2024, 8:19am

They will not tell us anything about why it moved pieces if the pieces have already been moved to the trash (and will eventually be deleted by the trash filewalker).

It will not help; it will be disqualified again for lost data in a few hours. We tried. So no hope here.

jammerdan · May 25, 2024, 8:34am

But it was clear that he no longer had the logs before you unbanned him.
So the only chance of getting any logs at all would have been to tell him that he was going to be unbanned and that he should therefore stop deleting logs, and only then unban him.

Because you see: he did not know that he had been unbanned and therefore kept deleting the logs. It does not matter whether they have any useful information or not. If they don’t have useful information, you don’t lose anything.

If you plan to unban, give notice. That’s just fair. In other cases SNOs might already have started deleting the files of the satellite that disqualified them. Then you unban him and he is doomed anyway.

Mitsos · May 25, 2024, 8:48am

To me it sounds like something went wrong during GC. Maybe the node was unstable (bad RAM, bad storage) and corrupted something in the GC that caused the node to delete more than it should. In either case, unbanning and giving notice wouldn’t have helped. The node lost data, so it would have been disqualified anyway. Since this is the only node out of 22500 active nodes that this happened to, I’m very inclined to believe that it was something very specific to this node, which is why I’m personally leaning towards a hardware error.

That being said: logging can’t currently be used because the node logs way more than it should. For example, cancelled uploads (which are normal node behavior) get logged as errors. There is simply no way to watch the logs live and make any sense out of them. That means that even if OP wanted to troubleshoot this, he/she/it couldn’t do it live, and would still be banned while we all go through the logs.

jammerdan · May 25, 2024, 8:50am

Why live? Redirecting them to a file would be my thought.

Mitsos · May 25, 2024, 8:57am

What’s faster: noticing error logs live, or waiting for the day to finish so that you can react?

An example: when you restart a node on a slow drive, the node logs a database locked error, but it also starts spamming the log with useless order/upload/download cancelled errors. By the time you notice the database locked error, a few seconds have passed.

I’m not saying that it’s bad to see the locked error. I’m saying that important errors should get logged as errors and normal behavior should get logged as info. That way I (as an SNO) can pick what I want to log without needing any extra processing (logs don’t just vanish into thin air, they need to be processed/stored/rotated). For the life of me I can’t understand why certain logging decisions were made that way (with regards to storagenode).
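
To show what "picking what I want to log" could look like, here is a minimal sketch in Python. It assumes tab-separated log lines of the form timestamp, level, component, message; that layout and the names `LEVELS`, `line_level`, `keep_line` and `filter_log` are assumptions made for illustration, not the storagenode’s documented format or API.

```python
# Hypothetical level filter for log lines.
# Assumed (unconfirmed) line layout: "<timestamp>\t<LEVEL>\t<component>\t<message>"
LEVELS = {"DEBUG": 0, "INFO": 1, "WARN": 2, "ERROR": 3, "FATAL": 4}


def line_level(line):
    """Return the numeric level of a log line, or None if it cannot be parsed."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 2:
        return None
    return LEVELS.get(parts[1].strip().upper())


def keep_line(line, min_level="ERROR"):
    """True if the line is at or above min_level; unparseable lines are kept."""
    level = line_level(line)
    if level is None:
        return True  # never silently drop something we do not understand
    return level >= LEVELS[min_level]


def filter_log(lines, min_level="ERROR"):
    """Return only the lines that pass keep_line, preserving their order."""
    return [line for line in lines if keep_line(line, min_level)]
```

A filter like this is only as good as the levels the node assigns: a cancelled upload that is logged as ERROR passes straight through it, which is exactly the complaint.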

jammerdan · May 25, 2024, 9:09am

He said he deletes logs every day.
So basically he is storing them but throwing them away after 24 hours.
He did not notice the original errors, so there are no logs for that.
But before he got unbanned he would at least have had a chance to stop the daily deletion and then run his node again. So he would have kept the logs.
If the error does not show again, ok, fine. If it does show again, then he or Storj would be able to check whether there is something about it in the logs.
I do not see a reason to do this live. If there is no error shown in the logs, then he cannot do anything in real time anyway. And if there is an error, he can check with Storj later what the reason was, whether it can be resolved, and whether they would unban him again if he is not at fault for the error.

Mitsos · May 25, 2024, 9:12am

Could it be that logs grow too much and need to be thrown away?

Don’t always focus on the one problem in front of you. Also take into consideration what led to the problem.

jammerdan · May 25, 2024, 9:18am

Of course.
But look what he wrote:

So there is no reason to believe he would not be able to turn it off.
What is so hard about asking a user: “Hey, we plan to unban your node, can you turn off your log cleaning for a while so we might see the reason for the strange behavior?”

Mitsos · May 25, 2024, 9:26am

No arguments on my side there. I completely agree.

How about if, instead of the logs being off in the first place, they had been kept under control (by not spamming them with useless messages I can do nothing about) and we now had a:

Critical: Got audit for a piece that was trashed on XXXX-XX-XX.

to work with?

There are RFCs that deal with these things, one of them being RFC 5424 - The Syslog Protocol.
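
For reference, RFC 5424 defines eight severities (0 = Emergency to 7 = Debug) and combines them with a facility into one priority value, PRI = facility * 8 + severity. A minimal sketch in Python follows; the helper names `encode_pri` and `decode_pri` are made up for illustration, and using `local0` as the facility is purely an assumption, since nothing here says which facility a storagenode would use.

```python
# Severity values as defined in RFC 5424 (section 6.2.1).
SEVERITY = {
    "emergency": 0, "alert": 1, "critical": 2, "error": 3,
    "warning": 4, "notice": 5, "informational": 6, "debug": 7,
}

LOCAL0 = 16  # numeric code of the local0 facility; chosen here only as an example


def encode_pri(facility, severity):
    """Combine facility (0-23) and severity (0-7) into a syslog PRI value."""
    if not 0 <= facility <= 23:
        raise ValueError("facility must be in 0..23")
    if not 0 <= severity <= 7:
        raise ValueError("severity must be in 0..7")
    return facility * 8 + severity


def decode_pri(pri):
    """Split a PRI value back into (facility, severity)."""
    return divmod(pri, 8)
```

With this scheme a "Critical" message on `local0` gets PRI 130, and a receiver can filter on severity alone without knowing anything about which component wrote the message.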

jammerdan · May 25, 2024, 9:33am

Yes, I agree.
But I think there have already been many suggestions on how to make them less verbose or more precise according to the level you choose.

Recently there was a change for customizing logs, but I have no idea if that would have helped or even how to use it:

Release preparation v1.99

b1eec59 storagenode: make most of the storagenode logs configurable

This would be interesting, yes.

Mitsos · May 25, 2024, 9:39am

I asked about it before, and it turns out that it only filters based on what logged it. It doesn’t actually change the facility/priority of the logs. Again, it would be a useless feature if logs were actually implemented according to the relevant RFC.

I’m thinking of working on fixing the logging, but since I’m not a programmer, it may take a while.

Alexey · May 25, 2024, 9:50am

We hoped that it would be in time to restore the data, but unfortunately not. It failed, and all the trashed data was deleted.

jammerdan · May 25, 2024, 9:55am

It’s a pity because it was such an unusual event.

Alexey · May 25, 2024, 10:00am

Yep. So enabling logs wouldn’t help if the node didn’t manage to survive.
We wanted logs from when it receives a BF and starts moving pieces to the trash, to get an idea of why it deleted too much.
If it didn’t manage to survive, then it’s useless, unfortunately…
It would be too difficult to extract the reason if the node has also corrupted or lost pieces along the way.