Put the node into an offline/suspended state when audits are failing

Following up on this thread regarding a problem I had on my node which led to a disqualification, I would kindly ask the development team to add a feature which would put the node “on hold” in case of audit failures, to avoid it being disqualified. The SNO would then have a certain number of days to fix the problem before being disqualified.

Summarising the thread linked above:

  1. I got disqualified on 4 satellites in a matter of hours
  2. I had no clue at all about what was going on, because the log was clean, the hardware was responsive, the storage was fine, the node software was up, the Internet connection was up, and my monitoring software was able to connect to the node port
  3. It was perhaps a Linux kernel bug which prevented communication in some way, but neither Storj nor I was able to figure out the root cause
  4. On the satellite side, they could only see that the node was unable to fulfill requests

The problems started at 12:41 and I was disqualified at 16:39, less than 4 hours later.

The point is that the satellite knew that something was wrong, but the information was not propagated to the SNO, who would have fixed the problem if only he had known that the node was suffering.

So the request is to immediately suspend the node in cases like this instead of disqualifying it, to preserve the integrity of the Storj network and at the same time allow the operator to sort out the issue.

Thanks
Pietro

Thank you for putting it explicitly as an idea! There was a discussion about this idea before, but I don’t think it led to any insight…

May I ask how you monitored the node port? I.e. internal monitoring using something like Zabbix within the local network can be troublesome if, for example, the external port on your router/firewall got closed.

In that thread it also looks like your kernel and Ethernet interfaces definitely had some weird behaviour. The question would be: how should we notify you if we cannot reach the node directly to show it on the dashboard, for example? Would email suffice? And what reaction time would be acceptable after sending the notification? Minutes, hours?
There are a lot of “edge” cases that can lead to disqualification, and it will be impossible to cover/predict all of them. You also have to see the network in its entirety. If we put X% of the network into suspension, we limit throughput, when the cause might be just a single bad sector on your disk and a single audit failing over and over again.

I generally like the idea of being a little more careful, but at the same time we cannot forget the safety of the network, and at some point we need to err on the side of its safety/health.
Looking forward to figuring out possible steps forward!

I’m using the following command from a cloud VM:

nmap -Pn -oG - -p 28967 MY.INTERNET.HOST.NAME | grep -q open

It’s a difficult balance. Keeping nodes that have lost data on the network is dangerous, so any easing of disqualification rules has negative side effects. I do think disqualification can be a little too fast, and an earlier suggestion I made would make that a little better: Tuning audit scoring

But to be honest, that’s only a workaround. Instead of having the satellite deal with this, I think the node software itself should do a better job of monitoring itself. We’ve seen a little too frequently that nodes become basically unresponsive, to the point where even log lines just stop being written, yet they remain responsive enough to still accept the audit challenge.

In my opinion, nodes should monitor their own performance and if they are not capable of responding to a request fast enough, they should just terminate. This saves the satellite from having to find out what is going on and having to distinguish between a node running into issues or a node trying to fool the audit system.
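
A rough external approximation of that idea, which an operator could already run today, might look like the sketch below. It assumes a Docker-based node in a container named “storagenode”, logs going to the Docker log driver, and the storage mounted at /mnt/storj; all names and thresholds are placeholders, and a proper in-process self-check inside the node software would obviously be cleaner.

#!/bin/sh
# Watchdog sketch: restart the node if it looks wedged.
# Assumption: a healthy node writes at least one log line every 10 minutes
# and a directory listing on the storage mount finishes within 30 seconds.
RECENT_LOGS=$(docker logs --since 10m storagenode 2>&1 | wc -l)
timeout 30 ls /mnt/storj/storage/blobs > /dev/null 2>&1
READ_OK=$?
if [ "$RECENT_LOGS" -eq 0 ] || [ "$READ_OK" -ne 0 ]; then
    echo "$(date -u) node looks unresponsive, restarting" >> /var/log/storj-watchdog.log
    docker restart storagenode
fi

Run from cron every few minutes, something like this would catch the “process is up but does nothing” state described above.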

As it stands, I can’t vote for this suggestion, because I don’t think it is the right solution. But I definitely recognize the underlying problem and feel like that needs to be solved.

1 Like

I think this is not relevant to the suggested idea. It should not matter what tool a node operator is using for monitoring if a node is failing audits. Besides, if a problem like pietro’s happens in the middle of the night, no monitoring short of a human being on call 24/7 would suffice anyway.

Sure! After all, we’re using mostly unstandardized consumer hardware for that, with each device potentially having its own set of unique failure modes. I believe this should have been expected from the start, and I’m happy to see that Storj understands the issue.

Would be fine to me.

If you are already allowing nodes to be offline for many days, maybe the same amount of time should be given to node operators in case of failed audits? “Minutes” or “hours” sound strange in this context.

From my point of view, the satellite itself is already a kind of a monitoring solution for the nodes. It might just need a bit of tuning, so that node operators are able to respond correctly to faults observed by satellites.
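
Some of what the satellites observe does already reach the node today: the dashboard API exposes per-satellite audit, suspension, and online scores. A minimal polling sketch, assuming the default dashboard port 14002 and noting that the exact endpoint path and JSON field names may vary between storagenode versions:

curl -s http://localhost:14002/api/sno/satellites | jq .

Feeding that into cron and alerting whenever a score drops would at least surface the kind of audit failures pietro hit, provided the node is still healthy enough to serve its own dashboard and fetch fresh scores.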

Yes, but the satellite has incomplete information. All it knows is that the node is not responding to audits on time; it doesn’t know why that is, and it can’t know why that is. If the satellite is lenient in those scenarios, that will necessarily create an opening for node operators with bad intentions to abuse. It also doesn’t have the power to take a node offline. Additionally, if it were to suspend the node, when should it accept it back again? After you’ve fixed the problem, you somehow need to signal to the satellite that everything is back to normal, and the satellite needs to verify that, as it can’t just take you at your word. This all gets really complex.

On the other hand, the node itself can look at how it is performing. It can monitor response times and implement timeouts for specific operations. It can then also decide to terminate itself when there are issues. You are then free to fix the issue and start the node again when it is solved, with no need to wait for the satellite to determine that your node is OK again. So the node is working with complete information and can act on that information, without making the satellite any more open to abuse.

And after all, this is still an issue with the node. The node should take care of that issue, not the satellite.

1 Like

The way I see it, the node needs an external observer anyway, and whether it’s a custom setup or a simple but standardized solution is a secondary matter. What’s more, the satellite is the only entity capable of verifying the correctness of audits, which makes it the sole source of important information. I do not deny the need for monitoring at the node’s level, nor perhaps some custom monitoring infrastructure that any node operator can set up in parallel. I just believe that satellites are in a good position to become part of the solution. As for the questions you listed, they are indeed important, but they might simply require finding a good trade-off.

1 Like

Depending on your firewall setup, this could already be the problem. An open port check can indicate, but does not prove, that the node is fully operational. Any sort of “open port” or ping test is not 100% safe to trust.

For exactly the above reason it is relevant. If you ping google.com, it does not tell you whether the site actually works. In the linked thread the claim was that the node was fully operational, which was almost certainly not the case.

My first reply had a similar intention to what @BrightSilence said, just in other words.
In many cases the satellite cannot tell why your node is not responding, which makes it hard to decide whether the node:
a) is just offline and simply not answering the audit,
b) is online but has technical problems with the hardware,
c) is online and maliciously trying to abuse/game the system, or
d) is online and innocent, and the issue is somewhere along the way.

Putting nodes into suspension mode for several days also has the caveat that they would still get paid for the data stored, etc. If a node is malicious and deliberately not responding to audits, and just gets put into suspension for several days, that opens up more and more possibilities to claim money for data that might not be there any more.
Of course, I do understand that this puts a heavy burden on SNOs to monitor their nodes very closely, but at the same time I do think that a reasonably old node pays quite well for a little bit of monitoring.

Which brings me to a much more powerful suggestion/idea. Rather than making the satellite less strict, we should improve the way one can monitor one’s nodes using full-stack tests (i.e. uploading a couple of KB, downloading it shortly after, and deleting it), as sketched below. Depending on how far one wants to take it, the MND could possibly play a role here and do parts of this?
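
As a very rough sketch of what such a full-stack probe could look like from the operator side, assuming a configured uplink CLI and an existing bucket named “healthcheck” (and noting that a probe through the public network exercises the whole upload/download path, but does not necessarily land a piece on your own node):

#!/bin/sh
# Upload a small random object, download it again, compare, clean up.
set -e
PROBE=/tmp/storj-probe-$$
head -c 4096 /dev/urandom > "$PROBE"
uplink cp "$PROBE" sj://healthcheck/probe
uplink cp sj://healthcheck/probe "$PROBE.down"
cmp -s "$PROBE" "$PROBE.down" || { echo "full-stack probe mismatch"; exit 1; }
uplink rm sj://healthcheck/probe
rm -f "$PROBE" "$PROBE.down"
echo "full-stack probe OK"

Targeting a specific node would need support on the satellite or MND side, which is exactly where this idea could go.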

3 Likes

Now that would certainly get my vote. Though we should keep in mind that many SNOs would run this on the same system, so if system responsiveness becomes an issue, that may also impact the full stack monitoring solution. Food for thought, but I like this direction a lot more.

Please take no offence, but to me this looks like you’re trying to debug pietro’s original issue rather than responding to the idea stated; you even say “in the linked thread”. This thread is about a more general idea. While this general idea has its origin in pietro’s issue, for clarity of discussion it would be best to keep the two threads separate.

I am trying to explain why the proposal here is not a good approach to handle node audit failures. :slight_smile:

I highlighted his assumptions to make it more obvious where I would like to start improving/working on this.

@BrightSilence, I’m not talking about bad nodes; they deserve disqualification. I’m talking about virtuous node operators who happen to have a small problem once every two years which could easily be resolved with a simple reboot. They do not deserve permanent disqualification, just as straight-A students do not deserve to lose a year over a single F in an entire school year.

3 Likes

Fully agreeing with this! However, this proposal has the drawback of also allowing “bad” nodes to make use of it. It’s hard to distinguish between them here. :slight_smile:

2 Likes

How do you tell the difference? How should the satellite know which operators are bad and need punishment and which don’t?

It makes a huge difference whether a bad node gets disqualified or suspended. The whole idea behind disqualification is to prevent bad nodes from even trying. If we suspended them first, that would mean they could get away with it. They could simply delete 1% of their hard drive every week until they get suspended, then wait for the system to unsuspend them, and keep it at that level for maximum profit. You see, we can’t implement something that would allow bad nodes to game us.

4 Likes

In a sense I agree with Bright, but I also think one could simply make a script for it… even if it comes with a bit more overhead.

So then, IMHO, the question really is whether Storj Labs wants to deal with the issue head on, or push it off until they are forced to deal with it.

Having nodes silently running such a system cannot be worse than what Storj Labs can conjure up.

I’m probably missing something here. How does suspending a node lead to bad actors taking advantage?

It’s not about bad actors, it’s about the network’s data integrity.

Yes, of course, provided you allow enough time to sort the issue out. Sending an email at 3 a.m. local time and expecting the issue to be fixed within an hour would be quite useless. :slightly_smiling_face:

That’s not what I wrote. My node HAD a problem, that is 100% certain, but I had no way of knowing that a problem existed. Storj, on the other hand, knew exactly that there was a problem. That’s the point: the lack of communication.

Do whatever you want; this idea suggestion is just meant to start the discussion. But please enhance the flow of information between satellites and SNOs, for the benefit of SNOs, Storj, and your clients. We all benefit from SNOs responding quickly to problems.

I understand the problem of malicious nodes, but to be honest, before you mentioned it I hadn’t even thought about the possibility that some SNOs could do that. Fooling the system is not my way of working, but I understand that you must prevent this kind of bad behaviour.

1 Like

Then, if anything, what you are achieving that way is, at most, explaining why the proposal is not the best way to handle pietro’s case. But then there’s also e.g. this thread, or this one. While the latter got a specific solution, the idea from this thread would improve the situation in all three of these cases with a single approach. And, as you already stated, there might be other failure modes still lurking out there.

That’s why I believe it does have merit to discuss this specific idea away from a single incident.