Put the node into an offline/suspended state when audits are failing

I know this is not the point of your post, but this wouldn’t happen. If a node sends bad data to a customer, it’s no big deal, as the uplink would know which pieces are bad because they carry invalid erasure codes. That’s one of the great things about Reed-Solomon encoding: you know which pieces are bad, and this is exactly what audits use to determine which nodes have returned bad data. As long as there are still 29 good pieces, the uplink will work just fine and retrieve the requested data. If there are fewer than 29 good pieces, it will know that pieces are bad and throw an error. It will never return bad data.
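
To make the 29-pieces point concrete, here is a minimal sketch using the klauspost/reedsolomon Go library (not Storj’s actual erasure-coding code, and the real network also relies on signed piece hashes to pin down bad pieces): with a 29/80 scheme, a download with fewer than 29 usable pieces fails with an error instead of quietly producing wrong data.

```go
package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	// Storj-style parameters: 29 data pieces, 80 pieces in total.
	enc, err := reedsolomon.New(29, 51)
	if err != nil {
		log.Fatal(err)
	}

	data := bytes.Repeat([]byte("segment-data "), 1000)
	shards, err := enc.Split(data)
	if err != nil {
		log.Fatal(err)
	}
	if err := enc.Encode(shards); err != nil {
		log.Fatal(err)
	}

	// Simulate rejecting most pieces: only 28 good ones remain.
	for i := 28; i < len(shards); i++ {
		shards[i] = nil
	}

	// With fewer than 29 good pieces, reconstruction fails with an error
	// instead of silently returning bad data.
	if err := enc.Reconstruct(shards); err != nil {
		fmt.Println("reconstruction failed as expected:", err)
	}
}
```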

Let me just lead with: SNO - Storage Node Operator
So yes, please don’t add logic to my code, my brain is running just fine atm. :wink:

I think the point is that the node would have to have a way to do a full stack test. No external monitoring software is going to be able to do that if the node only accepts uploads and downloads signed by one of the known satellites. So as far as I can tell, changes are required. I’m suggesting we go one step further and also let the node trigger that test itself, exposing an endpoint that any monitoring software can watch.
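
To illustrate what I mean, here is a hypothetical sketch of such an endpoint. The path, port, and self-test hook are all made up; nothing like this exists in the current storagenode API.

```go
package main

import (
	"net/http"
	"time"
)

// runFullStackSelfTest is a hypothetical hook: the idea is that the node
// itself pushes a small test piece through its own upload and download
// path, something external tools cannot do today because they cannot sign
// requests the way a satellite can. Stubbed out here.
func runFullStackSelfTest() error {
	// ... trigger an internal upload/download round trip ...
	return nil
}

func main() {
	mux := http.NewServeMux()

	// Hypothetical endpoint that monitoring software (Prometheus, a cron
	// job, an uptime checker, ...) could poll.
	mux.HandleFunc("/mon/selftest", func(w http.ResponseWriter, r *http.Request) {
		if err := runFullStackSelfTest(); err != nil {
			http.Error(w, "self-test failed: "+err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("ok"))
	})

	srv := &http.Server{Addr: ":14003", Handler: mux, ReadTimeout: 10 * time.Second}
	srv.ListenAndServe()
}
```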

You’ve seen the concerns raised about that direction and don’t really provide any response or solution, so I don’t have anything to add. It’s still a bad idea.

So, without quoting specific parts, as that would make this post way too long: your suggestion would have the node suspended and excluded after only 2 failed audits with the current scoring system. But by far the biggest problem is that, under your suggestion, failing audits no longer leads to permanent loss of the node. If someone trying to cheat the system can recover, they will try to find the line. It doesn’t matter if you put them in vetting again. In fact, I’d welcome it, so I can make sure one of my nodes is always in vetting to get data from both the vetting and the trusted node selection cycles. I could use that system to come out ahead. It also requires significant changes to satellite code, while the satellite doesn’t have any information about what kind of failure the node is experiencing. It’s a blunt-force weapon swung blindly.
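
For reference, the satellites’ audit score is a beta-reputation style score of the form alpha / (alpha + beta), with alpha and beta decayed by a forgetting factor on every audit. The sketch below only shows the mechanics; the lambda, weight, and threshold values are my assumptions for illustration, not the production settings, and how fast a node gets suspended depends entirely on them. With the numbers I picked here, two consecutive failures are already enough to cross the threshold, which is the kind of hair trigger I mean.

```go
package main

import "fmt"

func main() {
	const (
		lambda    = 0.95 // assumed forgetting factor
		w         = 1.0  // assumed weight per audit
		threshold = 0.95 // assumed suspension threshold
	)
	// Steady-state history of perfect audits: alpha = w / (1 - lambda).
	alpha, beta := 20.0, 0.0

	for i := 1; i <= 4; i++ {
		// A failed audit: all of the new weight goes to beta.
		alpha = lambda * alpha
		beta = lambda*beta + w
		score := alpha / (alpha + beta)
		fmt.Printf("after %d consecutive failures: score = %.4f (below threshold: %v)\n",
			i, score, score < threshold)
	}
}
```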

This would have killed almost all nodes during the recent troubles with ap1 and us2, though. I wrote such a script to stop the node a long time ago, but never put it into effect, as it is really an all-or-nothing approach. I do find it interesting that you find this unreasonable, but don’t mention the same for running an HA setup. I’d say that’s probably the more unreasonable expectation of the two. :wink:
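
For what it’s worth, here is a Go sketch of roughly what that script did: poll the node dashboard API and stop the container when an audit score drops too far. The dashboard port and JSON field names are from memory, so check them against your own node’s /api/sno/satellites output, and the container is assumed to be named storagenode.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"os/exec"
)

// Assumed shape of the /api/sno/satellites response; verify before use.
type satellitesResponse struct {
	Audits []struct {
		SatelliteName string  `json:"satelliteName"`
		AuditScore    float64 `json:"auditScore"`
	} `json:"audits"`
}

func main() {
	const threshold = 0.95 // arbitrary cut-off, pick your own

	resp, err := http.Get("http://localhost:14002/api/sno/satellites")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var sats satellitesResponse
	if err := json.NewDecoder(resp.Body).Decode(&sats); err != nil {
		log.Fatal(err)
	}

	for _, a := range sats.Audits {
		if a.AuditScore < threshold {
			log.Printf("audit score %.4f on %s below %.2f, stopping node",
				a.AuditScore, a.SatelliteName, threshold)
			// All-or-nothing: this takes the whole node offline.
			if err := exec.Command("docker", "stop", "-t", "300", "storagenode").Run(); err != nil {
				log.Fatal(err)
			}
			return
		}
	}
}
```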

Right, so this is the core issue, and I think most of what we’ve seen recently has some common characteristics. It tends to be that the node becomes unresponsive and doesn’t even log anymore, but apparently does just enough to signal to the satellite that it is still online. I’m with @Alexey that this isn’t all that common, but at the same time it may be the most common issue that node operators face atm. We’re still dealing with nodes that run into a failure state, though, and almost certainly not one caused by the node software itself. So ideally we would get better ways to monitor that, so we can take better care of our nodes. But I don’t think it’s on Storj to fix the underlying issues here or be more lenient on the requirements.
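
Until something better exists, the best we can do is crude watchdogs along these lines. This is only a sketch: it assumes the default dashboard port and a docker container named storagenode, and a node that is hung in the way described above may well still answer this check, so treat it as a heuristic, not a full stack test.

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os/exec"
	"time"
)

func main() {
	// Give the dashboard API a few seconds to answer; a hard timeout is
	// the whole point, since a hung node may never respond.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	req, _ := http.NewRequestWithContext(ctx, http.MethodGet,
		"http://localhost:14002/api/sno", nil)

	resp, err := http.DefaultClient.Do(req)
	if err == nil {
		resp.Body.Close()
		if resp.StatusCode == http.StatusOK {
			return // looks healthy enough, do nothing
		}
	}

	log.Println("dashboard did not respond in time, restarting container:", err)
	if err := exec.Command("docker", "restart", "-t", "300", "storagenode").Run(); err != nil {
		log.Fatal(err)
	}
}
```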