Got Disqualified from saltlake

jammerdan · August 16, 2021, 8:59am

True for an external script.
What puzzles me is the fact that the node keeps presenting itself to the satellite as online and responsive while in reality it is not and cannot produce the requested results. So that would be the level where the checks would need to be implemented.

BrightSilence · August 16, 2021, 10:43am

I think all that means is that it’s still in theory listening to the port, so it looks open from the outside. It doesn’t mean the node is able to do anything else.

penfold · August 16, 2021, 12:15pm

Or that the number of SNOs likely to be impacted would be insignificant to the project as a whole. Remember, Storj recommends no redundancy, but it is you as an SNO who will need to wait months and months - maybe even years - to replace data held on a failed drive.

SGC · August 16, 2021, 1:20pm

i’ve been working on a combination of my docker export and the awk color script.
haven’t gotten it working yet; keeping count was giving me grief, because i want the counts to persist across reboots and the parameters to be configurable.

but i’m very bad at scripting in linux.

i use the docker logs command with a 10-minute time window, so it exports exactly from like 11:50 to 12:00 and then from 12:00 to 12:10 (to avoid overlap)

this is then appended to a log file.

the new part i’m adding goes between the docker logs command and the appending to log file, to avoid reloading the information multiple times.
the “fuse” part of the script is basically a simplified version of the awk color script, which grabs its count and configuration from an external file, so that a reboot or a broken node will not confuse it. if it cannot access the file, it will default to a preset count in the script.

i am also adding more stuff to it, but those parts will not be part of the fuse script because i want to keep it very lean since it will be running a lot and thus complex scripting is basically a waste of resources.

if anyone wants to spend the time and is good at scripting, i would appreciate the help.
been tinkering with it from time to time, but it was really more of a learning project for me, i barely need such a script with all my other hardware failsafes.

but can’t hurt, just never really put enough time into it to actually get it working 100% like i wanted it to.

oh yeah and because it runs locally it can just shut down the node upon issues.
assumed that’s a given but didn’t mention it… so
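A minimal sketch of the “fuse” idea described above, written in Python rather than awk purely for illustration. Everything named here is hypothetical: the state file layout, the `errors` and `limit` keys, the default values, and the `ERROR` match pattern are assumptions, not anything from the actual script. The log chunk would come from something like a `docker logs` call with a fixed time window; the sketch only counts matches, persists the count, and reports whether the fuse has tripped, leaving the actual shutdown to the caller.

```python
import json
from pathlib import Path

# Hypothetical preset used when the state file cannot be read.
DEFAULT_STATE = {"errors": 0, "limit": 5}


def load_state(path: Path) -> dict:
    """Read the count and configuration, falling back to the preset."""
    try:
        return {**DEFAULT_STATE, **json.loads(path.read_text())}
    except (OSError, ValueError, TypeError):
        return dict(DEFAULT_STATE)


def save_state(path: Path, state: dict) -> None:
    """Persist the count so it survives a reboot; ignore write errors."""
    try:
        path.write_text(json.dumps(state))
    except OSError:
        pass  # a broken disk must not crash the fuse itself


def fuse(log_chunk: str, state_path: Path, pattern: str = "ERROR") -> bool:
    """Add matching lines to the persistent count; True means 'tripped'."""
    state = load_state(state_path)
    state["errors"] += sum(1 for line in log_chunk.splitlines() if pattern in line)
    save_state(state_path, state)
    return state["errors"] >= state["limit"]
```

The caller would append the same chunk to the log file afterwards and stop the node when `fuse(...)` returns `True`, so each chunk is only read once.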

i am also planning for the higher functions of the script to connect to a remote system, to which it reports health checks by attempting to open URLs. that lets the reports pass through firewalls, and the URLs carry tags or codes for the information being forwarded. the remote system then has a deadman switch, which will trigger a node offline notification or some such action.

this would then feed into an OpenNMS system, which handles most of the advanced notification stuff, like who is on duty, how many notifications to send and when, and how important stuff is.
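The remote deadman switch could look roughly like the sketch below. It only shows the core logic: each incoming health-check request records a heartbeat, and any node that has been silent for longer than a timeout is reported as overdue. The class name, the node identifiers, and the 15-minute timeout in the usage note are all hypothetical, and the HTTP endpoint and the OpenNMS hand-off are left out entirely.

```python
import time


class DeadmanSwitch:
    """Tracks the last heartbeat per node and reports the silent ones."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock      # injectable so the logic can be tested
        self._last_seen = {}     # node id -> time of last heartbeat

    def heartbeat(self, node_id: str) -> None:
        """Call this whenever a health-check URL is hit for node_id."""
        self._last_seen[node_id] = self._clock()

    def overdue(self) -> list:
        """Return the nodes whose last heartbeat is older than the timeout."""
        now = self._clock()
        return sorted(
            node for node, seen in self._last_seen.items()
            if now - seen > self.timeout_s
        )
```

With `DeadmanSwitch(timeout_s=900)`, a periodic job on the remote system would call `overdue()` and raise a notification for every node it returns.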

just one of those little projects one sort of comes up with lol

Toyoo · August 16, 2021, 2:46pm

Precisely so! For the network as a whole, the fact that, let’s say, 1‰ of nodes are disqualified due to, let’s say, a failure mode never accounted for is nothing. It would only affect 10 nodes out of the current 10k, or 1k nodes out of 1M when Storj grows 2 orders of magnitude. So, nothing to worry about.

Alexey · August 20, 2021, 5:19am

Hi!
To complete the request - the node was disqualified due to timeouts, as expected.
I have an excerpt from the satellite logs for your node to confirm this.

We also found a bug - the piece was audited more than 3 times, and it should be fixed in the coming release. However, this does not change the result - your node would have been disqualified anyway unless you had rebooted it earlier.

The team also agrees that we should find a way to not disqualify nodes in such circumstances, for example by crashing the node instead. However, it requires research. Pull requests are welcome!

jammerdan · August 20, 2021, 5:30am

Like this?

Alexey · August 20, 2021, 5:43am

Very likely. It’s triggered when the node is placed into containment mode (exactly because of timeouts) and is then audited for the same piece several times.
storj/docs/blueprints/audit-containment.md at 2782e000acb023dcd5390b3bad19899cb651b481 · storj/storj · GitHub.

Since we have more than one worker, we introduced a bug.

jammerdan · August 20, 2021, 6:15am

Under these circumstances, where

older and usually reliable nodes have been affected,
you admit there is a bug in your code, and
Storj agrees that these are circumstances a node should not be getting disqualified for,

I really believe Storj Labs should un-disqualify the nodes in question.
That would be a fair move.

Alexey · August 20, 2021, 6:25am

It will not be reset.
The bug doesn’t affect the reason for disqualification - the node was not able to provide 1 kB of data within 5 minutes more than three times. This is proof of unreliability.
There is no evidence to reinstate.
Although this is being taken into consideration, there is no code and no solution so far. Audit timeouts are too easy to abuse; our engineer showed how. It will be exploited.

BrightSilence · August 20, 2021, 10:17am

I do think this should count for something. It’s much less likely that a node over 2 years old is suddenly going to be used to abuse the network. Not impossible of course, but some extra lenience may be warranted at some point. As it stands, there is kind of an inverse incentive where long-standing loyal nodes with more data also get more audits and as a result get disqualified more quickly than smaller nodes when temporary issues of non-responsiveness happen. That just doesn’t seem right and should be remedied.

I think it’s a little convenient to omit the part that that bug wouldn’t have changed the outcome in your scenario though.

I think there is a nuance of difference here. The node shouldn’t be disqualified if this was an unintentional temporary issue, and measures can be implemented in the node software to crash the node to prevent that. At the satellite end though, it is still impossible to differentiate between that and someone trying to fool the satellite by forcing timeouts, so as long as there isn’t a change implemented in the node system, the satellite will still have to disqualify nodes. It sucks, but it is what it is. I’m just happy to hear the team is taking this issue seriously and agrees that these issues shouldn’t cause disqualification. Let’s hope they come up with something good to prevent that in the future.

Stob · August 20, 2021, 10:54am

I’ve been keeping up with this topic as my node had an almost identical issue a few months ago, which luckily I did catch. Randomly my hardware raid controller froze, causing access to storage to go very slowly. The whole computer went very slow, the storagenode was technically still accepting requests but was not responding to anything as far as I could tell. I ended up rebooting the computer which “fixed” the issue.

From an outside perspective, if the node isn’t responding I don’t think the node could crash itself, but on Windows maybe this would be a suitable task for the storagenode-updater service. It would be helpful to have more information about a failing node, to understand what factors could be used to determine if it should be crashed/shut down. Perhaps a scheduled task which checks for a response from the node API? Or a log scan looking for ‘Downloaded’ in recent entries?
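The log-scan variant could be sketched as follows. This assumes each log line starts with an ISO-8601 timestamp followed by whitespace, which is an assumption about the log format rather than something shown in this thread; the 10-minute window and the case-insensitive match on “downloaded” are likewise illustrative choices.

```python
from datetime import datetime, timedelta, timezone


def recently_active(lines, now, window=timedelta(minutes=10), marker="downloaded"):
    """Return True if any line inside the window contains the marker.

    Assumes 'TIMESTAMP rest-of-line' with an ISO-8601 timestamp;
    lines that do not parse are skipped rather than treated as errors.
    """
    cutoff = now - window
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) < 2:
            continue
        try:
            when = datetime.fromisoformat(parts[0].replace("Z", "+00:00"))
        except ValueError:
            continue
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)  # assume UTC if unmarked
        if when >= cutoff and marker in parts[1].lower():
            return True
    return False
```

A scheduled task could call this on the tail of the log with `now=datetime.now(timezone.utc)` and stop the node when it returns `False`. A quiet period with no download requests at all would look the same as a hung node, so the window would need to be chosen with that in mind.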

Toyoo · August 20, 2021, 10:54am

If so, setting up monitoring is also useless. Even if this kind of issue is transient, you want these nodes to be disqualified, because monitoring is not an alternative unless it is capable of waking up the node operator at midnight.

BrightSilence · August 20, 2021, 11:57am

It’s really not helpful to make such statements; it’s clearly not true. Let’s try and keep it constructive. It’s a tough problem to tackle, it needs a fix and it’s being looked into, but the danger of abuse is very real. V2 had a lot of such issues, which made it quite a bad experience for honest node operators like us. Nobody wants that either.

Toyoo · August 20, 2021, 1:11pm

Well, that’s how I interpret:

Alexey · August 20, 2021, 4:52pm

Could you please formulate it in a more constructive way, not a destructive one as I did?
I want to learn how to deliver bad news better.
The key information: the node failed audits and was thus disqualified.
I did not want to create the impression that “I want to disqualify”.

Alexey · August 20, 2021, 5:01pm

By the way this bug (with repeated audits of the same piece, when the node was in containment mode) is now fixed: satellite/audit: fix containment bug where nodes not removed · storj/storj@5a1a29a · GitHub
It should be included in the next release.

Toyoo · August 20, 2021, 5:22pm

That depends on what you wanted your statement to mean. I assume here that you don’t mean that exactly four 1 kB pieces, each not received within a 5-minute period, will lead to node disqualification; rather, there are also other factors involved in the decision on disqualification, but you don’t want to discuss them right now. So all you need is to make the statement less definite by adding some wording that at least suggests the existence of these other factors.

«proof», «reason» are strong words, they imply certainty. «symptom» or «sign» imply an indication of something happening, but without the nuance that they’re the only factors taken into account.

So, maybe something like: «Storj considers failing to provide a 1 kB chunk of data within 5 minutes a sign of the node being currently unreliable and hence a candidate for disqualification in the future.»

But if indeed you wanted to state that four failed audits should lead to disqualification, then you expressed that pretty much to the point.

Alexey · August 20, 2021, 5:31pm

Mathematical algorithms show that failure to provide a 1K chunk within 5 minutes in a repetitive manner indicates that the node is becoming unreliable and will be disqualified as a result of continued failures.

Storj engineers think that in some cases (“old node, which suddenly started to fail audits because of timeouts”) failing audits because of timeouts is not an attempt to abuse the audit system, and they want to avoid quick disqualification of such nodes without affecting the reliability of the network.
We have not yet found a way to achieve that.

BrightSilence · August 20, 2021, 6:01pm

Thanks for this message, I really appreciate the switch in tone!

I think I can give some clarification on this part. @Alexey linked the containment blueprint in an earlier message, but that’s a lot to read. However, this is what it comes down to. If your node is unable to respond to an audit within 5 minutes, it isn’t immediately seen as a failed audit. Instead whenever the next audit happens for your node, it will request the same stripe again. This ensures you can’t use a timeout to get out of having to verify you still have that piece. You again get 5 minutes to respond. I believe you get a total of 3 retries, but I’m not entirely sure of that number. If you time out on all tries, then it will count as a single audit failure. From that point on audits will request a new stripe again. This wouldn’t cause a disqualification, just a single audit failure. You’d need quite a few more to actually be disqualified. Problem is that nodes with lots of data get quite frequent audits. So even if it takes 4 timed out audits to count as a single failure, you can still rack up those failures pretty fast.
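To make the counting above concrete, here is a small simulation of the behaviour as described in this post, not of the actual satellite code. It treats each audit as either answered in time or timed out, and assumes 3 retries of the same stripe (the number the post itself is unsure about), so four consecutive timeouts add up to one audit failure. The function name and the `max_retries` parameter are made up for the example.

```python
def count_audit_failures(responses, max_retries=3):
    """Count audit failures from a sequence of audit outcomes.

    responses: iterable of booleans, True = answered within the time limit.
    A timeout puts the node in containment; the same stripe is retried up
    to max_retries times before the timeouts count as one failure.
    """
    failures = 0
    timeouts_on_stripe = 0
    for answered in responses:
        if answered:
            timeouts_on_stripe = 0       # node leaves containment
            continue
        timeouts_on_stripe += 1
        if timeouts_on_stripe > max_retries:
            failures += 1                # initial try plus all retries timed out
            timeouts_on_stripe = 0       # next audit requests a new stripe
    return failures
```

Under these assumptions a node that is unresponsive for eight consecutive audits collects two failures, which shows how a heavily audited node can accumulate them quickly.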

I previously posted a suggestion to tune audit scoring here: Tuning audit scoring
I think this suggestion could be further tuned by raising the lambda value based on the amount of data stored. The more data you have, the more the memory of previous successful audits weighs into the formula. This would essentially remedy the inverse relation between being a long-time loyal node and having a short time to disqualification. These nodes also have much more on the line, so messing with the system on such nodes is just a very bad idea to begin with. If you’re going to try and cheat the system you wouldn’t do that with a node with so much income. Just a suggestion, but I think this would help a lot with this problem without exposing the satellites to more risk.
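The effect of lambda can be illustrated with a small sketch. The formula is not given in this thread, so this assumes a beta-reputation style score with a forgetting factor: `alpha = lam * alpha + w` on a success, `beta = lam * beta + w` on a failure, and `score = alpha / (alpha + beta)`. The disqualification threshold of 0.6, the weight of 1, and the lambda values in the usage note are illustrative assumptions, not confirmed satellite settings.

```python
def failures_until_dq(lam, threshold=0.6, w=1.0):
    """Consecutive audit failures needed to push the score below threshold.

    Starts from a node with a long history of successes, i.e. alpha at its
    steady-state value w / (1 - lam) and beta at 0 (score = 1.0).
    """
    alpha = w / (1.0 - lam)
    beta = 0.0
    failures = 0
    while alpha / (alpha + beta) >= threshold:
        alpha = lam * alpha          # old successes fade by the factor lam
        beta = lam * beta + w        # the new failure is added at full weight
        failures += 1
    return failures
```

Under these assumptions, `failures_until_dq(0.95)` gives 10 and `failures_until_dq(0.99)` gives 51, so raising lambda for nodes with more data would directly lengthen the run of failures they can absorb before disqualification.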