What about exposing a health check API? When called externally, the node would run something like a pre-flight checklist to determine whether everything is fine and the node is fully operational, e.g. by also requesting a control audit from satellites, obviously with a rate limit.
It would be up to Storj to implement this API and make sure that if the API returns “OK” it really means it’s all OK, and it would be up to the SNOs to call and monitor the API endpoint periodically (e.g. every hour or whatever interval Storj decides) and take the appropriate actions.
I know what I can do to avoid that. I would take a look at the storage node code and figure out why the storage node is not detecting the issue and going into a crash loop as intended. I have even gone so far as to plan to write a few more unit tests that just throw different errors into the corresponding function. If I can prove with a unit test that the storage node ignores certain errors, it should be easy to add a few more commits to make that unit test pass.
Do I need to write that unit test and the fix myself? I don’t think so. Everyone with a bit of development skill is welcome to do this. I am happy to point you to the corresponding code lines and also lure in a few people from Storj to provide additional information. I would love to enable the community to work on smaller issues themselves.
I really doubt you would want my “Go” code in your nice software. While I may be able to hack something together for myself (like I did with v2 at one point), I don’t think I should subject other people to my “mods”.
My problems with disqualification are mostly about the time it takes for the node to get disqualified (or, in other words, the required reaction time from the operator) - it should not be too short so that people who do not have employees would have enough time to fix a problem that did not result in lost data (obviously if the files were deleted or the drive died in a setup without RAID there’s nothing an operator can do, but USB cable falling out or the OS freezing does not mean the data is lost).
It sure isn’t, because that rocket would not get off the ground. You’re making lots of assumptions about setup, and since the recommended setup is to use what you have and what would already be online, the cost of operation should be close to 0. In that case, statements like this don’t make any sense. Either provide an example setup or stay away from generalized statements altogether. It’s not helpful.
I ask around constantly as this information is important to keep the earnings estimator as close to reality as possible. I’ve never heard of anything over about half a year, and even then that’s only for the last satellite. And the only reason for that is that that satellite is basically inactive anyway, so it would have pretty much no impact on your node’s income. You also again provide no actual numbers and skip describing the setup. Maybe you have several nodes in vetting so they share the same vetting ingress. I can’t incorporate such reports in earnings estimator feedback because they are not reliable. Luckily there is enough reliable info out there that I can still come close to a realistic estimation.
Well, to be fair, if this becomes a serious enough issue I would expect Storj Labs to pick it up. It would be awesome of course if someone well versed enough in Go would be able to do this, but for serious problems you can’t rely on that. But we can’t dictate your backlog and priorities. All we can do is point out a problem that has popped up a little more frequently than we’d like to see on the forums. I’m personally still not certain if this should be picked up with priority, but I’m monitoring it as it seems to have recently happened a few too many times, including to SNOs who have been reliable for a long time. I at least think that the devs should be aware of it and there should be a spike on the backlog to investigate. But at this point I can imagine it’s not yet a priority to pick up now.
I think it might be good to implement a function that stops a node automatically when it shows errors (failed uploads), in order to avoid the node being suspended on the satellites.
This is not the first time this has happened to me. The day before yesterday I was copying files to the hard drive where my Storj node is, which increased the “IO Delay” significantly. The node did not report problems while the files were copying overnight, but the next morning I saw in my node panel that I was suspended on satellites (probably because serving files took so long that uploads failed).
The ideal thing would be to implement a function that stops the node when it detects that, for example, 10% of requests have failed, so that the operator can see what is happening before ending up completely suspended on the satellites.
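As a rough illustration of that idea, here is a minimal Go sketch of a failure-rate monitor over a sliding window of recent requests. The `FailureMonitor` type, the window size of 100 and the 10% threshold are all hypothetical choices for this example, not anything the storage node currently implements.

```go
package main

import "fmt"

// FailureMonitor remembers the outcome of the most recent requests in a
// ring buffer and reports when the failure rate reaches a threshold.
type FailureMonitor struct {
	window           []bool // true means the request failed
	next             int    // index that the next outcome overwrites
	count            int    // number of outcomes stored so far
	failures         int    // number of failed outcomes in the window
	thresholdPercent int
}

func NewFailureMonitor(size, thresholdPercent int) *FailureMonitor {
	return &FailureMonitor{window: make([]bool, size), thresholdPercent: thresholdPercent}
}

// Record stores one outcome and returns true when the node should stop.
// It only triggers once the window is full, so a single early failure
// does not shut the node down.
func (m *FailureMonitor) Record(failed bool) bool {
	if m.count == len(m.window) {
		if m.window[m.next] {
			m.failures-- // the oldest outcome drops out of the window
		}
	} else {
		m.count++
	}
	m.window[m.next] = failed
	if failed {
		m.failures++
	}
	m.next = (m.next + 1) % len(m.window)
	return m.count == len(m.window) && m.failures*100 >= m.thresholdPercent*m.count
}

func main() {
	m := NewFailureMonitor(100, 10)
	for i := 0; i < 200; i++ {
		failed := i >= 150 // simulated: every request fails from the 151st on
		if m.Record(failed) {
			fmt.Printf("stop node after request %d\n", i+1)
			return
		}
	}
}
```

In this simulation the monitor trips on the tenth consecutive failure, since that is when 10 of the last 100 requests have failed. A real implementation would also have to decide which kinds of failures count, because uploads lost to faster nodes are normal and should not stop the node.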
I must say that since I was suspended I have not yet received the typical email from Storj that alerts you that you have been suspended on satellite X. I think that is another point to improve: these emails should be given priority in email delivery so that they go out to operators first.
It feels to me like what you’re asking for is exactly what suspension is.
The real issue is not suspension, but disqualification. To avoid disqualifying nodes that behave in a wrong/unknown way (but still respond correctly to audits), they get suspended first so SNOs have time to investigate what’s going wrong.
You’re suggesting to stop nodes, and that’s kinda what happens when they’re suspended: they’re paused (no ingress) until they get fixed (or disqualified for good if they do not get fixed in time).
[EDIT: The above is confusing, see @BrightSilence’s added details below for a clearer explanation]
Is that not covering what you have in mind?
I agree, notification e-mails could use some improvements.
My understanding is that this topic is about avoiding (very fast) disqualification that strikes without any suspension time window, so maybe we could suspend nodes that are about to get disqualified, but there are debates about that idea.
I’m not sure what you’re getting at here. But there is no order of first being suspended and then disqualified. It just depends on the type of errors. If you fail an audit with a missing file, by responding with corrupt data, or with repeated timeouts, it will hit your audit score, and if that drops low enough, you get disqualified.
If there is a different kind of error that isn’t one of those really bad ones, it hits your suspension score and your node gets suspended if it drops low enough and you can recover.
If you’re offline the online score drops and you get suspended. You can recover if you’re not offline for 30 consecutive days.
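The three cases above can be summarized in a small Go sketch. The type and constant names are made up for illustration and do not correspond to identifiers in the satellite code; the mapping itself only restates what is described above.

```go
package main

import "fmt"

// AuditOutcome is a hypothetical classification of what a satellite
// observes when it audits a node.
type AuditOutcome int

const (
	Success AuditOutcome = iota
	MissingPiece
	CorruptData
	RepeatedTimeout
	UnknownError
	Offline
)

// Effect names the score an outcome hits and what happens when that
// score drops low enough.
type Effect struct {
	Score       string
	Consequence string
	Recoverable bool
}

// EffectOf maps an outcome to its effect. There is no ordering between
// suspension and disqualification; the type of error alone decides.
func EffectOf(o AuditOutcome) Effect {
	switch o {
	case MissingPiece, CorruptData, RepeatedTimeout:
		return Effect{"audit", "disqualification", false}
	case UnknownError:
		return Effect{"suspension", "suspension", true}
	case Offline:
		// Recoverable unless the node stays offline for 30 consecutive days.
		return Effect{"online", "suspension", true}
	default:
		return Effect{"none", "none", true}
	}
}

func main() {
	for _, o := range []AuditOutcome{MissingPiece, UnknownError, Offline} {
		fmt.Printf("%+v\n", EffectOf(o))
	}
}
```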
Which is very well possible, as I was confused by its self-contradictory nature. Nodes that behave in a wrong/unknown way are detected through incorrect audit (or repair) responses. And the “they get suspended first” suggested an order of sorts.
After your response I’m guessing your intention was to say that nodes that don’t have critical audit failures, yet fail with unknown errors get suspended. And the “first” was referring to that eventually if the node never recovers it gets disqualified as well.
I agree on that part, I’ve suggested something similar elsewhere. The idea was to very quickly suspend the node to protect data and start repair and then be slightly more lenient with permanent DQ by allowing the node more chances to respond to the same audit. I’ll try to find the link.
Edit: Got some of the details wrong, it’s been a while, but here it is. Tuning audit scoring - #52 by BrightSilence
Please note that the suggestion is in context of the topic which had suggestions to stabilize the scores as well. The numbers in this post assume those suggestions are picked up in addition to the suggested suspension/disqualification change.
There should not be a reaction time requirement of a few hours. The node operator should be able to leave the node completely unattended for at least two days and still be able to recover it if the failure is not permanent (dead hard drive etc).
7 days at least. Better 30 days. I mean people have a life, they have family. They go on business trips or leisure vacations. Also they get sick, have emergencies or technical failures like internet/power outages or whatever…
I agree. This is a bit of a balancing act of course, because you want to protect data asap, but don’t want to immediately impose irreversible penalties. This is why the suggestion I linked proposed to very quickly suspend to protect data and start repair, but allow for up to a month to resolve the issue and recover from that suspension. While suspended you will of course miss out on ingress and lose data to repair, but this isn’t nearly as bad as losing the entire node and serves as an incentive to fix things as fast as possible. That seems totally fair to me.
As for the data protection side, that suggestion also ensures that if you did lose data or mess something up in an irreversible way, you will never be able to escape that suspension and eventually be disqualified. For the data protection side of things it doesn’t matter much if you’re disqualified or suspended as the data protection measures are the same for nodes in those states.
So this is very possible to do in my opinion. We had some good back and forth in that topic, but it seems to have been deprioritized. I get that, it’s not really necessary atm for (better) functioning of the product or protection of the data. But I still hope something like it will at some point be implemented.
Exactly. The issue is about not losing a valuable node entirely because of some recoverable mistake. Maybe the duration can depend on node age. For every month of node age you get 1 additional day of recovery duration. Something like this…
This seems fair, as long as it is possible to not have the node disqualified. With the rules as they are right now, if I wanted to go on vacation for a few days where I would have bad internet connection or no internet at all, I would have to rent a space in a datacenter for my node and have someone else manage it. This is true even though I have two internet connections and a large UPS (and soon to have a generator), because if the node manages to freeze in a weird way (IO subsystem freezes or something), it can get disqualified in a few hours, probably faster than I would be able to connect and reboot or shutdown the node.
I think some of this comes down to it being more about notification than anything else. A node should probably be aware of when it is experiencing severe error conditions and notify via email of those conditions when possible. Ideally this should be baked in to the node software itself.
In addition, if a node is failing audits, after some threshold an email should be shot off that informs the node operator of this threshold being reached, so they can investigate timely.
I think as capacities of nodes increase, Storj will have a greater interest in helping to keep nodes’ data availability high so as to reduce toil on the network in rebuilding missing data. In the short term though, the efforts continue to focus on improving the overall network features, performance, and stability, and so I think some of this polish will come at a later time.
Better notifications will certainly help, but they are not a full solution. On larger nodes audits happen quite frequently and that is more the case now that repair is also used for audits. Fail 10 consecutive audits and that node is gone forever. This can happen easily before you have time to respond to a notification.
I’m gonna disagree with you here. It’s very well possible that the node software itself is hanging and not responding correctly. It will then also not be able to notify the operator, which pretty much defeats the purpose. The best option is to have something external to the node’s network that can check that the node is externally accessible and able to respond to requests for pieces. The satellite could do this, but perhaps an externally hosted audit service could also do the trick. Maybe even something built into the multinode dashboard, which can already be hosted remotely, in the cloud for example.
The largest nodes already store 22TB+. And at the moment they are actually at the highest risk of super quick disqualification, because the number of audits and repair both scale with data stored. That’s kind of the opposite of what you want. The nodes which would hurt the most for the network to lose are also the ones most likely to fail this way. And that’s not good for the node operator, nor for the network as a whole.
That’s totally understandable. And I think there have been great improvements from that end already. It’s clear that the teams are doing great work. As long as this part isn’t forgotten completely. I think it will have to be tackled at some point.
A bit of a side-topic, but there is something I still don’t get: why does data that got repaired while a node was offline get removed from the “faulty” node when it comes back online? Leaving the data on that node would mean extra pieces for the network, which in turn means less repair in the future, statistically. It feels to me like removing those pieces from the node that was offline for a bit is like shooting ourselves in the foot…