Put the node into an offline/suspended state when audits are failing

I think that Storj sometimes forgets that node operators are regular people, some of them even running their nodes on the “recommended hardware” (a Raspberry Pi with a single HDD connected over USB) and not companies operating servers in a Tier 3 datacenter. This disconnect between recommendations and requirements is something I noticed a long time ago and have commented on in the past.

The biggest difference between the two is not the hardware and not the reliability (as seen, even an RPi can be pretty reliable), but the response time, especially since the node software does not support running in a cluster or using HA.

If I rent a server in a proper datacenter, I can pay additional money for 24/7 support. They will have an employee nearby in case of an emergency, and problems will be solved in an hour at most (or whatever the contract says). It is, IMO, very unreasonable to ask the same of a regular person, who does not have employees, probably spends a lot of time every day away from the servers (at work), and even goes on vacation. In case of any problem with the node, the operator should be given sufficient time to fix the problem before the node is permanently disqualified.

So far, of the failure modes, some are handled, IMO, correctly:

  1. An offline node is allowed sufficient time to return (unlike the previous, unreasonable requirement of no more than 5 hours of downtime).
  2. In case of “unknown” audits, the node is suspended and, it looks like, allowed enough time to fix whatever the problem is.
  3. In case the hard drive disappears (USB connector fell out or something), the node shuts down and refuses to start.

So, now we have regular audits. As I understand it, they can fail in three different ways - “file not found”, corrupted data, and a timeout. None of these, especially the timeout, definitively proves that the node has lost data. It is very likely that “file not found” means the data is lost, but it could also be that the USB cable fell out, or the file was deleted by the network 0.1s ago, or the file had an expiration date. Corrupted data is most likely because the hard drive is failing, but it could also be due to some other reason.

While the previous two have a high chance of being caused by lost data, a timeout does not mean that. The node could simply be overloaded. Yes, it is a problem and the operator should fix it, but not in 4 hours.
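
To make that distinction concrete, here is a rough sketch (in Go, purely illustrative - the names are made up and this is not actual storagenode code) of what the node itself can and cannot classify when a piece read fails:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os"
)

// auditOutcome is a hypothetical classification of what the node itself can tell
// about a failed piece read. Only "file not found" is reliably detectable locally;
// silent corruption can usually only be judged by the satellite.
type auditOutcome int

const (
	outcomeOK auditOutcome = iota
	outcomeFileNotFound
	outcomeTimeout
	outcomeUnknown
)

func classifyReadError(err error) auditOutcome {
	switch {
	case err == nil:
		return outcomeOK
	case errors.Is(err, os.ErrNotExist):
		// Missing piece file: data lost, disk unmounted, or already deleted/expired.
		return outcomeFileNotFound
	case errors.Is(err, context.DeadlineExceeded):
		// Read took too long: overload, dying disk, stuck USB bus...
		return outcomeTimeout
	default:
		// Anything else (including corruption) the node cannot judge on its own.
		return outcomeUnknown
	}
}

func main() {
	_, err := os.Open("/storage/blobs/nonexistent-piece") // example path
	fmt.Println(classifyReadError(err) == outcomeFileNotFound) // true
}
```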

Here’s what I think should happen:

  1. The satellite issues an audit request to the node.
  2. Node tries to read the file and encounters an error it can detect (file not found) or an error it cannot detect (corrupted data or timeout).
  3. If the error is of a type that the node cannot detect, the satellite informs it that there was a problem.
  4. If two problems happen in a row, the node marks this in the database or creates some file to mark it and shuts down.
  5. The node refuses to start, unless the mark is (manually) removed - this is to prevent auto-restart scripts from just restarting the node.
  6. The node operator gets an email informing him of this problem (that his node is offline).

The node operator can then fix the problem and restart the node; hopefully there are no more problems. If there are, the node will fail more audits, shut down again and eventually be disqualified.
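
A minimal sketch of steps 4-5 (the marker file and the refusal to start) could look something like this - the path and the “two consecutive problems” threshold are just placeholders:

```go
package main

import (
	"fmt"
	"log"
	"os"
)

// markerPath is an invented location for the "do not restart me" flag from
// steps 4-5 above; a real node would keep this next to its database.
const markerPath = "/storage/storagenode.audit-failure"

// recordAuditProblem is called when the satellite reports a failed audit.
// After two consecutive problems the node writes the marker and exits, so
// that auto-restart scripts cannot simply bring it back up.
func recordAuditProblem(consecutive int) {
	if consecutive < 2 {
		return
	}
	if err := os.WriteFile(markerPath, []byte("failed audits - investigate before restarting\n"), 0o644); err != nil {
		log.Printf("could not write audit-failure marker: %v", err)
	}
	log.Fatal("two audit problems in a row, shutting down until the marker file is removed")
}

// refuseToStartIfMarked is called on startup and aborts while the marker exists.
func refuseToStartIfMarked() {
	if _, err := os.Stat(markerPath); err == nil {
		fmt.Fprintf(os.Stderr, "refusing to start: %s exists, remove it after fixing the problem\n", markerPath)
		os.Exit(1)
	}
}

func main() {
	refuseToStartIfMarked()
	// ... normal node startup would continue here ...
	recordAuditProblem(2) // example: second consecutive problem triggers shutdown
}
```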

6 Likes

I’ve given this a bit more thought, and considering the fact that most home operators would not have a separate system, I think it would be better to have some self-test features integrated into the node instead of MND. This would save single-node operators from having to install something separately as well. What I was thinking of would be to have a simple status page on a separate port. When loading that page, the node does a quick self test for uploading, reading and deleting a piece and then provides a result. Many node operators are already using Uptime Robot, which also supports keyword detection. So you could easily set up a monitor to request that page and look for the success result. This would have the added benefit that Uptime Robot also tracks response times, which would be an early indicator of things going in the wrong direction. It also saves Storj Labs from having to set up notification features and allows people to choose their own way to monitor that status page.
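
To illustrate, something like this is what I have in mind (a rough Go sketch; the port, the paths and the plain temp-file self test are all placeholders for a real piece upload/read/delete):

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
	"path/filepath"
	"time"
)

// selfTest writes, reads back and deletes a small test file on the storage
// volume. It stands in for a real upload/download/delete of a piece.
func selfTest(storageDir string) error {
	path := filepath.Join(storageDir, ".selftest")
	payload := []byte(time.Now().String())
	if err := os.WriteFile(path, payload, 0o644); err != nil {
		return fmt.Errorf("write: %w", err)
	}
	if _, err := os.ReadFile(path); err != nil {
		return fmt.Errorf("read: %w", err)
	}
	if err := os.Remove(path); err != nil {
		return fmt.Errorf("delete: %w", err)
	}
	return nil
}

func main() {
	storageDir := "/storage" // assumed data directory
	http.HandleFunc("/status", func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		if err := selfTest(storageDir); err != nil {
			http.Error(w, "FAIL: "+err.Error(), http.StatusInternalServerError)
			return
		}
		// Keyword monitoring can look for the word "SUCCESS"; the monitor's
		// response-time graph then doubles as an early-warning signal.
		fmt.Fprintf(w, "SUCCESS in %s\n", time.Since(start))
	})
	// 14003 is an arbitrary choice for a separate monitoring port.
	log.Fatal(http.ListenAndServe(":14003", nil))
}
```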

This is a little beside the point, but I heard it mentioned that MND would eventually replace the dashboard as it exists now. If that happens I would also suggest integrating MND in the node software (installer or docker container) to avoid having to install separate things.

It allows them to try and find where the line is without permanent consequences. Pushing it further and further until suspended and then backing off slightly. Remember that the satellite only knows it’s not getting responses to audits, it doesn’t know if that’s because someone is trying to cheat the system or having intermittent issues.

1 Like

That solution did not work, unfortunately. When the hard drive disconnected, the script just got stuck waiting for the function checking the availability of the file to return.
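
For what it’s worth, one way to keep such a check from hanging the whole script is to run the blocking call in a goroutine and give up after a deadline. A sketch (the sentinel path and timeout are just examples) - if the disk is truly gone the blocked goroutine may never return, but at least the caller gets a timely answer:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// checkFileWithTimeout stats a sentinel file on the storage mount, but gives up
// after the timeout. If the USB disk has dropped off, os.Stat can block
// indefinitely in the kernel; that goroutine then leaks, but the caller still
// gets an "unhealthy" answer instead of hanging like my original script did.
func checkFileWithTimeout(path string, timeout time.Duration) bool {
	done := make(chan error, 1)
	go func() {
		_, err := os.Stat(path)
		done <- err
	}()
	select {
	case err := <-done:
		return err == nil
	case <-time.After(timeout):
		return false
	}
}

func main() {
	// Example sentinel: a file that should always exist on the mounted data disk.
	if !checkFileWithTimeout("/storage/storage-dir-verification", 10*time.Second) {
		fmt.Println("storage not healthy - stopping the node would go here")
		os.Exit(1)
	}
	fmt.Println("storage OK")
}
```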

I completely agree.

1 Like

As long as the requirements are reasonable and consistent it should be OK. However, some requirements appear to be unreasonable and the software does not even allow anyone to fulfill them.

For example - what could OP have done to prevent his node from being disqualified when the issue with the kernel happened?

  1. 24/7 monitoring of the node, noticing that the audit score is dropping (I hope that the audit score was slowly dropping over the 4 hours instead of going 1 → 0.5) and quickly shutting down the node. This is unreasonable.
    1a. Writing a script to do the same. This is more reasonable, but still not everyone may be able to do it, which is why I proposed that the node should shut itself down in case of audit failures.
  2. Running the node in a cluster with the Storj version of HAProxy in front. Sadly, the node software does not support this, the best he could do was to run the node VM in a cluster, but that would not have prevented the issue with the kernel.

I did not know that. Yeah, it has to feel awful, using the recommended hardware and then finding out your node was disqualified because of a USB problem.
I think that Storj is one of the few projects where following official advice increases the chance of you getting banned.

Unfortunately, statistics do not confirm your words. Most nodes are able to fulfill the requirements.
Unfortunately I cannot say how many Storage Node Operators have something additional to keep their nodes running.
I can only speak from my own experience. As you know, I deliberately and knowingly did not do anything special; all my nodes were set up following the instructions. This is to prove that the documentation is sufficient and reliable.
For example, the Raspberry Pi was set up in January 2019 and is still working. I have had a few issues though:

  • sometimes watchtower downloads a broken image and the storagenode does not start properly. This has happened 3 times. Re-downloading is the solution (maybe it’s not watchtower but Docker)
  • the OS on the microSD was corrupted (I experimented with swap on the microSD - it’s a bad idea :slight_smile: ); flashing a new image solved the problem. The microSD is still functional, but seems to be at the end of its life.

1 Like

I think worst scenario is what we have right now: Quick disqualification without prior notification, without indication and without the possibility to revoke.

Any change in those areas would improve the situation more or less.

4 Likes

I know this is not the point of your post, but this wouldn’t happen. If a node sends bad data to a customer, it’s no big deal, as the uplink would know which pieces are bad from their invalid erasure shares. That’s one of the great things about Reed-Solomon encoding. You know which pieces are bad, and this is exactly what audits use to determine which nodes have returned bad data. As long as there are still 29 good pieces, the uplink will work just fine and retrieve the requested data. If there are fewer than 29 good pieces, it will know that pieces are bad and throw an error. It will never return bad data.
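
Just to illustrate that property (using the klauspost/reedsolomon Go library as a stand-in, not the erasure coding implementation Storj actually uses, and a toy 4+2 scheme instead of the real 29-of-80):

```go
package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	// Toy 4 data + 2 parity scheme; Storj's real scheme needs 29 of 80 pieces.
	enc, err := reedsolomon.New(4, 2)
	if err != nil {
		log.Fatal(err)
	}
	shards, _ := enc.Split(bytes.Repeat([]byte("storj "), 100))
	if err := enc.Encode(shards); err != nil {
		log.Fatal(err)
	}

	// Corrupt one shard, as a misbehaving node returning bad data would.
	shards[2][0] ^= 0xFF

	ok, _ := enc.Verify(shards)
	fmt.Println("all shards consistent?", ok) // false: the corruption is detected

	// In this toy example we know which shard we broke; drop it and rebuild
	// it from the remaining good shards.
	shards[2] = nil
	if err := enc.Reconstruct(shards); err != nil {
		log.Fatal(err)
	}
	ok, _ = enc.Verify(shards)
	fmt.Println("after reconstruction?", ok) // true: no bad data reaches the client
}
```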

Let me just lead with: SNO - Storage Node Operator
So yes, please don’t add logic to my code, my brain is running just fine atm. :wink:

I think the point is that the node would have to have a way to do a full stack test. No external monitoring software is going to be able to do that if the node won’t accept uploads and downloads that haven’t been signed by any of the known satellites. So as far as I can tell, changes are required. I’m suggesting to go one step further to also allow the node to trigger that test itself and provide an endpoint that can be monitored by any monitoring software.

You’ve seen the concerns raised about that direction and don’t really provide any response or solution, so I don’t have anything to add. It’s still a bad idea.

So, without quoting specific parts, as that would make this post way too long: your suggestion would have the node suspended and excluded after only 2 failed audits with the current scoring system. But by far the biggest problem is that, with your change, messing with the audit system doesn’t lead to permanent loss. If someone trying to cheat the system can recover, they will try to find the line. It doesn’t matter if you put them in vetting again. In fact, I’d welcome it, so I can make sure one of my nodes is always in vetting to get data from both the vetting and trusted node selection cycles. I could use that system to get ahead. It also requires significant changes to satellite code, while the satellite doesn’t have any information about what kind of failure the node is experiencing. It’s a blunt force weapon swinging in the blind.

This would have killed almost all nodes during the recent troubles with ap1 and us2 though. I wrote such a script to stop the node a long time ago, but never put it into effect, as it is really an all-or-nothing approach. I do find it interesting that you find this unreasonable, but don’t mention the same for running an HA setup. I’d say that’s probably the more unreasonable expectation of the two. :wink:

Right, so this is the core issue. And I think most of what we’ve seen recently has some common characteristics. It tends to be that the node becomes unresponsive and doesn’t even log, but apparently does just enough to signal to the satellite that it is still online. I’m with @Alexey that this isn’t all that common, but at the same time, it may be the most common issue that node operators face atm. We’re still dealing with nodes that run into a failure state though and almost certainly not one caused by the node software itself. So ideally we would get better ways to monitor that so we can take better care of our nodes. But I don’t think it’s on Storj to fix the underlying issues here or be more lenient on the requirements.

That is the point. In that case the signal is wrong. The node is online, but it is not in an operational state.
To check its state, the node could detect that the satellite keeps requesting the same pieces over and over, or that a piece has been requested by the satellite (download started) but never finished (no “downloaded” status for the piece).
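
Something like this hypothetical tracker is what I mean - the names and thresholds are invented; it just records downloads that started but never finished and pieces that keep being re-requested:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// downloadWatchdog is a hypothetical node-side tracker for the two signals
// described above: downloads that start but never finish, and the satellite
// asking for the same piece over and over.
type downloadWatchdog struct {
	mu       sync.Mutex
	started  map[string]time.Time // pieceID -> when the download began
	requests map[string]int       // pieceID -> how many times it was requested
}

func newDownloadWatchdog() *downloadWatchdog {
	return &downloadWatchdog{started: map[string]time.Time{}, requests: map[string]int{}}
}

func (w *downloadWatchdog) Started(pieceID string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.started[pieceID] = time.Now()
	w.requests[pieceID]++
}

func (w *downloadWatchdog) Finished(pieceID string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	delete(w.started, pieceID)
	delete(w.requests, pieceID)
}

// Unhealthy reports true if any download has been hanging longer than maxAge,
// or the same piece has been requested more than maxRetries times without
// ever finishing - both hints that we are online but not operational.
func (w *downloadWatchdog) Unhealthy(maxAge time.Duration, maxRetries int) bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	for id, t := range w.started {
		if time.Since(t) > maxAge || w.requests[id] > maxRetries {
			return true
		}
	}
	return false
}

func main() {
	w := newDownloadWatchdog()
	w.Started("piece-123") // download begins but "downloaded" is never logged
	time.Sleep(10 * time.Millisecond)
	fmt.Println(w.Unhealthy(5*time.Millisecond, 3)) // true -> time to suspend ourselves
}
```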

1 Like

Most nodes, yes. Raspberries are not failing one after another, and neither are desktops. It is certainly possible to run a web or email server on desktop hardware with no RAID, and it will most likely work OK for years. There are servers that have been running for over 10 years with the original drives still in place. Does that mean that I should have used RAID0 instead of RAID1 when setting them up? Probably not. Most people would be able to drive with no seatbelts, because most people do not crash their cars :).

I am all for disqualifying nodes that have lost data. Losing data should not happen (though some data may be lost if the node operator does not mount the filesystem with the “sync” parameter); at the very least, the node should not lose a significant amount of data.

However, let’s take the specific problem that OP had and figure out a way to prevent such problems in the future. As far as I know, OP had been running the node successfully for years, until a kernel bug (or maybe a bit flip in non-ECC RAM - ECC RAM is not a requirement) caused the node to time out when responding to audit requests.

  1. The node port was still open, so the recommended monitoring by “uptime robot” and similar services did not indicate any problem.
  2. There was no indication of the problem in the node logs (no ERROR entries).
  3. As I understand, the only indication of the problem would be a decreasing audit score. However, as far as I know, audit score is not updated in real time, so there would be an hour or so before the lower audit score would be displayed in the web dashboard and the API.
  4. The node was disqualified in about four hours after the problem started, so the low audit score on the dashboard was probably there for 3 hours.

So, what could OP have done to prevent the problem from appearing?

  1. Not have the kernel bug - it is not really feasible to have a bug-free OS.
  2. Use ECC RAM if the problem was caused by a bit flip. Maybe the requirement list should be updated to include ECC RAM?

OK, it was not really possible to avoid the problem, so what could OP have done to prevent his node from being disqualified when the problem happened?

  1. Monitor the web dashboard more frequently than once every three hours.
  2. Use Zabbix or whatever to monitor the audit score and send him an SMS - if this happened in the middle of the night, wake up and reboot the node VM.
  3. Write a script that would shut down the node if the audit score dropped below, say, 0.9 (a rough sketch of such a script follows below).
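
Option 3 could be as simple as something like this (a rough Go sketch that polls the local dashboard API on port 14002; the endpoint path and field names are from memory and may differ between node versions, and the stop command is just a placeholder for whatever fits your setup):

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"os/exec"
	"time"
)

// satellitesResponse models just the part of the storagenode dashboard API we
// care about. Check /api/sno/satellites in a browser first - the field names
// here are from memory and may not match your node's version exactly.
type satellitesResponse struct {
	Audits []struct {
		AuditScore    float64 `json:"auditScore"`
		SatelliteName string  `json:"satelliteName"`
	} `json:"audits"`
}

func main() {
	const threshold = 0.9
	for {
		resp, err := http.Get("http://localhost:14002/api/sno/satellites")
		if err == nil {
			var data satellitesResponse
			if json.NewDecoder(resp.Body).Decode(&data) == nil {
				for _, a := range data.Audits {
					if a.AuditScore < threshold {
						log.Printf("audit score %.2f on %s below %.2f - stopping node",
							a.AuditScore, a.SatelliteName, threshold)
						// Placeholder stop command; use whatever fits your setup
						// (docker stop, systemctl stop, ...).
						exec.Command("docker", "stop", "storagenode").Run()
					}
				}
			}
			resp.Body.Close()
		}
		time.Sleep(5 * time.Minute)
	}
}
```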

If that were possible, I could set up HA and spend some of the time more than 100 meters away from my servers. So, to me, this is more reasonable than expecting me to monitor the node 24/7 and always be able to fix whatever problem occurs in less than 3 hours.

Yeah. Some nodes just fail and lose data (the single hard drive dies, the operator accidentally deletes the files, etc.), but these other problems (the USB cable fell out, the system froze) should be recoverable, since the data is not actually lost.

2 Likes

Self-healing is the process of reconstructing correct data and then ensuring that there are enough correct pieces in the network. The client’s code performs the first part, but not the second. What’s more, this process needs to be performed on a regular basis, and we don’t expect customers to have their clients running all the time like in torrents. Hence the satellites, and the audits.

And sometimes newer versions have new bugs.

In my opinion, the recommendations should be such that if the node operator follows them, he will still have a node after 10 years. This, to me, would be a more honest approach than the current one, which basically is “use whatever, it should be good enough… it wasn’t, and it’s your own fault for following our advice, you should have done more”.

Then, when a node is disqualified, it would be easy to tell the operator “see, you did not follow this part of the recommendations and that’s why your node is dead now”.

But maybe if RAID, ECC RAM, a monitoring system and such were included in the recommendations, it would scare off some potential node operators? Even so, it would be better, IMO, than deceiving those operators by saying “doing a, b and c is enough” and then punishing them for not doing d, e and f.

Also, the software itself does not really allow for fault tolerance and recovery:

  1. No support for clustering and HA (which would have saved OP).
  2. No support for backups.

So, in effect, the requirements are more strict than what I would expect from renting a VM in a normal datacenter.

2 Likes

Regarding my hardware, since someone in the thread asked for it:

  • Odroid HC2 with 8 cores, 2GB RAM, SATA port, 1Gbit Ethernet
  • 8TB Seagate IronWolf CMR disk connected to the SATA port
  • 1Gbit download / 100Mbit upload fiber-optic Internet connection
  • 1000VA UPS which backs up not only the board but also the internet connection (ONT and router)

The board is running Armbian and it is regularly upgraded. The dmesg kernel oops log is in the opening message of the thread linked at the top. It’s something that could happen to anyone, like a BSOD on Windows; it’s not the operator’s fault or the result of bad behaviour.

Before the current hardware, my node was running on a USB drive connected to an Orange Pi PC2 board (a Raspberry Pi clone) and I was experiencing continuous suspensions. So I decided to buy hardware in line with the expected Storj service quality, and it was a wise decision.

I can assure you that before this event the logs were absolutely clean, no errors at all; the node was performing well without any problems. Below is my success rate report, extracted from the online log file:

I was thinking about this, and one thing that certainly won’t do any harm would be adding a shutdown for the node if it isn’t able to serve files. Corruption is one thing,
but sometimes latency issues can cause the files to exist yet not be sent, and this will lead to DQ afaik.

Fixing that issue could be a first step toward mitigating these problems, which can happen on computers without hardware watchdogs.

It’s a very real possibility that the node can’t detect anything at all if the system is that unresponsive. So unless you have something outside of the system monitoring that, you can’t be sure you will be notified. This is why I was suggesting a status page that can be monitored by uptimerobot as that would also detect when the status page is down or the node is not able to perform the check and gives an error or timeout on the status page. It doesn’t rely on the node still being able to do anything.

Audits don’t test data, they test how reliable a node is. A single node or a few of them sending bad data is not a problem; a lot of them is. So, to ensure it never gets to that point, unreliable nodes are disqualified. All the satellite really knows about “good” nodes is that they don’t have a lot of bad data. They may have some that is never audited, they may have some that has been audited but didn’t rise to disqualification. The satellite just determines that nodes are still good enough based on audit responses. There’s a lot of statistical calculation that went into determining what levels of loss are acceptable, to prevent ever losing customer data. I believe Storj quotes 11 9’s of durability. But so far not a single segment has ever been lost.

You can provide a status check that follows the same path as an audit would. So you can get really close to the exact same procedure. And even despite that you can make sure SNOs know the status page is only indicative and satellite audits necessarily remain the authoritative judgement of node health.

Fair enough, it seemed to be phrased as a counter argument, but was missing the argument. But it is indeed up to Storj.

Right, I skipped a step in my response. Currently the audit score drops 5% on a failed audit, so 2 failures in quick succession would drop it to 90%, which was the threshold you mentioned. I guess it should technically be 3 to drop BELOW 90%. But that’s where that number came from.

Sure, but I think that may point to a very specific issue and there may be many different things that can lead to a system being unresponsive. I agree with you that it’s not up to Storj to solve any possible problem that any system might have. So I would suggest going for a detect and report approach and leave it up to the SNO to fix the underlying issue.

But that would exclude a lot of people who are happy to take that risk of loss, many of whom have been running nodes for years without problems. Advising only limited-risk setups in a network that is entirely built to deal with high-risk setups also doesn’t really make sense. Maybe they can be a little clearer about what the consequences of hardware choices are. But then again, with a little effort to keep an eye on things, many people run RPis with USB drives for years. The danger of not recommending those simpler setups is that it would lead to centralization. If only people with datacenter-like setups ran nodes, that necessarily leads to more centralization, which is actually worse for the network than having more independent but less reliable nodes (no risk of coordinated failure among unrelated nodes).
This is almost turning into a game theory problem…

Then they can take the risk by doing less than recommended, but at least they would know that there is a risk. For example, while it is recommended to change the oil in a car every year or after driving 10,000 km, you can choose not to do it and would probably not see any problems for years. But wouldn’t this be better than being told “no, you never need to change the oil” and then having the engine fail because of the old oil?

Consider system requirements for a game or program. There usually are “minimum” and “recommended” setups. You can try to run it on slower hardware but don’t complain if it does not work. Also, don’t complain that the game does not run on high settings on the “minimum” hardware.

There are people who have been driving drunk for years and never get caught or crash (until they do). This does not mean that doing so is a good idea. However, at least they make an informed decision to do it (“it is illegal to drive drunk, but I’m going to do it anyway”). Someone who has read the recommendations for running a node and done everything that was recommended would be happy that he did everything he was supposed to do to run his node. However, if it fails, that same operator gets told “no, you didn’t do everything - it’s your fault for not doing the things that were not on the list”.

Or maybe, once the real recommendations are written, it turns out that “monitor the node no less frequently than every 3 hours, and be ready to fix problems within said 3 hours at all times” is a bit too much to ask of a, presumably, home user.

2 Likes

Analogies only get us so far and they really fail here. If a node being disqualified would mean you destroy your car (or your node hardware for that matter) or run the risk of killing someone, I imagine the recommendations would be different. I don’t see how these are useful comparisons to make.

How’s this for a recommendation: “accept your node might fail and you would have to start over”
It’s not the end of the world. Small nodes would fill up again, larger nodes would perhaps take a bit longer, but it’s temporary loss of additional income. It’s not a loss of initial investment, hardware or a possible risk to your own safety. Let’s try to keep some perspective, shall we?

No point in doing something that will most likely fail IMO.

And still, I think the oil change analogy is a good one, in that you can disregard the recommendation and not have any problems for many years, even telling others about your lack of bad experience.

While the loss of a node does not cost as much as an engine rebuild, it is still a significant loss of time and effort, if nothing else.
I think that people should be allowed to run their nodes however they see fit, but that they should be informed about what the really proper way to do it is. Then, if you want to cut corners, feel free to do so; maybe nothing bad happens, maybe your node blows up, but don’t say I didn’t warn you.

Imagine your client wants to run something important on a Raspberry Pi with no RAID. You should inform him that it’s a bad idea and suggest a good one, but still allow him to choose the bad idea. If something bad happens later, at least you informed him that it was the wrong way to do it and he chose to disregard your advice.

This is what I would like to see in the SNO recommendations - the real, proper way to do it. Then, if someone wants to not use RAID, ECC memory or monitoring, he can do it, but at least he now knows what he should have done.

Since August 1st, my node has received 200GB of ingress, so it will most likely get 400GB this month. Assuming this 400GB/month ingress is constant and there are no deletes, it would take ~46 months for my node to fill back up to 18.35TB.
Almost 4 years is more than “a bit” longer. I would probably be better off using those drives for Chia if my node blew up.

2 Likes

It’s not, because you are putting an upfront investment at risk by not changing the oil.

Well, there is nothing else. And if you’ve set up a node before, you can set up a new one in about half an hour… I would say that is basically no cost at all, since none of us would be here if we didn’t find this stuff interesting to begin with.

The proper way to do it is to use hardware you already have, so you have no upfront costs and all of it is profit to begin with. This benefits the SNO as there is literally no downside and it benefits the network as it promotes decentralization.

If he wants to run a fleet of them all over the world and has built-in redundancy in the form of erasure coding, I would encourage him to do that and see if he can make it cost effective. You say “something important”, but your individual node is not important. So again, this analogy doesn’t work.

You disagree fundamentally with how the people building the network envisioned it. They will never say that that is the proper way to do it, because they don’t agree that it is. And neither do I. We can go through the trouble of calculating the cost of the risk of loss. I’m sure Storj Labs has some stats on average risk of disqualification. I already did that calculation a while ago for RAID, but now you’re throwing in the cost of ECC memory, and what else? Redundant power supplies, UPS or even multiple locations and entirely separate systems for HA setups? You’re raising the upfront costs sky high to prevent a relatively low chance of temporary loss of income. That calculation just doesn’t add up. And furthermore, setting such standards would scare people with smaller and less redundant setups away from giving this a try, while the entire point of Storj is to have data as distributed as possible and using hardware that would otherwise go to waste. I can guarantee you, you are never going to convince Storj Labs of that stance. As it is fundamentally contrary to what Storj was built for.

I do both, Chia keeps the drives warm for when Storj wants to take over. Sounds like a good middle ground to me.

So in conclusion, I have a deal for you. I would like to start sending you money every month. I’ll start with small amounts that will go up over time. Now, there is a risk that in a year or two I will stop sending you money. If I do, just ask me to start again and I will. But when I do I will start with sending small amounts again. Do you accept? Or would you rather not get any money at all because there is a risk you might no longer get it or get a little less in the future?
Alternatively, if you give me $1000 now, I will almost certainly not stop sending you money… buuut, I still might and it may take more than 3 years before I give you at least that amount back.

Let me know if you need my account number to send me the $1000. :wink: I’ll be sure to send you these amounts in the first year.
[image: estimated earnings for the first year]
You know… as long as you don’t mess something up yourself :wink:

Particularly if that user is on L1 and waits a year to get paid anything at all!

I don’t think it’s a serious question, but they only pay me for my nodes, like any other SNO. I just happen to agree with their vision for the most part. If I didn’t, I wouldn’t put so much effort into being here, discussing things and creating tools to help SNOs out. The idea of being paid is tempting, but I believe that being independent is more valuable. For example, I initially advocated for Storj to provide a better earnings estimator. But what is better than an independent estimator that also keeps Storj Labs in check? They have no control over it, and if the network starts being less profitable for node operators, the estimator will reflect that whether they want it to or not. Over recent months I’ve adjusted it down several times. I don’t think that would have happened as quickly (or perhaps at all) if they were in control, and I would probably feel like I should discuss it with them first if I were paid. So no, I value my independence. And my opinion cannot be bought.
But choosing my battles carefully also means that when I do speak out about something, it tends to be taken seriously. I’ve posted quite a few suggestions and alerted them of issues and so far the majority have been picked up and resolved or implemented. Many others got insightful responses from the Storj Labs team. I find it just leads to a more constructive conversation. But if you have any doubts about my integrity, I’ll gladly back up my positions further. I’m not here to have a universal agreement, that’s just boring anyway. So bring it on and put my feet to the fire! :wink:

7 Likes