Put the node into an offline/suspended state when audits are failing

How do you tell the difference? How should the satellite know which operators are bad and need punishment and which don’t?

It makes a huge difference whether a bad node gets disqualified or suspended. The whole idea behind disqualification is to prevent bad nodes from even trying. If we suspended them first, that would mean they could get away with it. They could simply delete 1% of their hard drive every week until they get suspended and then wait for the system to unsuspend them, keeping it at that level for maximum profit. You see, we can’t implement something that would allow bad nodes to game us.

4 Likes

in a sense i agree with bright, but i also think one could simply make a script for it… even if it comes with a bit more overhead.

so then imho the question really is whether storjlabs want to deal with the issue head on, or push it off until they are forced to deal with it.

having nodes silently running such a system cannot be worse than what storjlabs can conjure up.

I’m probably missing something here. How is suspending a node leading to bad actors taking advantage?

it’s not about bad actors, it’s about the network’s data integrity.

Yes, of course, provided you allow enough time to sort the issue out. Sending an email at 3 a.m. local time and expecting that the issue will be fixed in an hour would be quite useless. :slightly_smiling_face:

That’s not what I wrote. My node HAD a problem, that is 100% sure, but I had no way to know that a problem existed. Storj, on the other hand, knew exactly that there was a problem. That’s the point: the lack of communication.

Do whatever you want, this suggestion is just to start the discussion, but please enhance the flow of information between satellites and SNOs, for the benefit of SNOs, Storj and your clients. We would all benefit from SNOs responding quickly to problems.

I understand the problem of malicious nodes, but to be honest, before you mentioned it I hadn’t even thought about the possibility that some SNOs could do that. Fooling the system is not my way of working, but I understand that you must prevent this kind of bad behaviour.

1 Like

Then, if anything, what you are achieving this way is at most explaining why the proposal is not the best way to handle pietro’s case. But then there’s also e.g. this thread or this one. While the latter gained a specific solution, the idea from this thread would improve the situation of all three of these cases with a single approach. And, as you already stated, there might be other failure modes still lurking out there.

That’s why I believe it does have merit to discuss this specific idea away from a single incident.

I think that Storj sometimes forgets that the node operators are regular people, some of them even running their nodes on the “recommended hardware” (a Raspberry Pi with a single HDD connected over USB), not companies operating servers in a Tier 3 datacenter. This disconnect between recommendations and requirements is something I noticed a long time ago and have commented on in the past.

The biggest difference between the two is not the hardware and not the reliability (as seen, even an RPi can be pretty reliable), but the response time, especially since the node software does not support running in a cluster or using HA.

If I rent a server in a proper datacenter, I can pay additional money for 24/7 support. They will have an employee nearby in case of an emergency, and problems will be solved in an hour at most (or whatever the contract says). It is, IMO, very unreasonable to ask the same of a regular person, who does not have employees, probably spends a lot of time every day away from the servers (at work), and even goes on vacation. In case of any problem with the node, the operator should be given sufficient time to fix it before the node is permanently disqualified.

So far, of the failure modes, some are handled, IMO, correctly:

  1. An offline node is allowed sufficient time to return (unlike the previous, unreasonable requirement of no more than 5 hours of downtime).
  2. In the case of “unknown” audit errors, the node is suspended and, it seems, allowed enough time to fix whatever the problem is.
  3. If the hard drive disappears (the USB connector fell out or something), the node shuts down and refuses to start.

So, now we have regular audits. As I understand it, they can fail in three different modes - “file not found”, corrupted data, and a timeout. None of these, especially the timeout, definitively proves that the node has lost data. It is very likely that “file not found” means the data is lost, but it could also be that the USB cable fell out, or the file was deleted by the network 0.1s ago, or the file had an expiration date. Corrupted data most likely means the hard drive is failing, but it could also be due to some other reason.

While the previous two have a high chance of meaning the data has been lost, a timeout does not mean that. The node could be overloaded or something. Yes, it is a problem and the operator should fix it, but not within 4 hours.

Here’s what I think should happen:

  1. Satellite issues an audit request to the node.
  2. Node tries to read the file and encounters an error it can detect (file not found) or an error it cannot detect (corrupted data or timeout).
  3. If the error is of a type that the node cannot detect, the satellite informs it that there was a problem.
  4. If two problems happen in a row, the node marks this in the database or creates some file to mark it and shuts down.
  5. The node refuses to start, unless the mark is (manually) removed - this is to prevent auto-restart scripts from just restarting the node.
  6. The node operator gets an email informing them of the problem (that the node is offline).

The node operator can then fix the problem and restart the node; hopefully there are no more problems. If there are, the node will fail more audits, shut down again and eventually be disqualified.
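
A minimal sketch of how steps 4 and 5 could look inside the node, assuming a hypothetical marker file; nothing like this exists in the current storagenode, and the file name and path are made up:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// sentinel is a hypothetical marker file; its presence means the node
// previously shut itself down after consecutive suspicious audit results.
const sentinel = "audit_failure.lock"

// refuseStartIfMarked implements step 5: the node refuses to start until the
// operator has investigated the issue and manually removed the marker.
func refuseStartIfMarked(storageDir string) error {
	if _, err := os.Stat(filepath.Join(storageDir, sentinel)); err == nil {
		return fmt.Errorf("previous run detected repeated audit failures; "+
			"inspect the node, then remove %s to start again", sentinel)
	}
	return nil
}

// markAndShutdown implements step 4: after two suspicious audit results in a
// row, persist the marker and exit, so auto-restart scripts cannot simply loop.
func markAndShutdown(storageDir string) {
	_ = os.WriteFile(filepath.Join(storageDir, sentinel), []byte("audit failures"), 0o644)
	os.Exit(1)
}

func main() {
	storageDir := "/mnt/storj/storage" // example path, adjust to your setup
	if err := refuseStartIfMarked(storageDir); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("no failure marker found, the node would start normally")
}
```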

6 Likes

I’ve given this a bit more thought, and considering that most home operators would not have a separate system, I think it would be better to have some self-test features integrated into the node instead of MND. This would also save single-node operators from having to install something separately. What I was thinking of would be a simple status page on a separate port. When that page is loaded, the node does a quick self test of uploading, reading and deleting a piece and then provides a result. Many node operators are already using Uptime Robot, which also supports keyword detection, so you could easily set up a monitor to request that page and look for the success result. This would have the added benefit that Uptime Robot also tracks response times, which would be an early indicator of things going in the wrong direction. It also saves Storj Labs from having to set up notification features and allows people to choose their own way to monitor that status page.
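
A rough sketch of such a status page, purely illustrative - the port, the endpoint and the self-test body are all made up, and a real implementation would exercise the node’s actual piece store rather than a plain directory:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
	"path/filepath"
	"time"
)

// runSelfTest is a stand-in for a real write/read/delete cycle against the
// node's piece store; here it just exercises the storage directory so the
// check fails when the disk is missing, read-only, or hanging.
func runSelfTest(storageDir string) error {
	p := filepath.Join(storageDir, ".selftest")
	if err := os.WriteFile(p, []byte("selftest"), 0o644); err != nil {
		return fmt.Errorf("write: %w", err)
	}
	if _, err := os.ReadFile(p); err != nil {
		return fmt.Errorf("read: %w", err)
	}
	if err := os.Remove(p); err != nil {
		return fmt.Errorf("delete: %w", err)
	}
	return nil
}

func main() {
	storageDir := "/mnt/storj/storage" // example path, adjust to your setup
	http.HandleFunc("/status", func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		if err := runSelfTest(storageDir); err != nil {
			http.Error(w, "FAIL: "+err.Error(), http.StatusServiceUnavailable)
			return
		}
		// Uptime Robot keyword monitoring can simply look for the word "SUCCESS".
		fmt.Fprintf(w, "SUCCESS (self test took %s)\n", time.Since(start))
	})
	// A separate port keeps the check independent of the existing dashboard.
	log.Fatal(http.ListenAndServe(":14005", nil))
}
```

An external monitor would then poll that URL and alert on a missing “SUCCESS” keyword or on rising response times.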

This is a little beside the point, but I heard it mentioned that MND would eventually replace the dashboard as it exists now. If that happens I would also suggest integrating MND in the node software (installer or docker container) to avoid having to install separate things.

It allows them to try and find where the line is without permanent consequences. Pushing it further and further until suspended and then backing off slightly. Remember that the satellite only knows it’s not getting responses to audits, it doesn’t know if that’s because someone is trying to cheat the system or having intermittent issues.

1 Like

That solution did not work, unfortunately. When a hard drive disconnected, the script just got stuck waiting to return from the function checking the availability of the file.

I completely agree.

1 Like

I’m pleased that how quickly a node can get disqualified from the network is being discussed.

I think LittleSkunk’s point is really valid. I know it’s sad when we lose a node, but any loophole in the node code will be exploited - and while it’s annoying, the Storj network has to be 100% reliable; even if the chance were 0.00000001%, the risk of sending bad data to a paying customer would destroy the network if it went public.

I also think it’s really important NOT to put any more code or logic into the SNO node code when it comes to trust or reputation - again, stating the obvious, it is open source and you can draw your own conclusions - the storage node is not meant to be a monitoring stack, and the team would end up with even more code to check - there are plenty of other products out there for monitoring, maybe more guides on how to set them up?

I do think there are some code changes that could be made on the satellites to soften the DQ blow, while maintaining the network.

Looking at others’ suggestions, it’s all too much on the node - I really think the satellite has to make the call - if a node is bad, it has to go from the network - failing audits is sad, but they can so often be caused by bad hardware / configuration / an over-allocated hard drive (i.e. SMR dies as it fills up), the filesystem affecting performance, fragmentation, and sooo many more :frowning: maybe taking a bit of what we already have, like below;

#important I think that if a node goes into audit failure mode, the penalty has to be really bad to discourage bad behaviour; if there is a way back, it should be better than starting from scratch, but bad enough that bad behaviour is not profitable.

a) The satellites audit like normal, until a node falls into the 90% audit failure zone for a satellite (yes, no longer 60% - that’s far too nice)
b) flag the node in the satellite DB as excluded, to prevent it being offered to clients for upload/download (effectively making the node offline, but at the satellite - there already looks to be some boilerplate code around excluded nodes, although I haven’t delved deeper). Also, an offline node gets no money for egress traffic, and in this 90% audit failure zone a halt is put on payment for stored data - effectively, for all the time the node is in this offline 90% state it gets paid nothing per day
c) notify the storage node operator - could be email, could be a new flag sent to the dashboard (maybe both - there needs to be an API method for the SNO to say the node is back online)
d) start a 30-day DQ window, like the one used for online score (making it so you are only allowed 1 DQ event per node (not per satellite) in 30 days, to prevent gaming of funds and to really DQ bad nodes)
e1) the node operator fixes the issue and sets a flag on the dashboard that gets sent to the satellite to bring the node out of the DQ window
e2) the satellite does a test audit on the node from a queue (not realtime); if it fails, the node stays excluded and the operator is told the node is still not working
e3) if it works, the node is flagged as good, the audit score is set back to 100%, and the node vetting level is set to 0% to force re-vetting on all satellites - the node would be limited by the current process to taking only a small amount of traffic until re-vetted, 30-60 days (the penalty for failing audits). A rough sketch of this flow is below.
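
Purely to make the proposed flow concrete, a hypothetical sketch of the state transitions; none of these types or thresholds exist in the real satellite code, and the 90% zone and 30-day window are just the numbers suggested above:

```go
package main

import (
	"fmt"
	"time"
)

// NodeState models the proposed satellite-side states (hypothetical only).
type NodeState int

const (
	Active       NodeState = iota
	Excluded               // step b: not offered to clients, payout halted
	Disqualified           // the 30-day window from step d ran out
)

// Node holds just enough satellite-side bookkeeping for this sketch.
type Node struct {
	State      NodeState
	AuditScore float64   // 1.0 means every audit passed
	ExcludedAt time.Time // when the node entered the excluded state
}

const (
	excludeThreshold = 0.90                // step a: the suggested 90% zone
	dqWindow         = 30 * 24 * time.Hour // step d: the suggested 30-day window
)

// Evaluate applies steps a, b and d after an audit round.
func (n *Node) Evaluate(now time.Time) {
	switch {
	case n.State == Active && n.AuditScore < excludeThreshold:
		n.State, n.ExcludedAt = Excluded, now // exclude and notify the operator (step c)
	case n.State == Excluded && now.Sub(n.ExcludedAt) > dqWindow:
		n.State = Disqualified // never fixed within the window
	}
}

// OperatorFixed applies steps e1-e3: a queued test audit decides the outcome.
func (n *Node) OperatorFixed(testAuditPassed bool) {
	if n.State == Excluded && testAuditPassed {
		n.State, n.AuditScore = Active, 1.0 // e3: restored; re-vetting would start here
	}
}

func main() {
	n := &Node{State: Active, AuditScore: 0.85}
	n.Evaluate(time.Now())
	fmt.Println("excluded after audit round:", n.State == Excluded) // true
	n.OperatorFixed(true)
	fmt.Println("restored after operator fix:", n.State == Active) // true
}
```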

As long as the requirements are reasonable and consistent it should be OK. However, some requirements appear to be unreasonable and the software does not even allow anyone to fulfill them.

For example, what could the OP have done to prevent his node from being disqualified when the issue with the kernel happened?

  1. 24/7 monitoring of the node, noticing that the audit score is dropping (I hope that the audit score was slowly dropping over the 4 hours instead of going 1 → 0.5), and quickly shutting down the node. This is unreasonable.
    1a. Writing a script to do the same. This is more reasonable, but still, not everyone may be able to do it, so I proposed that the node should shut itself down in case of audit failures.
  2. Running the node in a cluster with a Storj version of HAProxy in front. Sadly, the node software does not support this; the best he could do was run the node VM in a cluster, but that would not have prevented the issue with the kernel.

I did not know that. Yea, it has to feel awful, using the recommended hardware and then finding out your node was disqualified because of a USB problem.
I think that Storj is one of the few projects where following official advice increases the chance of you getting banned.

Unfortunately, statistics do not confirm your words. Most nodes are able to fulfill the requirements.
Unfortunately I cannot say how many Storage Node Operators have something additional in place to keep their nodes running.
I can only speak from my own experience. As you know, I especially and knowingly did not do anything special; all my nodes were set up following the instructions. This is done to prove that the documentation is sufficient and reliable.
For example, the Raspberry Pi was set up in January 2019 and is still working. I have had a few issues though:

  • sometimes watchtower downloads a broken image and the storagenode does not start properly. This happened 3 times; re-downloading is the solution (maybe it’s not watchtower but docker)
  • the OS on the microSD was corrupted (I experimented with swap on microSD - it’s a bad idea :slight_smile: ); flashing a new image solved the problem. The microSD is still functional, but seems to be at the end of its life.
1 Like

I think the worst scenario is what we have right now: quick disqualification without prior notification, without indication, and without the possibility of revoking it.

Any change in those areas would improve the situation more or less.

4 Likes

I know this is not the point of your post, but this wouldn’t happen. If a node sends bad data to a customer, it’s no big deal, as the uplink would know which pieces are bad thanks to the erasure coding. That’s one of the great things about Reed-Solomon encoding: you know which pieces are bad, and this is exactly what audits use to determine which nodes have returned bad data. As long as there are still 29 good pieces, the uplink will work just fine and retrieve the requested data. If there are fewer than 29 good pieces, it will know that pieces are bad and throw an error. It will never return bad data.
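
Purely as an illustration of the “any 29 good pieces” property described above, a sketch using the third-party klauspost/reedsolomon library with made-up shard counts; the actual uplink uses Storj’s own erasure coding and signed piece hashes, so this is not its real code:

```go
package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	const dataShards, parityShards = 29, 51 // 80 pieces total, any 29 suffice
	enc, err := reedsolomon.New(dataShards, parityShards)
	if err != nil {
		log.Fatal(err)
	}

	original := bytes.Repeat([]byte("segment data "), 1000)
	shards, err := enc.Split(original)
	if err != nil {
		log.Fatal(err)
	}
	if err := enc.Encode(shards); err != nil {
		log.Fatal(err)
	}

	// Pretend 51 nodes returned pieces that failed their integrity checks:
	// the client discards them (sets them to nil) and reconstructs the
	// segment from the remaining 29 good pieces.
	for i := 0; i < parityShards; i++ {
		shards[i] = nil
	}
	if err := enc.Reconstruct(shards); err != nil {
		log.Fatal(err) // only fails if fewer than 29 good pieces remain
	}

	var out bytes.Buffer
	if err := enc.Join(&out, shards, len(original)); err != nil {
		log.Fatal(err)
	}
	fmt.Println("reconstructed matches original:", bytes.Equal(out.Bytes(), original))
}
```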

Let me just lead with: SNO - Storage Node Operator
So yes, please don’t add logic to my code, my brain is running just fine atm. :wink:

I think the point is that the node would have to have a way to do a full stack test. No external monitoring software is going to be able to do that if the node won’t accept uploads and downloads that haven’t been signed by any of the known satellites. So as far as I can tell, changes are required. I’m suggesting to go one step further to also allow the node to trigger that test itself and provide an endpoint that can be monitored by any monitoring software.

You’ve seen the concerns raised about that direction and don’t really provide any response or solution, so I don’t have anything to add. It’s still a bad idea.

So, without quoting specific parts as that would make this post way too long: your suggestion would have the node suspended and excluded after only 2 failed audits with the current scoring system. But by far the biggest problem is that messing with the audit system doesn’t lead to permanent loss. If someone trying to cheat the system can recover, they will try and find the line. It doesn’t matter if you put them in vetting again. In fact, I’d welcome it, so I can make sure one of my nodes is always in vetting to get data from both the vetting and trusted node selection cycles. I could use that system to get ahead. It also requires significant changes to satellite code, while the satellite doesn’t have any information about what kind of failure the node is experiencing. It’s a blunt-force weapon swinging blind.

This would have killed almost all nodes during recent troubles with ap1 and us2 though. I have written such a script to stop the node a long time ago, but never put it into effect as it is really an all or nothing approach. I do find it interesting that you find this unreasonable, but don’t mention the same for running an HA setup. I’d say that’s probably the more unreasonable expectation of the two. :wink:

Right, so this is the core issue. And I think most of what we’ve seen recently has some common characteristics. It tends to be that the node becomes unresponsive and doesn’t even log, but apparently does just enough to signal to the satellite that it is still online. I’m with @Alexey that this isn’t all that common, but at the same time, it may be the most common issue that node operators face atm. We’re still dealing with nodes that run into a failure state though and almost certainly not one caused by the node software itself. So ideally we would get better ways to monitor that so we can take better care of our nodes. But I don’t think it’s on Storj to fix the underlying issues here or be more lenient on the requirements.

That is the point. In that case the signal is wrong. The node is online, but it is not in an operational state.
To check the state, the node could detect that the satellite keeps requesting the same pieces over and over, or it could detect that a piece has been requested by the satellite (download started) but never finished (no “downloaded” status for the piece).
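
A rough, purely hypothetical sketch of the second idea - the node tracking downloads that started but never finished; none of these types exist in the current storagenode, and the deadline and two-strike threshold are invented for illustration:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// stuckTracker records downloads that started but never reached a
// "downloaded" state within a deadline (hypothetical, for illustration).
type stuckTracker struct {
	mu      sync.Mutex
	started map[string]time.Time // pieceID -> when the download began
	stuck   int                  // consecutive stuck transfers observed
}

func newStuckTracker() *stuckTracker {
	return &stuckTracker{started: make(map[string]time.Time)}
}

func (t *stuckTracker) Started(pieceID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.started[pieceID] = time.Now()
}

func (t *stuckTracker) Finished(pieceID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.started, pieceID)
	t.stuck = 0 // a completed transfer resets the strike counter
}

// ShouldShutDown reports whether enough transfers have hung past the deadline
// that the node should take itself offline instead of silently failing audits.
func (t *stuckTracker) ShouldShutDown(deadline time.Duration, maxStuck int) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	for id, begun := range t.started {
		if time.Since(begun) > deadline {
			delete(t.started, id)
			t.stuck++
		}
	}
	return t.stuck >= maxStuck
}

func main() {
	tr := newStuckTracker()
	tr.Started("piece-1")
	tr.Started("piece-2")
	time.Sleep(10 * time.Millisecond) // pretend both downloads hung
	fmt.Println("shut down?", tr.ShouldShutDown(5*time.Millisecond, 2)) // true
}
```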

1 Like

Most nodes, yes. Raspberries are not failing one after another, and neither are desktops. It is certainly possible to run a web or email server on desktop hardware with no RAID, and it will most likely work OK for years. There are servers that have been running for over 10 years with the original drives still in place. Does that mean that I should have used RAID0 instead of RAID1 when setting them up? Probably not. Most people would be able to drive with no seatbelts, because most people do not crash their cars :).

I am all for disqualifying nodes that have lost data. Losing data should not happen (though some data may be lost if the node operator does not mount the filesystem with the “sync” parameter); at the very least, the node should not lose a significant amount of data.

However, let’s take the specific problem that the OP had and figure out a way to prevent such problems in the future. As far as I know, the OP was running the node successfully for years, until a kernel bug (or maybe a bit flip in non-ECC RAM - ECC RAM is not a requirement) caused the node to time out when responding to audit requests.

  1. The node port was still open, so the recommended monitoring by “uptime robot” and similar services did not indicate any problem.
  2. There was no indication of the problem in the node logs (no ERROR entries).
  3. As I understand, the only indication of the problem would be a decreasing audit score. However, as far as I know, audit score is not updated in real time, so there would be an hour or so before the lower audit score would be displayed in the web dashboard and the API.
  4. The node was disqualified about four hours after the problem started, so the low audit score was probably visible on the dashboard for 3 hours.

So, what could OP have done to prevent the problem from appearing?

  1. Not have the kernel bug - it is not really feasible to have a bug-free OS.
  2. Use ECC RAM if the problem was caused by a bit flip. Maybe the requirement list should be updated to include ECC RAM?

OK, it was not really possible to avoid the problem, so what could OP have done to prevent his node from being disqualified when the problem happened?

  1. Monitor the web dashboard more frequently than once every three hours.
  2. Use zabbix or whatever to monitor the audit score and send him an SMS - if this happened in the middle of the night, wake up and reboot the node VM.
  3. Write a script that would shut down the node if the audit score dropped below, say, 0.9 (a rough sketch of such a script is below).
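
A rough sketch of option 3, assuming the node’s local dashboard API exposes per-satellite audit scores at /api/sno/satellites (the endpoint and field names may differ between storagenode versions, so treat them as assumptions) and that the node runs in docker:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os/exec"
)

// satellitesResponse models only the fields this sketch needs; the exact
// shape of the dashboard API may differ between storagenode versions.
type satellitesResponse struct {
	Audits []struct {
		AuditScore    float64 `json:"auditScore"`
		SatelliteName string  `json:"satelliteName"`
	} `json:"audits"`
}

func main() {
	// The local dashboard API; adjust host and port to your setup.
	resp, err := http.Get("http://localhost:14002/api/sno/satellites")
	if err != nil {
		fmt.Println("dashboard unreachable:", err)
		return
	}
	defer resp.Body.Close()

	var data satellitesResponse
	if err := json.NewDecoder(resp.Body).Decode(&data); err != nil {
		fmt.Println("unexpected response:", err)
		return
	}

	for _, a := range data.Audits {
		if a.AuditScore < 0.9 {
			fmt.Printf("audit score %.3f on %s - stopping node\n", a.AuditScore, a.SatelliteName)
			// Example only: stop a docker-based node so it cannot fail further audits.
			_ = exec.Command("docker", "stop", "storagenode").Run()
			return
		}
	}
	fmt.Println("all audit scores look fine")
}
```

Something like this could be run from cron every few minutes; the trade-off, as noted elsewhere in the thread, is that it is an all-or-nothing approach.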

If it were possible, I could set up HA and then sometimes be more than 100 meters away from my servers. So, to me, this is more reasonable than expecting me to monitor the node 24/7 and always be able to fix whatever problem occurs in less than 3 hours.

Yeah. Some nodes just fail and lose data (the single hard drive dies, the operator accidentally deletes the files, etc.), but these other problems (a USB cable fell out, the system froze) should be recoverable, since the data is not actually lost.

2 Likes

Interesting point - so you are saying that sending bad data to the customer is not an issue ?

so why do we audit nodes at all ? is the entire process just not required as the network can heal and diagnose itself ?

I think you are right, however this moves the burden of liability in the terms of service from the SNO to Storj Labs - there is no safe way to provide an SNO code base that warrants 100% detection of failure… I can see the forum posts now :slight_smile: My node has been disqualified, and the full stack testing never picked it up… give back all held $$$ now or else…

I listed points, they were a high-level view - contributing code is not good, I don’t know Go :frowning: again, I agree with you that something needs to be done; I don’t want to argue over points Storj would have to choose, and I don’t see it as a priority.

Where did I say that? I never mentioned 2 in my post; I think you are misquoting me because I didn’t explain it well :slight_smile: A DQ event would be when your node fails enough audits - I mentioned a %.

I’ll leave it there, I haven’t got the energy to argue today :stuck_out_tongue:

It would be interesting to know the output from uname -a to see what kernel was being run, if it’s being pointed at a kernel bug - running a node is not just about the node, it’s also about the responsibility of maintaining the OS, including updates. The output from dmesg would also help, as would checking for a core dump.

Agree with this, it’s the only way to be sure there’s no corruption - but again, if we follow the official Storj advice on running a node, well, the odds are stacked against you still having a node in 2-3 years’ time :smiley:

I would be interested to know how many old operators don’t use some form of RAID, don’t run more than one node to spread the risk of failure, don’t use docker, don’t use virtualization?

Self-healing is the process of reconstructing correct data and then ensuring that there are enough correct pieces in the network. The client’s code performs the first part, but not the second. What’s more, this process needs to be performed on a regular basis, and we don’t expect customers to have their clients running all the time like in torrents. Hence the satellites, and the audits.

And sometimes newer versions have new bugs.

In my opinion, the recommendations should be such that if the node operator follows them, he will still have a node after 10 years. This, to me, would be a more honest approach than the current one, which basically is “use whatever, it should be good enough… it wasn’t, and it’s your own fault for following our advice, you should have done more”.

Then, when a node is disqualified, it would be easy to tell the operator “see, you did not follow this part of the recommendations and that’s why your node is dead now”.

But maybe if RAID, ECC RAM, a monitoring system and such were included in the recommendations, it would scare off some potential node operators? It would still be better, IMO, than deceiving those operators by saying “doing a, b and c is enough” and then punishing them for not doing d, e and f.

Also, the software itself does not really allow for fault tolerance and recovery:

  1. No support for clustering and HA (which would have saved OP).
  2. No support for backups.

So, in effect, the requirements are more strict than what I would expect from renting a VM in a normal datacenter.

2 Likes

Regarding my hardware, since someone in the thread asked for it:

  • Odroid HC2 with 8 cores, 2GB RAM, SATA port, 1Gbit Ethernet
  • 8TB Seagate IronWolf CMR disk connected to the SATA port
  • 1Gbit download / 100Mbit upload fiber optic Internet connection
  • 1000VA UPS which backs up not only the board but also the internet connection (ONT and router)

The board is running Armbian and it is regularly upgraded. The dmesg kernel oops log is in the opening message of the thread linked at the top. It’s something that could happen to anyone, like a BSOD on Windows; it’s not the operator’s fault or the result of bad behaviour.

Before the current hardware, my node was running on a USB drive connected to an Orange Pi PC2 board (a Raspberry Pi clone) and I was experiencing continuous suspensions. So I decided to buy hardware in line with the expected Storj service quality, and it was a wise decision.

I can assure you that before this event the logs were absolutely clean, no errors at all; the node was performing well without any problems. Below is my success rate report, extracted from the online log file: