Put the node into an offline/suspended state when audits are failing

Yeah, but in those scenarios they can just disable audits for a bit, not update the score, or even reinstate nodes that got hit. I'm pretty sure that alarm bells go off quite quickly when all nodes start dropping in scores. And even on the forum this would be noticed within hours.

And yes, there is an incentive for Storj to prevent repair, but they can play the numbers game and, on average, with very few nodes running into these issues, I don't think it's a major concern. The hit is of course felt much harder by the affected SNOs. That doesn't mean it should get less priority. I think the SNO concern is very valid and this should be taken seriously as a SNO experience issue. I don't think the repair costs alone would provide sufficient incentive to prioritize this. So it is up to us to point out that this is a concern. I'm sure that's starting to become clear to them now.

The chance can go up with time. That is, it's more likely for such a problem to occur in 10 years than in a year.
I would be feeling stupid if I lost $1500 because I decided to, say, save a few hundred on a hard drive and ran without RAID.

Or that a node survives long enough to get to 18TB and then fails. If a 2-month-old node fails, the operator probably would not be as upset about it compared to a 3-year-old node failing.

I used “whatever” hardware for v2 and I was changing nodes fairly frequently (due to software problems mainly), but for v2 it did not really matter. For v3 it matters IMO, so, when I first read about the requirements (especially uptime) I went “well, no more playing around, I have to do this properly, the same way I would build someone’s email server”.

Maybe the recommendations should be two-part, kind of like “minimum and recommended” system requirements for games, just called differently, something like “basic” and “expert”. It would be, IMO, a good learning opportunity for newbies on how to build resilient systems (if they want to).

1 Like

You can always choose to invest more and upgrade a setup once it gets to that point.

We have also been tracking wider zero-day exploit problems on the host OS side. In that respect it is a good thing Storj supports multiple platforms, since it is unlikely multiple OSes will be hit at once. It is pretty scary that, more than a month on, MS still has not managed to mitigate PrintNightmare. We have resorted to using third-party patch services like 0patch to try and keep our clients safe. This was not a choice made easily, but we were really left with no other option given MS' failure.

The suspension for audit failure opens the door to exploits.
This suspension will eventually be removed, once all kinds of unknown errors have been sorted out.
For example, "I/O device error" errors are treated the same way as "file not found" - they affect the audit score immediately.

Some errors (like a disconnected drive) are already mitigated by a storagenode shutdown.
So, sooner or later there will be no "unknown" errors and no suspension for failed audits.

The suspension could remain if we could make it expensive enough for the affected node that abuse becomes pointless. But in that case it would not be better than disqualification.

What I mean:
If the node is failing audits because of timeouts (very easy to exploit, for sure), we could put it into suspension (no egress, no ingress, only audits) and start decreasing the held amount every hour (over no longer than 24 hours; any longer and the abuse would still be profitable). As soon as it becomes zero, the reputation is reset, the node starts over with a 75% held-back percentage, it is switched back to vetting, and its data is considered lost, with all the consequences - if a repair job starts, the data on that node will slowly be removed by the garbage collector. At the end of the week, the node will be disqualified.
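Roughly what I have in mind for the hourly job, as a sketch (the numbers and names are made up for illustration, this is not actual satellite code):

```go
// Sketch of the proposed suspension + held amount decay, illustration only.
package main

import (
	"fmt"
	"time"
)

type NodeState struct {
	HeldAmount      float64 // currently held back, in USD
	Suspended       bool
	SuspendedAt     time.Time
	ReputationReset bool
	Disqualified    bool
}

const (
	decayWindow = 24 * time.Hour     // held amount reaches zero within a day
	finalDQ     = 7 * 24 * time.Hour // disqualified a week after suspension
)

// tick would run hourly on the satellite in this hypothetical scheme.
func tick(n *NodeState, now time.Time, initialHeld float64) {
	if !n.Suspended || n.Disqualified {
		return
	}
	elapsed := now.Sub(n.SuspendedAt)

	// Linear decay of the held amount over the decay window.
	remaining := 1 - float64(elapsed)/float64(decayWindow)
	if remaining < 0 {
		remaining = 0
	}
	n.HeldAmount = initialHeld * remaining

	if n.HeldAmount == 0 && !n.ReputationReset {
		// Back to vetting, 75% held back, pieces handed over to
		// repair / garbage collection (not modeled here).
		n.ReputationReset = true
	}
	if elapsed >= finalDQ {
		n.Disqualified = true
	}
}

func main() {
	n := &NodeState{HeldAmount: 100, Suspended: true, SuspendedAt: time.Now()}
	tick(n, time.Now().Add(12*time.Hour), 100)
	fmt.Printf("after 12h: held=%.2f reset=%v disqualified=%v\n", n.HeldAmount, n.ReputationReset, n.Disqualified)
}
```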

2 Likes

First-year costs of operation basically even out the earnings of the 2nd year, so the 3rd year is when the optimal returns really start; it's not rocket science.
And it's not an exact thing either, just a sort of rule of thumb.

Ask around - the last I heard, a single solo node on an IP took 9 months to vet.
And that's along the lines of what I'm seeing on mine as well, so I cite myself :smiley:

Of course, since vetting speed is tied to ingress, it can vary depending on when people started, but our monthly average ingress has been pretty stable for a long time.

So, the way I understand it, I should write a script that automatically shuts down my node in case any audit score goes below, say, 0.9. Only then would I have enough time to actually figure out what is happening and fix the problem, because if I shut the node down, I have about 30 days instead of a few hours.
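Something along these lines (a rough sketch in Go; I'm assuming the local dashboard API on port 14002 exposes per-satellite audit scores under /api/sno/satellites - the field names and the docker command are guesses that would need adjusting to your setup and node version):

```go
// Quick-and-dirty watchdog: stop the node if any audit score drops below a threshold.
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"os/exec"
	"time"
)

type satellitesResponse struct {
	Audits []struct {
		SatelliteName string  `json:"satelliteName"`
		AuditScore    float64 `json:"auditScore"`
	} `json:"audits"`
}

func main() {
	const threshold = 0.9
	for {
		resp, err := http.Get("http://localhost:14002/api/sno/satellites")
		if err != nil {
			log.Printf("dashboard unreachable: %v", err)
		} else {
			var data satellitesResponse
			if err := json.NewDecoder(resp.Body).Decode(&data); err != nil {
				log.Printf("decode failed: %v", err)
			}
			resp.Body.Close()
			for _, a := range data.Audits {
				if a.AuditScore < threshold {
					log.Printf("audit score %.3f on %s below %.2f, stopping node",
						a.AuditScore, a.SatelliteName, threshold)
					// Adjust to however you run the node (docker, systemd, ...).
					if out, err := exec.Command("docker", "stop", "storagenode").CombinedOutput(); err != nil {
						log.Printf("stop failed: %v (%s)", err, out)
					}
					return
				}
			}
		}
		time.Sleep(5 * time.Minute)
	}
}
```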

This should be added to the recommendations, since it is very important and, I guess, changing the rules would allow people to exploit them.

I guess another way would be for me to run my own mini-satellite, figure out a way to add it to the node and then use that mini-satellite to check if the node is working correctly (and shut the node down if I detect a problem).

2 Likes

After 10 days of downtime you might get suspended. Then there is another 7-day grace period to allow you to fix your downtime issue, and in the following 30 days you need at least 20 days of uptime. So the point of no recovery would be around the 27th day (10 days down, plus 7 days of grace, plus at most another 10 days offline in the review window). If you pass that, you will not be able to avoid the downtime disqualification even if you tried to be 100% online from there on.

And as a side note: if everyone does this, I would expect that we simply reduce these numbers to force you to fix it earlier.

2 Likes

In the meantime, in less than 4 hours I lost my 2-year-old node, which was giving me about $30/month. I started a new identity, but after 2 days of running I haven't received any audit message yet. And I have to wait for 100 audits before being vetted and starting to receive more traffic.

To be honest I'm thinking of leaving the network, and it's a pity, because I can't believe that in about 3 years of development Storj hasn't found a way to send an email to operators in case of a critical audit failure.

3 Likes

At least I get 10 days.

Here’s the problem - what would be your recommendation for this:

  1. Various software problems (like what happened to the OP) are pretty much unavoidable - at least without the cooperation of the node software (running it in a cluster). There is nothing I can do right now to make 100% sure such a problem will not happen to me.
  2. There are times when I am away from my server, times when I am completely unable to connect to it (for example, I am driving somewhere), and times when I am able to connect but do not have any time to diagnose the problem (just reboot and hope it helps).

Given that the two conditions are unavoidable (unless you really say that only datacenters with many employees should run nodes), what should I do to avoid getting my node disqualified even if it did not lose any data?

So far, writing a script that just stops the node in case of a problem seems to be the only solution. Then, I have at least a couple of days to connect to my server and figure out how to fix it.

4 Likes

What about exposing a health check API? When called externally, the node would perform something like a pre-flight checklist to figure out if everything is fine and the node is fully operational, e.g. by also requesting a control audit from the satellites, obviously with a rate limit.

It would be up to Storj to implement this API and make sure that if the API returns "OK" it really means it's all OK, and it would be up to the SNOs to call and monitor the API endpoint periodically (e.g. every hour or whatever interval Storj decides) and take the appropriate actions.
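On the node side I imagine something very roughly like this (purely illustrative; the check, path and port are made up, and the real pre-flight list would be up to Storj):

```go
// Illustration of a possible node-side health endpoint, not real storagenode code.
package main

import (
	"log"
	"net/http"
	"os"
	"path/filepath"
)

// healthy runs a minimal pre-flight check: can we still write to the storage
// directory? A real implementation would also request a rate-limited control
// audit from the satellites, as suggested above.
func healthy(storageDir string) bool {
	probe := filepath.Join(storageDir, ".health-probe")
	if err := os.WriteFile(probe, []byte("ok"), 0o600); err != nil {
		return false
	}
	defer os.Remove(probe)
	return true
}

func main() {
	storageDir := "/app/config/storage" // wherever the node's data lives
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		if healthy(storageDir) {
			w.Write([]byte("OK"))
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
		w.Write([]byte("NOT OK"))
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```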

1 Like

I know what I can do to avoid that. I would take a look at the storage node code and figure out why the storage node is not detecting the issue and going into a crash loop as intended. I have even gone so far as to plan to write a few more unit tests that just throw different errors into the corresponding function. If I can prove with a unit test that the storage node ignores certain errors, it should be easy to add a few more commits to make that unit test pass.

Do I need to write that unit test and the fix myself? I don't think so. Anyone with a bit of development skill is welcome to do this. I am happy to point you to the corresponding code lines and also lure in a few people from Storj to provide additional information. I would love to enable the community to work on smaller issues themselves.
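To give an idea of what I mean, a standalone, table-driven sketch (shouldTriggerShutdown is only a placeholder for the real error-handling function in the storagenode; with this placeholder the last two cases fail, which is exactly the kind of proof I'm talking about):

```go
// Standalone sketch of the kind of test I mean; not actual storagenode code.
package example

import (
	"errors"
	"os"
	"syscall"
	"testing"
)

// shouldTriggerShutdown stands in for the real classification logic that
// decides whether an error should crash the node instead of failing audits.
func shouldTriggerShutdown(err error) bool {
	return errors.Is(err, syscall.EIO) // today: only some errors are caught
}

func TestDiskErrorsTriggerShutdown(t *testing.T) {
	cases := []struct {
		name string
		err  error
		want bool
	}{
		{"i/o device error", syscall.EIO, true},
		{"file not found", os.ErrNotExist, false}, // genuine data loss, should fail the audit
		{"permission denied", os.ErrPermission, true},
		{"custom timeout-ish error", errors.New("operation timed out"), true},
	}
	for _, tc := range cases {
		if got := shouldTriggerShutdown(tc.err); got != tc.want {
			t.Errorf("%s: got %v, want %v", tc.name, got, tc.want)
		}
	}
}
```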

1 Like

I really doubt you would want my “Go” code in your nice software. While I may be able to hack something together for myself (like I did with v2 at one point), I don’t think I should subject other people to my “mods”.

My problems with disqualification are mostly about the time it takes for the node to get disqualified (or, in other words, the required reaction time from the operator). It should not be so short that people who do not have employees lack the time to fix a problem that did not result in lost data (obviously, if the files were deleted or the drive died in a setup without RAID, there's nothing an operator can do, but a USB cable falling out or the OS freezing does not mean the data is lost).

4 Likes

It sure isn't, because that rocket would not get off the ground. You're making lots of assumptions about the setup, and since the recommended setup is to use what you have and what would already be online anyway, the cost of operation should be close to 0. In which case statements like this don't make any sense. Either provide an example setup or stay away from generalized statements altogether. It's not helpful.

I ask around constantly, as this information is important to keep the earnings estimator as close to reality as possible. I've never heard of anything over about half a year, and even then that's only for the last satellite. And the only reason for that is that that satellite is basically inactive anyway, so it would have pretty much no impact on your node's income. You also again provide no actual numbers and skip describing the setup. Maybe you have several nodes in vetting, so they share the same vetting ingress. I can't incorporate such reports in earnings estimator feedback because they are not reliable. Luckily there is enough reliable info out there that I can still come close to a realistic estimation.

Well, to be fair, if this becomes a serious enough issue I would expect Storj Labs to pick it up. It would of course be awesome if someone well versed enough in Go were able to do this, but for serious problems you can't rely on that. But we can't dictate your backlog and priorities. All we can do is point out a problem that has popped up a little more frequently than we'd like to see on the forums. I'm personally still not certain whether this should be picked up with priority, but I'm monitoring it, as it seems to have happened a few too many times recently, including to some long-time reliable SNOs. I at least think that the devs should be aware of it and there should be a spike on the backlog to investigate. But at this point I can imagine it's not yet a priority to pick up now.

I would still take your PR. Someone else could clean up your code to the point that we can merge it.

2 Likes

Your PRs are always welcome. You will have people to review them, provide the needed assistance, and suggest a few changes to polish them until we can merge.

3 Likes

Hello,

I think it might be good to implement a function that stops a node automatically when it presents errors (failed uploads) in order to avoid the node being suspended on the satellites.

This is not the first time this has happened to me. The day before yesterday I was copying files to the hard drive my Storj node is on, and that increased the "IO Delay" significantly. The node did not report problems while copying files at night, but the next morning I saw in my node panel that I am suspended on satellites (probably from taking a long time to serve files, which caused upload failures).


The ideal thing would be to implement a function that, when it detects that 10% (for example) of the requests are failing, stops the node so that the operator can see what is happening before ending up completely suspended on the satellites.
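Something roughly like this, just to sketch the guard I mean (the window size and threshold are arbitrary, and a real version would of course trigger a clean shutdown of the node instead of exiting):

```go
// Sketch of a failure-rate guard: bail out once recent uploads fail too often.
package main

import (
	"log"
	"os"
)

type failureGuard struct {
	results   []bool // true = failed, sliding window of recent requests
	window    int
	threshold float64
}

func (g *failureGuard) record(failed bool) {
	g.results = append(g.results, failed)
	if len(g.results) > g.window {
		g.results = g.results[1:]
	}
	if len(g.results) < g.window {
		return // not enough data yet
	}
	failures := 0
	for _, f := range g.results {
		if f {
			failures++
		}
	}
	if rate := float64(failures) / float64(len(g.results)); rate >= g.threshold {
		log.Printf("failure rate %.0f%% over last %d requests, stopping", rate*100, g.window)
		os.Exit(1) // a real version would shut the node down cleanly
	}
}

func main() {
	g := &failureGuard{window: 100, threshold: 0.10}
	for i := 0; i < 95; i++ {
		g.record(false)
	}
	for i := 0; i < 10; i++ {
		g.record(true) // simulated failed uploads push the rate past 10%
	}
}
```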

I must say that since I was suspended I have not yet received the typical email from Storj alerting you that you have been suspended on satellite X. I also think that is another point to improve: email delivery should be prioritized so that those emails go out to operators first.

Thank you

please merge with this post, sorry! :slight_smile:

1 Like

[…]

It feels to me like what you’re asking for is exactly what suspension is.

The real issue is not suspension, but disqualification. To avoid disqualifying nodes that behave in a wrong/unknown way (but still respond correctly to audits), they get suspended first so SNOs have time to investigate what’s going wrong.
You’re suggesting to stop nodes, and that’s kinda what happens when they’re suspended: they’re paused (no ingress) until they get fixed (or disqualified for good if they do not get fixed in time).

[EDIT: The above is confusing, see @BrightSilence’s added details below for a clearer explanation]

Is that not covering what you have in mind?

I agree, notification e-mails could use some improvements.


My understanding is that this topic is about avoiding the (very fast) disqualification that strikes without any suspension time window, so maybe we could suspend nodes that are about to get disqualified, but there's debate about that idea.

3 Likes

I'm not sure what you're getting at here. But there is no order of first being suspended and then disqualified. It just depends on the type of error. If you fail an audit with a missing file, by responding with corrupt data, or by repeatedly timing out, it will hit your audit score, and if that drops low enough, you get disqualified.

If there is a different kind of error that isn't one of those really bad ones, it hits your suspension score instead, and your node gets suspended if that drops low enough; you can recover from that.

If you're offline, the online score drops and you get suspended. You can recover as long as you're not offline for 30 consecutive days.

They are separate systems.
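
To make the distinction explicit, a simplified sketch of how I understand it (not actual satellite logic):

```go
// Simplified mapping of audit outcomes to the three separate scores.
package main

import "fmt"

type auditOutcome int

const (
	outcomeMissingFile auditOutcome = iota
	outcomeCorruptData
	outcomeRepeatedTimeout
	outcomeUnknownError
	outcomeOffline
	outcomeSuccess
)

func affectedScore(o auditOutcome) string {
	switch o {
	case outcomeMissingFile, outcomeCorruptData, outcomeRepeatedTimeout:
		return "audit score -> disqualification if it drops too low"
	case outcomeUnknownError:
		return "suspension score -> suspension if it drops too low (recoverable)"
	case outcomeOffline:
		return "online score -> suspension (recoverable unless offline 30 consecutive days)"
	default:
		return "success -> no score hit"
	}
}

func main() {
	for _, o := range []auditOutcome{outcomeMissingFile, outcomeUnknownError, outcomeOffline} {
		fmt.Println(affectedScore(o))
	}
}
```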

3 Likes