Put the node into an offline/suspended state when audits are failing

Sure, as long as there are no false positives. However, IMO it would be difficult to make sure the node has actually lost data. The most likely indicator is file corruption, but even that could happen because non-ECC RAM had a bit flip. “File not found” errors and timeouts can easily be caused by problems that make the data temporarily inaccessible, but not actually lost.

3 Likes

When the node doesn’t reply to an audit with valid data after 5 minutes, what if the satellite interpreted that as the node being offline instead of as a failed audit? Would that solve the problem, since the only way to fail an audit would then be to actually reply with invalid data or a “file not found” error?

A node that lost too much data would still get disqualified fast, but nodes that can’t keep up for some reason would be considered offline until they either recover from whatever was overloading them, or drop below a 60% online score, at which point they would get suspended and the SNO (hopefully) would be notified that something is going wrong?

The only downside I see here is that someone could tweak the node software so it does not reply when the satellite asks for data it’s not holding anymore. But that would simply delay its inevitable disqualification by a few days… right? :thinking:
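A minimal sketch of the rule being proposed here (this is not how the satellite behaves today; the names and the 5-minute window are just taken from the post):

```python
from dataclasses import dataclass

@dataclass
class AuditResult:
    answered_within_window: bool   # replied within the 5-minute audit window
    piece_found: bool
    data_valid: bool

def classify(result: AuditResult) -> str:
    if not result.answered_within_window:
        return "OFFLINE"          # would only hurt the online score (suspension below 60%)
    if not result.piece_found or not result.data_valid:
        return "AUDIT_FAILURE"    # only these would hurt the audit score
    return "SUCCESS"

print(classify(AuditResult(answered_within_window=False, piece_found=True, data_valid=True)))  # OFFLINE
```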

I did some math on it a while back; it takes about 2 to 3 years until a storagenode really starts earning optimal amounts.

It takes close to a year to vet these days, and the recommendation is to use a single HDD, which isn’t really reliable enough for many nodes to even reach those optimal earnings.

And after nearly a year to vet, just a bad SATA cable causing issues means the node is DQ’d in like 4 hours… it seems very unbalanced… people should have a chance to fix issues.

I don’t see why this has to be so controversial; it’s about as clear an issue as a bullet to the head.

1 Like

That sounds a lot like you did some opinion, not math. A general statement like this makes no sense, as it doesn’t mention upfront costs or recurring costs, or specify what “optimal amounts” are. Honestly, I have no clue how to interpret this.

[Citation needed]

That part isn’t controversial. Everyone agrees on that. We’re just debating different ways to allow it and weighing other aspects, like opening up the system to abuse. I haven’t seen anyone suggest that people should not be allowed time to fix temporary issues.

This year, my node (and probably every other vetted node) received, on average, 425GB/month ingress. Assuming no files are deleted, this works out to 5.1TB/year. So, it would take ~3.6 years for my node to get back to storing 18TB of data.

Due to transaction costs, there were times when my node did not make enough to get a payment every month. It usually makes $50-60. About half of this is for storage ($1.50/TB/month × 18TB = $27) and the rest is for egress (it looks like roughly equal amounts of repair egress and download egress).

I do not know whether newer or older files are accessed more for egress, but I would think egress also depends on how much data the node has (a node with 9TB of data probably gets about half of the egress my node gets). I would say a node earns ~$3.30 per TB stored per month.

So, a year after the new node is vetted, it would have about 5TB and up to that point would have earned $76, starting from zero and earning $14 in the 12th month. I do not know what transaction costs are now, but the node would probably only get one or two payments in that time.

After 3 years, it would have ~15TB and earn ~$50/month, which is approaching the maximum (because files are deleted, the node would not grow by 425GB every month forever; growth would slow down).
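For anyone who wants to play with these numbers, here is a minimal sketch of the same arithmetic, using only the figures from this post (observed values for one node, not official rates) and ignoring deletes, vetting time and held amount:

```python
# Back-of-the-envelope check of the figures above.
INGRESS_TB_PER_MONTH = 0.425           # ~425GB/month average ingress for a vetted node
EARN_USD_PER_TB_STORED_MONTH = 3.3     # ~$3.30 per TB stored per month (storage + egress)
TARGET_TB = 18                         # data held by the disqualified node

years_to_refill = TARGET_TB / (INGRESS_TB_PER_MONTH * 12)
monthly_income_at_target = TARGET_TB * EARN_USD_PER_TB_STORED_MONTH

print(f"{years_to_refill:.1f} years to get back to {TARGET_TB}TB")   # ~3.5 years (the post rounds to ~3.6)
print(f"~${monthly_income_at_target:.0f}/month at {TARGET_TB}TB")    # ~$59/month, i.e. the $50-60 range
```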

2 Likes

I’m well aware of recent network performance and in fact I have bad news for you. It’ll take more than 6 years if you take deletes into account. For reference check the earnings estimator.

It does, but more recent data seems to be downloaded a little more frequently as well. It’s not all that clear-cut. I have some older nodes that have been full for a while and they don’t see a lot of egress.

About $3.80 according to the earnings estimator, but please let me know if any numbers/settings seem off to you.

Vetting ends at different times on different satellites. Some may take a long time, but that’s usually because they also see low activity, so the ones that take long to vet are also the ones that have the lowest impact on your payouts. For the rest of this message I will just post a screenshot of what the first 2 years look like.


If anyone reads this message more than a few months from now, be sure to check the link I posted earlier. Network behavior may have changed.

Yeah, that slowdown starts to have a significant impact in year 3. By the end of it you will actually have closer to 11TB, not 15TB. The max potential (where deletes match ingress) is currently around 22.5TB. The closer you get to that, the more slowly used space will increase.

So yeah, I mostly agree with your numbers, with some refinements. I don’t think that negates the part you quoted though. Rather than making blanket judgements, I much prefer providing information, like you did as well, and leaving the judgement to everyone for themselves.

2 Likes

I was not really trying to get the numbers completely exact and just used the performance of my node for this - I also did not account for held amount. Cool that I got it almost correct.

So, my node currently gets about $60/month and, according to your estimator (default values), it would take 63 months for a new node to get to that level. My node would earn $3780 in that time and a new node would earn $2226. This means that getting disqualified and restarting results in $1554 of lost income. That is a rather harsh punishment for being away from the server for 4 hours (probably even less than that if a USB cable fell out of the recommended setup), and it is also why I recommend using what I consider a proper setup.
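Spelled out, the arithmetic behind that $1554 figure looks like this (numbers taken directly from the post and the estimator defaults):

```python
MONTHS_TO_CATCH_UP = 63       # estimator: time for a new node to reach ~$60/month
OLD_NODE_MONTHLY = 60         # what the existing node earns per month

old_node_income = OLD_NODE_MONTHLY * MONTHS_TO_CATCH_UP   # $3780 if the node had survived
new_node_income = 2226                                    # estimator: new node over the same period
print(old_node_income - new_node_income)                  # $1554 of lost income from restarting
```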

2 Likes

From the network-as-a-whole perspective, that should also trigger repairs at some point. If a node with 15TB goes down, that is potentially quite a lot of data to repair. So there actually is an incentive for Storj to work on a good trade-off, so that in case of a transient problem large nodes don’t disappear in 4 hours.

Also, an important thing here is that the Storj whitepaper math ignores the problem of correlated failures, and that problem might be exacerbated by common transient issues, like the recent bug that resulted in many nodes suddenly failing audits for pieces the satellite mistakenly thought should be there.

Both of these would really benefit from defining exactly what kind of adversary is being considered when disqualifying nodes for failed audits, because, I believe, a different trade-off from the current status quo might be necessary.

1 Like

Yeah we have no argument on that front. And if there is a way that can be prevented without allowing abuse of more lenient rules, I’m all for it. And I think there are some ways that could be done.

Sure, but you didn’t factor in the chance of that happening. For an HDD up to 5 years old, there is about a 2% chance of failure per year. It’s really impossible for me to calculate the chance of this system freeze issue happening, but let’s say there is a 10% chance each year that your node gets lost in a way that would have been preventable with the kinds of setups you are suggesting. That means you would have, at best, a budget of about $155 a year to break even, and less than that to make a profit on that investment. And that’s assuming you already have a node with 18TB of data; for new node operators that number would be much closer to 0.

So maybe at some point that investment makes sense, but for new SNOs starting out, I simply can’t justify large upfront costs. For all they know, Storj could fail and disappear before they ever earn that money back.

I’m still very happy with my “whatever I already had” setup. The only thing I ever bought was more HDDs to add more storage space, and I only did that when I could pay for it with my Storj tokens. That’s 3 nodes running for 2.5 years now, and my nodes perform and earn exactly the same as yours do. So, so far, I’m definitely at a larger net profit (assuming you spent any money on the kind of resilient setup you’re talking about).
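The break-even budget at the start of this post is just an expected-value calculation; a rough sketch, using the post’s own guessed probability:

```python
# Expected-value sketch of the budget argument above.
P_PREVENTABLE_LOSS_PER_YEAR = 0.10     # the post's rough guess, not a measured number
INCOME_LOST_IF_DQ = 1554               # from the earlier calculation (18TB node restarting)

break_even_budget_per_year = P_PREVENTABLE_LOSS_PER_YEAR * INCOME_LOST_IF_DQ
print(f"~${break_even_budget_per_year:.0f}/year")   # ~$155: spend more than this on prevention and, on average, you lose money
```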

Yeah, but in those scenarios they can just disable audits for a bit, or not update the score, or even reinstate nodes that got hit. I’m pretty sure alarm bells go off quite quickly when all nodes start dropping in scores. And even on the forum this would be noticed within hours.

And yes, there is an incentive for Storj to prevent repair, but they can play the numbers game, and on average, with very few nodes running into these issues, I don’t think it’s a major concern for them. The hit is of course felt much harder by the affected SNOs. That doesn’t mean it should get less priority; I think the SNO concern is very valid and this should be taken seriously as a SNO experience issue. I don’t think the repair costs alone would provide sufficient incentive to prioritize this, so it is up to us to point out that this is a concern. I’m sure that’s starting to become clear to them now.

The chance goes up with time. That is, it’s more likely for such a problem to occur within 10 years than within a year.
I would feel stupid if I lost $1500 because I decided to, say, save a few hundred on a hard drive and ran without RAID.

Or that a node survives long enough to get to 18TB and then fails. If a 2-month-old node fails, the operator probably would not be as upset about it as they would about a 3-year-old node failing.

I used “whatever” hardware for v2 and I was changing nodes fairly frequently (due to software problems mainly), but for v2 it did not really matter. For v3 it matters IMO, so, when I first read about the requirements (especially uptime) I went “well, no more playing around, I have to do this properly, the same way I would build someone’s email server”.

Maybe the recommendations should be two-part, kind of like “minimum and recommended” system requirements for games, just called differently, something like “basic” and “expert”. It would be, IMO, a good learning opportunity for newbies on how to build resilient systems (if they want to).

1 Like

You can always choose to invest more and upgrade a setup once it gets to that point.

We have also been tracking wider zero-day exploit problems on the host OS side. In that respect it is a good thing Storj supports multiple platforms, since it is unlikely multiple OSes will be hit at once. It is pretty scary that, more than a month on, MS still has not managed to mitigate PrintNightmare. We have resorted to using third-party patch services like 0patch to try and keep our clients safe. This was not a choice made easily, but we were really left with no other option given MS’ failure.

The suspension for audit failures opens the door to exploits.
This suspension will be removed eventually, once all kinds of unknown errors have been sorted out.
For example, “I/O device error” errors are treated the same way as “file not found”: they affect the audit score immediately.

Some errors (like a disconnected drive) are already mitigated by shutting the storagenode down.
So, sooner or later there would be no “unknown” errors and no suspension for failed audits.

The suspension could remain if we could make it expensive enough for the affected node that abuse becomes pointless. But in that case it would not be better than disqualification.

What I mean:
If the node is failing audits because of timeouts (very easy to exploit, for sure), we could put it into suspension (no egress, no ingress, only audits) and start decreasing the held amount every hour (for no longer than 24 hours; any longer and the abuse would still be profitable). As soon as the held amount reaches zero, the reputation is reset and the node starts over at a 75% held-back percentage, is switched back to vetting, and its data is considered lost, with all the consequences: if a repair job starts, the data on that node will be slowly removed by the garbage collector. At the end of the week, the node will be disqualified.
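A toy sketch of how that penalty schedule could look. All parameters (the 24-hour decay, 75% held back, one-week deadline) come from the proposal above, not from anything the satellite implements today:

```python
from dataclasses import dataclass

@dataclass
class SuspendedNode:
    held_at_suspension: float    # USD held when the suspension started
    hours_suspended: float = 0.0

    @property
    def held_remaining(self) -> float:
        decay_per_hour = self.held_at_suspension / 24          # fully drained within 24 hours
        return max(0.0, self.held_at_suspension - decay_per_hour * self.hours_suspended)

    @property
    def reputation_reset(self) -> bool:
        # held amount gone -> back to vetting, 75% held back, data treated as lost
        return self.held_remaining == 0.0

    @property
    def disqualified(self) -> bool:
        return self.hours_suspended >= 7 * 24                  # end of the week

node = SuspendedNode(held_at_suspension=120.0, hours_suspended=30)
print(node.held_remaining, node.reputation_reset, node.disqualified)   # 0.0 True False
```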

2 Likes

The first year’s costs of operation basically even out against the second year’s earnings, so the third year is when the optimal returns really start; it’s not rocket science.
And it’s not an exact thing either, just a sort of rule of thumb.

Ask around; the last I heard, a single node running solo on an IP took 9 months to vet.
And that’s along the lines of what I’m seeing on mine as well, so I cite myself :smiley:

Of course, since vetting speed follows ingress, it can vary depending on when people started, but our monthly average ingress has been pretty stable for a long time.

So, the way I understand it, I should write a script that automatically shuts down my node in case any audit score goes below, say, 0.9. Only then would I have enough time to actually figure out what is happening and fix the problem, because if I shut the node down, I now have about 30 days instead of a few hours.
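Something along those lines can be hacked together against the node’s local dashboard API; a rough sketch follows. The port, endpoint path and JSON field names here are assumptions based on one node version and may differ on yours, so check your own node’s responses first:

```python
import subprocess
import requests   # third-party: pip install requests

DASHBOARD = "http://localhost:14002"   # local dashboard address (adjust if mapped differently)
THRESHOLD = 0.9
CONTAINER = "storagenode"              # docker container name

def audit_scores():
    """Yield (satellite, audit score) pairs; field names are assumptions, adjust to your version."""
    data = requests.get(f"{DASHBOARD}/api/sno/satellites", timeout=10).json()
    for sat in data.get("audits", []):
        yield sat.get("satelliteName", "unknown"), sat.get("auditScore", 1.0)

def main():
    for name, score in audit_scores():
        if score < THRESHOLD:
            print(f"Audit score {score:.3f} on {name} is below {THRESHOLD}, stopping the node")
            subprocess.run(["docker", "stop", CONTAINER], check=True)
            break

if __name__ == "__main__":
    main()   # run this from cron every few minutes
```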

This should be added to the recommendations, since it is very important and, I guess, changing the rules themselves would allow people to exploit them.

I guess another way would be for me to run my own mini-satellite, figure out a way to add it to the node, and then use that mini-satellite to check whether the node is working correctly (and shut it down if I detect a problem).

2 Likes

After 10 days of downtime you might get suspended. Then there is another 7-day grace period to allow you to fix your downtime issue, and in the following 30 days you need at least 20 days of uptime. So the point of no return would be around the 27th day. If you pass that, you will not be able to avoid the downtime disqualification even if you were 100% online from there on.
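For reference, this is how those numbers add up to roughly day 27, assuming the 30-day review window starts after the grace period as described above:

```python
SUSPENSION_AFTER_DAYS = 10       # offline this long -> suspension
GRACE_PERIOD_DAYS = 7            # time given to fix the downtime issue
REVIEW_WINDOW_DAYS = 30          # must be online at least 20 of these days
MAX_OFFLINE_IN_WINDOW = REVIEW_WINDOW_DAYS - 20

# Beyond roughly this day, even 100% uptime for the rest of the window can no
# longer keep the online time at the required 20 days.
point_of_no_return = SUSPENSION_AFTER_DAYS + GRACE_PERIOD_DAYS + MAX_OFFLINE_IN_WINDOW
print(point_of_no_return)        # 27
```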

And as a side note, if everyone does this I would expect that we simply reduce these numbers to force you to fix it earlier.

2 Likes

In the meantime, in less than 4 hours I lost my 2-year-old node, which was giving me about $30/month. I started a new identity, but in 2 days of running I haven’t received a single audit yet. And I have to wait for 100 audits before being vetted and starting to receive more traffic.

To be honest, I’m thinking of leaving the network, and it’s a pity, because I can’t believe that in about 3 years of development Storj hasn’t found a way to send an email to operators in case of a critical audit failure.

3 Likes

At least I get 10 days.

Here’s the problem. What would be your recommendation for this:

  1. Various software problems (like what happened to the OP) are pretty much unavoidable, at least without the cooperation of the node software (e.g. running it in a cluster). There is nothing I can do right now to make 100% sure such a problem will not happen to me.
  2. There are times when I am away from my server: times when I am completely unable to connect to it (for example, when I am driving somewhere), or when I am able to connect but do not have any time to diagnose the problem (just reboot and hope it helps).

Given that these two conditions are unavoidable (unless you really mean that only datacenters with many employees should run nodes), what should I do to avoid getting my node disqualified even if it did not lose any data?

So far, writing a script that just stops the node in case of a problem seems to be the only solution. Then, I have at least a couple of days to connect to my server and figure out how to fix it.

4 Likes

What about exposing a health check API? When called externally, the node would perform something like a pre-flight checklist to figure out whether everything is fine and the node is fully operational, e.g. by also requesting a control audit from the satellites, obviously with a rate limit.

It would be up to Storj to implement this API and make sure that if it returns “OK” it really means everything is OK, and it would be up to the SNOs to call and monitor the API endpoint periodically (e.g. every hour, or whatever interval Storj decides) and take the appropriate actions.
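On the SNO side that could be as simple as a cron-style poller; a sketch with a completely made-up endpoint and response format, since the API itself is only a proposal:

```python
import time
import requests   # third-party: pip install requests

HEALTH_URL = "http://localhost:14002/api/health"   # hypothetical endpoint, does not exist today
CHECK_INTERVAL_SECONDS = 3600                      # hourly, to stay within the proposed rate limit

def node_is_healthy() -> bool:
    try:
        resp = requests.get(HEALTH_URL, timeout=30)
        return resp.ok and resp.json().get("status") == "OK"   # made-up response shape
    except requests.RequestException:
        return False

while True:
    if not node_is_healthy():
        print("Health check failed: alert the operator, stop the node, etc.")
    time.sleep(CHECK_INTERVAL_SECONDS)
```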

1 Like