Put the node into an offline/suspended state when audits are failing

@BrightSilence Your explanation is way clearer than mine, as always :+1:.
I do agree with what you’re saying, so I’m not sure where my post is wrong, although I admit it is badly put together :thinking:

True, that’s why I ended my post with a suggestion that maybe we should do that.
But hey, maybe it’s a stupid idea :slight_smile:

I may have misinterpreted this line then.

Which is very well possible, as I was confused by its self-contradictory nature. Nodes that behave in a wrong or unknown way are detected through incorrect audit (or repair) responses. And the “they get suspended first” suggested an order of sorts.

After your response, I’m guessing your intention was to say that nodes that fail with unknown errors, rather than critical audit failures, get suspended. And the “first” was referring to the fact that if the node never recovers, it eventually gets disqualified as well.

I agree on that part; I’ve suggested something similar elsewhere. The idea was to very quickly suspend the node to protect data and start repair, and then be slightly more lenient with permanent DQ by allowing the node more chances to respond to the same audit. I’ll try to find the link.

Edit: Got some of the details wrong, it’s been a while, but here it is: Tuning audit scoring - #52 by BrightSilence
Please note that the suggestion is in the context of that topic, which also had suggestions to stabilize the scores. The numbers in that post assume those suggestions are picked up in addition to the suggested suspension/disqualification change.
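
For anyone who prefers code over prose, here’s roughly what that two-stage policy could look like. This is only a sketch: the struct, the thresholds and the 30-day grace period are my own illustrative choices, not anything from Storj’s codebase.

```go
// Illustrative sketch only: suspend quickly so repair can protect the data,
// but only disqualify after a long grace period. All names and thresholds
// here are hypothetical, not Storj's actual implementation.
package reputation

import "time"

type Status int

const (
	StatusHealthy Status = iota
	StatusSuspended
	StatusDisqualified
)

type Node struct {
	AuditScore     float64    // e.g. alpha / (alpha + beta)
	SuspendedSince *time.Time // nil while not suspended
}

const (
	suspendThreshold = 0.95                // trip early so repair starts quickly
	gracePeriod      = 30 * 24 * time.Hour // up to a month to recover
)

// Evaluate is called after each audit window and decides the node's status.
func Evaluate(n *Node, now time.Time) Status {
	switch {
	case n.AuditScore >= suspendThreshold:
		n.SuspendedSince = nil // recovered, suspension lifted
		return StatusHealthy
	case n.SuspendedSince == nil:
		t := now
		n.SuspendedSince = &t // suspend immediately, pieces become repair candidates
		return StatusSuspended
	case now.Sub(*n.SuspendedSince) > gracePeriod:
		return StatusDisqualified // never recovered within the grace period
	default:
		return StatusSuspended
	}
}
```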

1 Like

There should not be a reaction time requirement of a few hours. The node operator should be able to leave the node completely unattended for at least two days and still be able to recover it if the failure is not permanent (dead hard drive etc).

1 Like

Two days?
7 days at least. Better 30 days. I mean, people have a life, they have family. They go on business trips or leisure vacations. They also get sick, have emergencies, or run into technical failures like internet/power outages or whatever…

1 Like

I agree. This is a bit of a balancing act of course, because you want to protect data asap, but you don’t want to immediately impose irreversible penalties. This is why the suggestion I linked proposes suspending very quickly to protect data and start repair, while allowing up to a month to resolve the issue and recover from that suspension. While suspended you will of course miss out on ingress and lose data to repair, but this isn’t nearly as bad as losing the entire node, and it serves as an incentive to fix things as fast as possible. That seems totally fair to me.

As for the data protection side, that suggestion also ensures that if you did lose data or mess something up in an irreversible way, you will never be able to escape that suspension and will eventually be disqualified. For data protection it doesn’t matter much whether a node is disqualified or suspended, as the protection measures are the same for nodes in either state.

So this is very possible to do in my opinion. We had some good back and forth in that topic, but it seems to have been deprioritized. I get that, it’s not really necessary atm for (better) functioning of the product or protection of the data. But I still hope something like it will at some point be implemented.

1 Like

Exactly. The issue is about not losing a valuable node entirely because of some recoverable mistake. Maybe the duration could depend on node age: for every month of node age you get one additional day of recovery time. Something like that…

This seems fair, as long as it is possible to not have the node disqualified. With the rules as they are right now, if I wanted to go on vacation for a few days somewhere with a bad internet connection or no internet at all, I would have to rent space in a datacenter for my node and have someone else manage it. This is true even though I have two internet connections and a large UPS (and will soon have a generator), because if the node manages to freeze in a weird way (the IO subsystem freezes or something), it can get disqualified in a few hours, probably faster than I would be able to connect and reboot or shut down the node.

I think some of this comes down to it being more about notification than anything else. A node should probably be aware of when it is experiencing severe error conditions and notify the operator via email when possible. Ideally this should be baked into the node software itself.

In addition, if a node is failing audits, once some threshold is crossed an email should be sent informing the node operator, so they can investigate in a timely manner.
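
Something along these lines; the counter and SMTP details are just placeholders to show how small the check-and-notify logic would be, not how the storagenode actually works today.

```go
// Hypothetical sketch of an operator alert once audit failures cross a
// threshold. The counter and SMTP details are assumptions; the storagenode
// does not currently work this way.
package alerts

import (
	"fmt"
	"net/smtp"
	"sync"
)

type AuditAlerter struct {
	mu        sync.Mutex
	failures  int
	threshold int  // e.g. 3 consecutive failures
	notified  bool // avoid mailing on every further failure
	smtpAddr  string
	auth      smtp.Auth
	from, to  string
}

// RecordAudit is called after every audit result and emails the operator
// once the number of consecutive failures reaches the threshold.
func (a *AuditAlerter) RecordAudit(ok bool) error {
	a.mu.Lock()
	defer a.mu.Unlock()

	if ok {
		a.failures = 0
		a.notified = false
		return nil
	}
	a.failures++
	if a.failures < a.threshold || a.notified {
		return nil
	}
	a.notified = true
	msg := []byte(fmt.Sprintf(
		"To: %s\r\nSubject: storagenode audit failures\r\n\r\n"+
			"%d consecutive audit failures - please investigate.\r\n",
		a.to, a.failures))
	return smtp.SendMail(a.smtpAddr, a.auth, a.from, []string{a.to}, msg)
}
```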

I think as node capacities increase, Storj will have a greater interest in helping keep nodes’ data availability high, so as to reduce the toil on the network of rebuilding missing data. In the short term though, the efforts continue to focus on improving the overall network features, performance, and stability, so I think some of this polish will come at a later time.

Better notifications will certainly help, but they are not a full solution. On larger nodes audits happen quite frequently, and even more so now that repair traffic also counts toward audits. Fail 10 consecutive audits and that node is gone forever. This can easily happen before you have time to respond to a notification.
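
To make the “gone forever” part concrete, here’s a rough simulation of a beta-reputation style audit score under consecutive failures. The lambda, weight and disqualification threshold are illustrative (the values used in production have changed over time), but with these numbers the score drops below the threshold after roughly 10 straight failures.

```go
// Rough simulation of a beta-reputation style audit score under consecutive
// failures: score = alpha / (alpha + beta) with a forgetting factor lambda.
// The constants are illustrative; production values have changed over time.
package main

import "fmt"

func main() {
	const (
		lambda      = 0.95
		weight      = 1.0
		dqThreshold = 0.6 // disqualify below this score (illustrative)
	)

	// Start from a long-lived node with a perfect history.
	alpha, beta := 1/(1-lambda), 0.0

	for i := 1; i <= 15; i++ {
		v := -1.0 // -1 = failed audit, +1 = successful audit
		alpha = lambda*alpha + weight*(1+v)/2
		beta = lambda*beta + weight*(1-v)/2
		score := alpha / (alpha + beta)
		fmt.Printf("failure %2d: score %.3f\n", i, score)
		if score < dqThreshold {
			fmt.Println("disqualified")
			break
		}
	}
}
```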

I’m gonna disagree with you here. It’s entirely possible that the node software itself is hanging and not responding correctly. It would then also not be able to notify the operator, which pretty much defeats the purpose. The best approach is to have something external to the node’s network, so it can check that the node is externally accessible and able to respond to requests for pieces. The satellite could do this, but perhaps an externally hosted audit service could also do the trick. Maybe even something built into the multinode dashboard, which can already be hosted remotely, in the cloud for example.
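
Even something as simple as this sketch, run from a machine outside your own network, would already catch a fully hung node. The address is obviously a placeholder, and a real check should also request a piece, which is what the satellite’s audits effectively do.

```go
// Minimal sketch of an external reachability probe, meant to run from
// outside the node's own network (a cheap VPS, the machine hosting the
// multinode dashboard, ...). It only checks that the node's public port
// accepts TCP connections; a real check should also request a piece.
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

func main() {
	addr := "mynode.example.com:28967" // placeholder node address

	conn, err := net.DialTimeout("tcp", addr, 10*time.Second)
	if err != nil {
		fmt.Fprintf(os.Stderr, "node unreachable: %v\n", err)
		os.Exit(1) // wire this exit code into whatever alerting you use
	}
	conn.Close()
	fmt.Println("node port reachable")
}
```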

The largest nodes already store 22TB+. And at the moment they are actually at the highest risk of super quick disqualification, because both the number of audits and the amount of repair traffic scale with the data stored. That’s kind of the opposite of what you want: the nodes that would hurt the network most to lose are also the ones most likely to fail this way. And that’s not good for the node operator, nor for the network as a whole.

That’s totally understandable. And I think there have been great improvements from that end already. It’s clear that the teams are doing great work. As long as this part isn’t forgotten completely. I think it will have to be tackled at some point.

4 Likes

A bit of a side topic, but this is something I still don’t get: why does data that got repaired while a node was offline get removed from the “faulty” node when it comes back online? Leaving the data on that node would mean extra pieces for the network, which in turn means less repair in the future, statistically. It feels to me like removing those pieces from the node that was offline for a bit is shooting ourselves in the foot…

Yeah, this point, especially since a timeout counts as an audit failure, hurts the most. A large node puts some small amount of I/O pressure on the drive: each upload and download triggers writes to the orders and bandwidth databases, AFAIK usually delayed (the orders database being a plain flat file, the bandwidth database running in WAL mode). Plus, obviously, each upload triggers an fsync on the stored file. If the node then starts the file walker process, and at the same time some unrelated tasks also start doing a lot of I/O on the same drive, especially requesting syncs that also force writes to the orders/bandwidth databases, we may accidentally run into timeouts.

I may be wrong, but I suspect handling this kind of problem by disqualifying a 3-year-old node with 20TB of data might not be the right choice. Instead, it would be more polite to ask the SNO to tune down the unrelated tasks.

After the repair the network already has the full set of 80 pieces (or whatever the current number is). 80 is already supposed to be a safe number, and the extra pieces would cost the network.

1 Like

A little extra storage cost for those segments that dropped under the threshold and thus got repaired, but on the other hand many other segments cost less in the meantime because their piece count sits between the minimum threshold and 80. I dunno, I’m only partially convinced.

I get your point though, and I see how this would statistically cost a little more to Storj I guess. I’m wondering if it’s significant… Thanks for this clarification :+1:.

1 Like

Well, there are a few extra things to consider, and some of it depends on how repair works. Initially pieces 1 through 110 are created and 80 of those get stored on nodes. All these pieces are unique, which is important: the way they are created means any piece can be lost and, as long as enough other pieces are still there, the data can be recovered. Repair has always been described as recreating the lost pieces. This suggests the same pieces are recreated. If that is the case, storing the exact same piece again doesn’t provide any additional protection, as it only protects against the loss of that specific piece on the other node, not of any other piece.

Now, in theory I don’t think there is anything stopping them from creating pieces numbered 111+ during repair. This would just create new unique pieces that provide the full protection that any other piece does. Should the original piece reappear in that scenario, it would still be viable and could count toward piece availability again. It might already be implemented like that, I’m not sure.
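
To illustrate the “same piece” point with a toy example: the sketch below uses github.com/klauspost/reedsolomon purely for illustration (it is not the library Storj uses) and far smaller parameters than the real scheme, but it shows that a shard rebuilt at the same index is byte-for-byte identical to the lost one, so re-storing that copy on a returning node adds nothing beyond that single index.

```go
// Toy Reed-Solomon example using github.com/klauspost/reedsolomon purely
// for illustration (not the library Storj uses) and far smaller parameters
// than the real scheme. A shard reconstructed at the same index is
// byte-for-byte identical to the lost one.
package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	const dataShards, parityShards = 4, 8 // illustrative only

	enc, err := reedsolomon.New(dataShards, parityShards)
	if err != nil {
		log.Fatal(err)
	}

	segment := bytes.Repeat([]byte("some segment data "), 64)
	shards, err := enc.Split(segment)
	if err != nil {
		log.Fatal(err)
	}
	if err := enc.Encode(shards); err != nil {
		log.Fatal(err)
	}

	// Simulate the node holding shard 3 going offline.
	lost := append([]byte(nil), shards[3]...)
	shards[3] = nil

	// "Repair": rebuild the missing shard from the surviving ones.
	if err := enc.Reconstruct(shards); err != nil {
		log.Fatal(err)
	}

	fmt.Println("rebuilt shard identical to the lost one:",
		bytes.Equal(lost, shards[3]))
}
```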

But there is also the other side… Maybe it’s not such a bad thing to have an incentive for SNOs to limit issues, or to pay some price for them. The node was unreliable for a bit, so maybe it’s also not a bad idea to lower the amount of data stored a bit as a result. As long as the node can recover, I think those are very fair penalties to impose. You won’t really lose that much data to begin with.

I also wonder how many segments need to be repaired more than once. Basically every time a segment is repaired, it keeps the pieces on nodes that have been reliable for considerable time and ditches those that weren’t. So in theory the chances of the segment needing repair again will be much lower. It may not be all that valuable to keep more pieces sticking around if a second repair is unlikely to begin with.

2 Likes

Oh. Because the pieces being repaired are the ones that are currently unreachable (because the node is offline), I assumed they were fully “regenerated” (as in, new unique pieces). You raise a very valid point here; I never thought repaired pieces could be the exact same ones, in which case it would actually be completely useless to keep them on the offline node. You’re totally right.

Kind of. Stay offline for 2 days and you can lose weeks’ worth of data. But I get your point about the incentive resulting from this.


All that makes me curious, I’d love to hear some more technical details from some Storjlings to better understand if losing data when offline is a technical necessity, a side effect, an incentive or if it’s there for any other reason :slight_smile:

As long as the “repaired” piece gets deleted from the recovered node, I would be OK with that. The main problem currently is the required reaction time.

1 Like

It won’t be immediate, but garbage collection will catch it. So that’s fine.

But I agree, the biggest deal is the reaction time, especially on larger nodes. I haven’t seen anyone lose a 20+TB node yet, but that would really suck. It’s why I’ve started running more nodes now, though of the 22TB I have, almost 18TB is still on the largest node. It would really suck to lose that one.

4 Likes

Especially if the cause of the loss is something I would perceive as “unfair”. If my zfs pool blows up and I lose data, well, I only have myself to blame. OTOH, if my node gets disqualified because the IO subsystem froze and I did not notice it for two hours (or noticed it but was unable to fix it because I did not have an internet connection at the time), oh yeah, I’d be really unhappy with that.

3 Likes

It happened again: I just lost a node that I created in November 2020, with more than 4TB of data. :sweat:

I was moving the data from one disk to another and the permissions changed. When I turned the node back on, it thought the files did not exist because it could not see them, and it started giving audit errors.

Soon my node was completely disqualified. The node did not stop despite having so many errors, and I never received an email that something was wrong.

I still have the data and it is not damaged. It was just a permissions problem that the node did not know how to deal with by shutting itself down.
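
In hindsight, a simple preflight check like this hypothetical one (run as the same user the storagenode runs as, with the path adjusted to your layout) would have caught the problem before the node started answering audits.

```go
// Hypothetical preflight check to run after moving data between disks,
// as the same user the storagenode runs as: it walks the storage directory
// and reports anything that cannot be opened, which is exactly how a
// permissions mix-up looks to the node - like the pieces are missing.
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

func main() {
	root := "/mnt/storagenode/storage/blobs" // adjust to your own layout

	unreadable := 0
	walkErr := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			unreadable++
			fmt.Println("cannot access:", path, "-", err)
			return nil // keep walking
		}
		if d.IsDir() {
			return nil
		}
		f, openErr := os.Open(path)
		if openErr != nil {
			unreadable++
			fmt.Println("cannot open:", path, "-", openErr)
			return nil
		}
		f.Close()
		return nil
	})
	if walkErr != nil {
		fmt.Fprintln(os.Stderr, "walk failed:", walkErr)
		os.Exit(1)
	}
	fmt.Printf("done, %d unreadable entries\n", unreadable)
}
```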

I think it is important to solve this because it is not the first time it has happened to me and to other people.

Does an old node of almost 2 years get the same number of allowed errors as a newly created node? It seems to me that what counts is not the real percentage of files lost, but the number of audit errors that have occurred?

1 Like

I don’t know whether I now just have to delete the node completely along with its files, or what else to do with the data. The disqualified node will not come back online, so the data will just sit there taking up space.

This is also related to this post.