Restarted Server And Node Got Suspended?

Still suspended. I figure the us2.storj.io:7777 satellite accounts for about 0.001% of all node traffic, on all nodes. I’m not sure why the node got suspended after failing just 1 audit out of 2, but the dashboard says 48.72%. I believe the code should be improved. You can’t just suspend a node because it fails the second audit since start. This is just not good coding. Make it suspend if it fails 3 out of 5 or something. 1 out of 2 is bad. Logs…

root@server:/tmp# journalctl -u node | grep GET_AUDIT

Jan 21 11:28:54 server storagenode[59218]: 2022-01-21T11:28:54.244Z INFO piecestore download started {"Piece ID": "QUDRCIVXYTNFZD2JZRGOEXPKCBUEGVSWVWQNQDOFQBO3ZKZB7R5A", "Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "Action": "GET_AUDIT"}

Jan 21 11:28:59 server storagenode[59218]: 2022-01-21T11:28:59.249Z ERROR piecestore download failed {"Piece ID": "QUDRCIVXYTNFZD2JZRGOEXPKCBUEGVSWVWQNQDOFQBO3ZKZB7R5A", "Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "Action": "GET_AUDIT", "error": "untrusted: unable to get signee: trust: rpc: dial tcp: lookup us2.storj.io: Try again", "errorVerbose": "untrusted: unable to get signee: trust: rpc: dial tcp: lookup us2.storj.io: Try again\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).VerifyOrderLimitSignature:140\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).verifyOrderLimit:62\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:497\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:228\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:104\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:60\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:97\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52"}

Storjlings, please comment.

If the node is new, it’s being vetted. This period is there to ensure your node is “worthy” of the network, or should I say “healthy”.

At this stage, I think it’s actually a good thing if it fails and gets suspended quickly, so the SNO can investigate and figure out a fix before the node joins the network for good.

Nodes usually don’t get suspended after one or two issues. My nodes can fail 10+ audits before being in (serious) trouble. But the younger a node, the more it gets affected by online failures.
Besides, your node did not exactly fail audits, otherwise its Audit score would have dropped. Instead, an unknown error occurred during an audit. So it got suspended to let you know you need to fix it (within a limited time window) before it is allowed to go again.

With regards to the precise issue you’re facing (unable to get signee), personally I’m not sure what it means. The only kinda related thread I could find is the following, where @Alexey suggests a possible network issue:

Someone other than me might know better.
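
For what it’s worth, the error text in the log above ends in `dial tcp: lookup us2.storj.io: Try again`, which reads like a temporary name-resolution failure rather than a problem with the piece itself. A minimal Python sketch for checking whether the host can resolve the satellite name, retrying a few times, could look like this. The function name `resolve_with_retry`, the retry count and the delay are illustrative assumptions, not anything taken from the storagenode code.

```python
import socket
import time


def resolve_with_retry(host, port, attempts=3, delay=2.0, resolver=socket.getaddrinfo):
    """Resolve host:port, retrying when the lookup raises a DNS error.

    Returns the sorted list of resolved IP addresses. If every attempt
    fails, the last socket.gaierror is re-raised. `resolver` can be
    swapped out so the retry logic can be tested without network access.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            infos = resolver(host, port, type=socket.SOCK_STREAM)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the IP address.
            return sorted({info[4][0] for info in infos})
        except socket.gaierror as err:
            last_error = err
            if attempt < attempts:
                time.sleep(delay)
    raise last_error
```

Calling `resolve_with_retry("us2.storj.io", 7777)` on the node host a few times, especially right after a reboot, would show whether lookups fail intermittently.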

Really now? Then why did this even happen?

It did drop.
[screenshot of the node dashboard]
Was 100% before I rebooted the machine because of an update. Would you suggest ditching the node and starting a new one instead of it? It only earned a few cents for now - $0.17.

As I said, “the younger a node, the more it gets affected by failures”. For example, the online score is a 30-day moving average for nodes that are 30+ days old. This means that if one of these nodes is offline for 1 day, its “online” score would drop to roughly 96.7% (29 of 30 days).
However, my understanding is that if a brand new node runs for 1 day and then goes offline for 1 day, its online score would probably drop immediately to 50% or so, because it would have been offline 50% of the time since it first came online.
EDIT: Only true for online score apparently, see @BrightSilence’s answer further down.
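
The arithmetic above can be sketched in a few lines of Python. This is a simplification that treats the online score as the plain fraction of time online within the observation window. The function name and the day-based granularity are assumptions for illustration, and the satellite’s real calculation may differ.

```python
def online_score(age_days, offline_days, window_days=30):
    """Fraction of the observed period during which the node was online.

    The observed period is the node's age, capped at the window length.
    """
    observed = min(age_days, window_days)
    if observed <= 0:
        raise ValueError("node must have been observed for a positive time")
    if not 0 <= offline_days <= observed:
        raise ValueError("offline_days must lie within the observed period")
    return (observed - offline_days) / observed


# Established node (60 days old), offline for 1 day of the last 30.
print(f"{online_score(60, 1):.1%}")  # 96.7%
# Brand new node: 1 day online, then 1 day offline.
print(f"{online_score(2, 1):.1%}")   # 50.0%
```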

I’m sorry but I can clearly see on your screenshot that your Audit score is still perfectly fine (100%), but your Suspension score did drop however.
So I think we do agree that your Audit score did not drop. Right? :slight_smile:

Not really, not yet at least: nodes are supposed to recover from suspension if the issue is solved. As it is new, it may stay in this state for a while (a few days), as audits are quite rare at this stage. If a few successful audits hit your node, I believe its suspension score should recover to over 60%, and the node should be reinstated.

If your audit issue was just an unlucky combination of factors because of the reboot, then I guess it will be fine and there is no need to replace it with a new node.
If the issue is still there however, this node’s scores will eventually get worse, showing that something needs to be fixed on the system/network before starting a new one.

So either way, I’d keep this one and keep a close eye on it to see how it evolves.
Worst-case scenario, this node dies within a couple of weeks, and you can replace it then - but hopefully it will not need to come to that :slight_smile:

What I initially thought, really. Thanks!

OK. Agreed. Thanks! I guess I will check on it in a week or so. Should not have connection issues. It is on very good and working hardware.

This is a wrong assumption, as that is not how the audit and suspension score formulas work. Dropping to this level can only happen after failing many audits.
I’ve mentioned before that the audit and suspension scores should never be displayed with a % sign, as they aren’t percentages of anything. They are the outcome of a scoring formula that attempts to best represent node reliability. The actual scores are between 0 and 1. The way they’re currently displayed always leads to confusion.

I’ve done a lot of work in the past on researching the audit scoring system and suggesting changes to tune it. Failing 1 out of 2 audits would at most drop your score to 0.95 (95 on dashboard).

More info in this topic: Tuning audit scoring
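
To illustrate why a single failure moves an established score so little, here is a minimal sketch of a beta-reputation style update of the kind discussed in that topic. The decay factor `lam=0.95`, the weight `w=1.0` and the starting values of `alpha` and `beta` are assumptions made for this example, not values confirmed in this thread.

```python
def update(alpha, beta, success, lam=0.95, w=1.0):
    """Apply one audit result to the (alpha, beta) pair.

    Both values decay by `lam`; a success adds `w` to alpha,
    a failure adds `w` to beta.
    """
    v = 1.0 if success else -1.0
    alpha = lam * alpha + w * (1.0 + v) / 2.0
    beta = lam * beta + w * (1.0 - v) / 2.0
    return alpha, beta


def score(alpha, beta):
    """Score between 0 and 1 derived from alpha and beta."""
    return alpha / (alpha + beta)


# Assumed steady state after a long run of successes: alpha = w / (1 - lam) = 20.
alpha, beta = 20.0, 0.0
alpha, beta = update(alpha, beta, success=True)   # first audit passes
alpha, beta = update(alpha, beta, success=False)  # second audit fails
print(round(score(alpha, beta), 4))  # 0.95
```

Under these assumptions the score lands at 0.95, matching the figure above. The result depends on the starting state: with a much smaller starting `alpha` (a node with very little audit history on a satellite), the same update moves the score considerably further.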

@Pac’s comment is true only for the online score, which can rightfully be presented as a percentage, since it approximates the % online time in the past 30 days.

That said, I do agree with the approach, you need to know whether the issue still exists. Starting over isn’t going to help you if it does, it’ll just postpone the inevitable. And you can fix it and recover with the existing node, while still progressing on the other satellites.

Besides, you don’t need to worry about missing out on payouts from this sat… these are my earnings this month so far per satellite.


My lifetime earnings on us2 are a whopping $0.18. You should still fix it if there are problems as they might impact other sats as well, but this satellite is not a reason to kill your node over.

Hm :thinking:

I guess I’m still confused! :sweat_smile:
Thanks for clarifying.

Please, God, tell him that this is exactly why I am concerned. It is obviously not so. Mine dropped below 50%. It happened and the logs confirmed it. Whether it was doing more audits during the reboot, I cannot know.

The other problem is that this may change. Even if the sat is not that active now, it may be more active later.

OK. Final decision - not killing node to see what happens.

The audit process on your node may not even have gotten to the point where it is logged. The scores are based on the satellite’s observations, not on your node’s own. Keep in mind that there can also be a significant delay between the failures and the score drop. The restart is most likely what triggered updating the score, but the issues may have happened many hours before that. Do you even have all the logs for that?

Either way, without more info I can’t say what errors happened, but because of how the scoring formula works I can say the satellite must have marked your node with quite a few errors, not just one.

I don’t expect it to. This satellite was created for feature testing on the satellite interface. I participated in some small tests on it a while back. It’s not used by customers and I don’t think it will be used for load testing as Europe north and salt lake are usually used for that. Things can always change I guess, but it doesn’t seem likely that this will ever be a big sat.

We should be a bit more clear: if a suspension score doesn’t recover after a few hours and the node is still not getting any data, that means the issue still persists. Suspensions do not last longer than they have to if the issue no longer exists.
But if the issue still exists, then the suspension score does not recover. And if it’s an unused sat, the score may not recover until this sat is used again.

Must be tough for the node to get suspended :/ Without logs it’s impossible to tell what went wrong, but judging by the logs you posted, it had something to do with a failed ping. I only have one node, which is 12 months old. It has never failed an audit or been suspended due to a reboot.

Makes sense, so I guess I should give it even more time. I was told to make sure that the node does not fail any more audits, and the suspension score will recover.

Yeah, for some reason I thought it was US1 and I expected it to recover fast, but it’s US2, which is hardly used right now. I wouldn’t even worry about that sat, since all month my nodes have seen maybe 70 MB…

Suspension scores only change when an audit happens and the satellite sends the new score back to the node. Especially on new nodes, that can take a long time if they don’t have much data stored yet. If the issue persists, the score goes down; if not, it goes up. No change means no additional audits have happened, so we can’t draw conclusions from that yet.
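
As a rough illustration of how the score climbs once audits succeed again, the sketch below counts the successful audits needed to get back above a threshold. It assumes a beta-reputation style update with a decay factor `lam=0.95` and weight `w=1.0`, and uses the 60% level mentioned earlier in the thread as the threshold. The starting `alpha` and `beta` are hypothetical, since the satellite’s internal state for this node is not known.

```python
def audits_to_recover(alpha, beta, threshold=0.6, lam=0.95, w=1.0, max_audits=1000):
    """Count consecutive successful audits until the score reaches `threshold`.

    Each success decays both values by `lam` and adds `w` to alpha.
    Returns 0 if the score is already at or above the threshold.
    """
    if alpha + beta <= 0:
        raise ValueError("alpha + beta must be positive")
    count = 0
    while alpha / (alpha + beta) < threshold:
        if count >= max_audits:
            raise RuntimeError("threshold not reached within max_audits")
        alpha = lam * alpha + w
        beta = lam * beta
        count += 1
    return count


# Hypothetical starting states: a mildly and a heavily damaged score.
for start in [(0.95, 1.0), (1.0, 5.0)]:
    print(start, audits_to_recover(*start))
```

The point of the sketch is only that the score moves when audits arrive. On a satellite that rarely audits a node, even a small number of required successes can take days.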

According to your posts, your node is still increasing its earnings.

The specific error seems to be listed only once in the source code, in the VerifyOrderLimitSignature function. Start reading from the function comment on line 130:

https://github.com/storj/storj/blob/1c47163eeeba139670f690811b5ac27a159d590e/storagenode/piecestore/verification.go#L130

I don’t think this error has a direct relationship to the Suspension score. Perhaps the reboot of your system screwed up the signature process on one of your orders and at the same time caused a timing issue with the data flow.

If there was one wrong time out of millions to reboot a server, that was definitely it. :rofl:

It’s unlikely the reboot was the original cause. For a satellite to suddenly start auditing your node many times in that short a timeframe is basically impossible. I think the score just didn’t update until you rebooted, but the failures must have already happened prior to that. So don’t get too stuck on that reboot.

Any change in score yet? Did you see any additional audits for that satellite in the logs?
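
One way to answer that from the logs is to tally GET_AUDIT entries per satellite. The sketch below parses lines in the format shown in the journalctl output earlier in the thread. It assumes a successful transfer is logged with the message `downloaded` (only `download started` and `download failed` appear in the excerpt), and that the lines contain plain straight quotes as in the raw log rather than the typographic quotes a forum may substitute. The function name `tally_audits` is made up for this example.

```python
import json
from collections import Counter


def tally_audits(lines):
    """Count GET_AUDIT log entries per satellite ID and outcome.

    Returns a dict mapping satellite ID to a Counter with the keys
    'started', 'failed' and 'succeeded'. Lines that do not carry a
    JSON payload or are not audit downloads are skipped.
    """
    counts = {}
    for line in lines:
        start = line.find("{")
        if start == -1:
            continue
        try:
            fields = json.loads(line[start:])
        except json.JSONDecodeError:
            continue
        if fields.get("Action") != "GET_AUDIT":
            continue
        message = line[:start]
        if "download failed" in message:
            outcome = "failed"
        elif "download started" in message:
            outcome = "started"
        elif "downloaded" in message:  # assumed success message
            outcome = "succeeded"
        else:
            continue
        satellite = fields.get("Satellite ID", "unknown")
        counts.setdefault(satellite, Counter())[outcome] += 1
    return counts
```

Passing `sys.stdin` to `tally_audits` while piping the `journalctl -u node` output into the script would give per-satellite counts, so new audits for us2 are easy to spot.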

No. Pretty much all nodes I have do not get much traffic from this satellite. Same for other posters around here, obviously. Will leave it as is, so when another audit happens, it improves.

Update: node is recovering.
[screenshot of the node dashboard]

Is it a very new node?