Restarted Server And Node Got Suspended?

svet0slav · January 21, 2022, 2:55pm

Not really. It has some nice settings you can play around with. It keeps each reboot logs separately. And you can tell it to keep X much of size of data in total or clean itself on other conditions.

deathlessdd · January 21, 2022, 2:55pm

Id just keep it going to see.

svet0slav · January 21, 2022, 2:55pm

Alright. Let’s see what happens…

deathlessdd · January 21, 2022, 2:56pm

Ahh I always found it a pain to use journal for anything… Its so much easier to check a log file and date it for each month.

svet0slav · January 21, 2022, 2:57pm

Up to you, really. Let’s see what happens to the node. I will post updates here.

baker · January 21, 2022, 3:07pm

My suggestion would be to a) let the node run and see if it will recover from suspension. Suspension just means you will only have egress until the node recovers to over 60%, and b) try and determine the cause of suspension before starting a new node. In my experience suspension is not random, and whatever problems caused suspension for this node will likely effect the replacement node.

As a side note:

svet0slav:

WARN        contact:service        Your node is still considered to be online but encountered an error.        {"Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "Error": "contact: failed to dial storage node (ID: XXXX) at address nodex.domain.com:28967 using QUIC: rpc: quic: timeout: no recent network activity"}`

This indicates that UDP is not forwarded properly to the node, and although won’t cause suspension, you might want to fix that unless you have intentionally not forwarded UDP.

svet0slav · January 21, 2022, 3:14pm

Should be fine. What I use:

iptables -t filter -A INPUT -p tcp --dport 28967 -j ACCEPT
iptables -t filter -A OUTPUT -p tcp --dport 28967 -j ACCEPT
iptables -t filter -A INPUT -p udp --dport 28967 -j ACCEPT
iptables -t filter -A OUTPUT -p udp --dport 28967 -j ACCEPT

Sure. Makes sense…

SGC · January 21, 2022, 4:03pm

svet0slav:

Should be fine. What I use:

iptables -t filter -A INPUT -p tcp --dport 28967 -j ACCEPT
iptables -t filter -A OUTPUT -p tcp --dport 28967 -j ACCEPT
iptables -t filter -A INPUT -p udp --dport 28967 -j ACCEPT
iptables -t filter -A OUTPUT -p udp --dport 28967 -j ACCEPT

it should be fine, but i don’t think your node would lie about udp not working either…
clearly whatever network issues or otherwise you were having aren’t really resolved since this new set of nodes also got suspended.

lack of udp shouldn’t get you suspended tho and nor would a single reboot have any chance of getting your nodes suspended, i’ve had hours of downtime on new nodes and they didn’t get suspended.
and tons of reboots… the suspension most likely indicates that you have some kind of problem with the setup.

svet0slav · January 21, 2022, 4:19pm

Maybe the audits were happening when I had to reboot. Who knows… LOL. No logs when server not running.

baker · January 21, 2022, 4:32pm

This would only affect your online score, not your suspension score.

svet0slav · January 21, 2022, 4:42pm

It shows online 100% on all satellites, but thanks for letting me know.

It found ONLY this error about audits.

Jan 21 11:28:59 server storagenode[59218]: 2022-01-21T11:28:59.249Z ERROR piecestore download failed {“Piece ID”: “QUDRCIVXYTNFZD2JZRGOEXPKCBUEGVSWVWQNQDOFQBO3ZKZB7R5A”, “Satellite ID”: “12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo”, “Action”: “GET_AUDIT”, “error”: “untrusted: unable to get signee: trust: rpc: dial tcp: lookup us2.storj.io: Try again”, “errorVerbose”: “untrusted: unable to get signee: trust: rpc: dial tcp: lookup us2.storj.io: Try again\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).VerifyOrderLimitSignature:140\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).verifyOrderLimit:62\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:497\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:228\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:104\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:60\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:97\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52”}

Any official comment on this from storjlings? I do not have such an issue on other nodes.

All the logging about this piece QUDRCIVXYTNFZD2JZRGOEXPKCBUEGVSWVWQNQDOFQBO3ZKZB7R5A

Jan 21 06:28:07 server storagenode[1181]: 2022-01-21T06:28:07.866Z INFO piecestore upload started {“Piece ID”: “QUDRCIVXYTNFZD2JZRGOEXPKCBUEGVSWVWQNQDOFQBO3ZKZB7R5A”, “Satellite ID”: “12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo”, “Action”: “PUT_REPAIR”, “Available Space”: 5985070297216}
Jan 21 06:28:08 server storagenode[1181]: 2022-01-21T06:28:08.489Z INFO piecestore uploaded {“Piece ID”: “QUDRCIVXYTNFZD2JZRGOEXPKCBUEGVSWVWQNQDOFQBO3ZKZB7R5A”, “Satellite ID”: “12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo”, “Action”: “PUT_REPAIR”, “Size”: 399104}
Jan 21 11:28:54 server storagenode[59218]: 2022-01-21T11:28:54.244Z INFO piecestore download started {“Piece ID”: “QUDRCIVXYTNFZD2JZRGOEXPKCBUEGVSWVWQNQDOFQBO3ZKZB7R5A”, “Satellite ID”: “12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo”, “Action”: “GET_AUDIT”}
Jan 21 11:28:59 server storagenode[59218]: 2022-01-21T11:28:59.249Z ERROR piecestore download failed {“Piece ID”: “QUDRCIVXYTNFZD2JZRGOEXPKCBUEGVSWVWQNQDOFQBO3ZKZB7R5A”, “Satellite ID”: “12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo”, “Action”: “GET_AUDIT”, “error”: “untrusted: unable to get signee: trust: rpc: dial tcp: lookup us2.storj.io: Try again”, “errorVerbose”: “untrusted: unable to get signee: trust: rpc: dial tcp: lookup us2.storj.io: Try again\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).VerifyOrderLimitSignature:140\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).verifyOrderLimit:62\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:497\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:228\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:104\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:60\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:97\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52”}

svet0slav · January 22, 2022, 7:48am

Still suspended. I figured us2.storj.io:7777 satellite to be accountable for about 0.001% of all node traffic. On all nodes. Not sure why it got suspended by not passing just 1 audit out of 2, but says 48.72% on the dashboard. I believe the code should be improved. You can’t just suspend, if it fails a second audit since start. This is just not good coding. Make it suspend, if it fails 3 of 5 ratio or something. 1/2 is bad. Logs…

root@server:/tmp# journalctl -u node | grep GET_AUDIT

Jan 21 11:28:54 server storagenode[59218]: 2022-01-21T11:28:54.244Z INFO piecestore download started {“Piece ID”: “QUDRCIVXYTNFZD2JZRGOEXPKCBUEGVSWVWQNQDOFQBO3ZKZB7R5A”, “Satellite ID”: “12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo”, “Action”: “GET_AUDIT”}

Jan 21 11:28:59 server storagenode[59218]: 2022-01-21T11:28:59.249Z ERROR piecestore download failed {“Piece ID”: “QUDRCIVXYTNFZD2JZRGOEXPKCBUEGVSWVWQNQDOFQBO3ZKZB7R5A”, “Satellite ID”: “12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo”, “Action”: “GET_AUDIT”, “error”: “untrusted: unable to get signee: trust: rpc: dial tcp: lookup us2.storj.io: Try again”, “errorVerbose”: “untrusted: unable to get signee: trust: rpc: dial tcp: lookup us2.storj.io: Try again\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).VerifyOrderLimitSignature:140\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).verifyOrderLimit:62\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:497\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:228\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:104\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:60\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:97\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52”}

Storjlings, please comment.

Pac · January 22, 2022, 9:23am

If the node is new, it’s being vetted. This period is there to ensure your node is “worthy” of the network, or should I say “healthy”.

As this stage, I think it’s actually a good thing if it fails and gets suspended quickly so the SNO can investigate and figure out a fix before the node joins the network for good.

Nodes usually don’t get suspended after one or two issues. My nodes can fail 10+ audits before being in (serious) trouble. But the younger a node, the more it gets affected by online failures.
Besides, your node did not precisely fail audits otherwise its Audit score would have dropped, instead an unknown error occurred during an audit. So it got suspended to let you know you need to fix it (during a limited time window) before it is allowed to go again.

With regards to the precise issue you’re facing (unable to get signee), personally I’m not sure what it means. The only kinda related thread I could find is the following, where @Alexey suggests a possible network issue:

Someone else than me might know better.

svet0slav · January 22, 2022, 9:25am

Really now? Then why did this even happen?

It did drop.

Was 100% before I rebooted the machine because of an update. Would you suggest ditching the node and starting a new one instead of it? It only earned a few cents for now - $0.17.

Pac · January 22, 2022, 9:49am

As I said, “the younger a node, the more it gets affected by failures”. For example, the online score is a 30 days moving average for nodes that are 30+ days old. Which means that if one of these nodes is offline for 1 day, its “online” score would drop to +/- 96%.
However, my understanding is that if a brand new node runs for 1 day, then gets offline for 1 day, its online score would probably drop immediately to 50% or so, because it would have been offline 50% of the time since it’s online.
EDIT: Only true for online score apparently, see @BrightSilence’s answer further down.

I’m sorry but I can clearly see on your screenshot that your Audit score is still perfectly fine (100%), but your Suspension score did drop however.
So I think we do agree that your Audit score did not drop. Right?

Not really, not yet at least: nodes are supposed to recover from suspension if the issue is solved. As it is new, it may stay in this state for a while (few days) as audits are quite rare at this stage. If a few successful audit hits your node, I believe its suspension score should recover to over 60%, and the node should be reinstated.

If your audit issue was just an unluck combination of factors because of the reboot, then I guess it will be fine and there is not need to replace it with a new node.
If the issue is still there however, this node’s scores will eventually get worse showing that something needs to be fixed on the system/network before starting a new one.

So either way, I’d keep this one and keep a close eye on it to see how it evolves.
Worse case scenario, this node dies within a couple of weeks, and you can replace it then - but hopefully it will not need to come down to that

svet0slav · January 22, 2022, 9:57am

What I initially thought, really. Thanks!

OK. Agreed. Thanks! I guess I will check on it in a week or so. Should not have connection issues. It is on very good and working hardware.

BrightSilence · January 22, 2022, 10:34am

This is a wrong assumption as that is not how the audit and suspension score formulas works. Dropping to this level can only happen after failing many audits.
I’ve mentioned before that the audit and suspension scores should never be displayed with a % sign as they aren’t percentages of anything. They are the outcome of a scoring formula that attempts to best represent node reliability. The actual scores are actually between 0 and 1. The way it’s currently displayed always leads to confusion.

I’ve done a lot of work in the past on researching the audit scoring system and suggesting changes to tune it. Failing 1 out of 2 audits would at most drop your score to 0.95 (95 on dashboard).

More info in this topic: Tuning audit scoring

@Pac 's comment is true only for the online score, which can rightfully be presented as a percentage, since it approximates the % online time in the past 30 days.

That said, I do agree with the approach, you need to know whether the issue still exists. Starting over isn’t going to help you if it does, it’ll just postpone the inevitable. And you can fix it and recover with the existing node, while still progressing on the other satellites.

Besides, you don’t need to worry about missing out on payouts from this sat… these are my earnings this month so far per satellite.

My lifetime earnings on us2 are a whopping $0.18. You should still fix it if there are problems as they might impact other sats as well, but this satellite is not a reason to kill your node over.

Pac · January 22, 2022, 10:45am

Hm

I guess I’m still confused!
Thanks for clarifying.

svet0slav · January 22, 2022, 10:48am

Please, God, tell him that is is exactly why I am concerned. It is obviously not so. Mine dropped below 50%. It happened and logs confirmed. If it was doing more audits during the reboot, i cannot know.

The other problem is this may change. No matter the sat is not that active now. It may be more active later.

OK. Final decision - not killing node to see what happens.

BrightSilence · January 22, 2022, 11:05am

The audit process on your node may not even have gotten to the point where it is logged. The scores are based on satellite observations, not on your nodes own. Keep in mind that there can also be a significant delay between the failures and the score drop. The restart is most likely what triggered updating the score, but the issues may have happened many hours before that. Do you even have all the logs for that?

Either way, without more info I can’t say what errors happened, but because of how the scoring formula works I can say the satellite must have marked your node with quite a few errors, not just one.

I don’t expect it to. This satellite was created for feature testing on the satellite interface. I participated in some small tests on it a while back. It’s not used by customers and I don’t think it will be used for load testing as Europe north and salt lake are usually used for that. Things can always change I guess, but it doesn’t seem likely that this will ever be a big sat.