Not really. It has some nice settings you can play around with: it keeps each boot’s logs separately, and you can tell it to cap the total amount of data it keeps, or to clean itself up on other conditions.
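For reference, assuming this is about systemd-journald (the log lines further down are in journal format), those settings live in /etc/systemd/journald.conf; the values below are just examples, not recommendations:

[Journal]
Storage=persistent      # keep logs across reboots
SystemMaxUse=500M       # cap the total disk space the journal may use
MaxRetentionSec=1month  # or clean up by age instead

journalctl --list-boots then shows the per-boot journals, and e.g. journalctl -b -1 -u storagenode shows the node’s logs from the previous boot.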
My suggestion would be to a) let the node run and see if it recovers from suspension (suspension just means you will only have egress until the score recovers to over 60%), and b) try to determine the cause of the suspension before starting a new node. In my experience suspension is not random, and whatever problem caused this node’s suspension will likely affect the replacement node.
As a side note:
This indicates that UDP is not forwarded properly to the node. Although it won’t cause suspension, you might want to fix it, unless you have intentionally not forwarded UDP.
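If you want to check UDP reachability independently of the node, here is a minimal sketch (it assumes the default node port 28967; adjust to your setup). Run it on the node’s host while the storagenode is stopped, then send a datagram from a machine outside your network, e.g. echo ping | nc -u <public-ip> 28967:

package main

import (
	"fmt"
	"log"
	"net"
)

func main() {
	// Listen on the UDP port that should be forwarded to this machine.
	conn, err := net.ListenPacket("udp", ":28967")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	buf := make([]byte, 1500)
	fmt.Println("waiting for a datagram on UDP 28967...")
	for {
		n, addr, err := conn.ReadFrom(buf)
		if err != nil {
			log.Fatal(err)
		}
		// If this ever prints, UDP forwarding works end to end.
		fmt.Printf("got %d bytes from %s: %q\n", n, addr, buf[:n])
	}
}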
It should be fine, but I don’t think your node would lie about UDP not working either…
Clearly whatever network or other issues you were having aren’t really resolved, since this new set of nodes also got suspended.
Lack of UDP shouldn’t get you suspended though, nor would a single reboot have any chance of getting your nodes suspended; I’ve had hours of downtime on new nodes and they didn’t get suspended.
And tons of reboots… The suspension most likely indicates that you have some kind of problem with the setup.
Still suspended. I figured the us2.storj.io:7777 satellite accounts for about 0.001% of all node traffic, across all my nodes. Not sure why it got suspended for failing just 1 audit out of 2, yet the dashboard says 48.72%. I believe the code should be improved. You can’t just suspend a node because it fails its second audit since starting; this is just not good coding. Make it suspend at something like a 3-out-of-5 failure ratio; 1 out of 2 is bad. Logs…
Jan 21 11:28:54 server storagenode[59218]: 2022-01-21T11:28:54.244Z INFO piecestore download started {"Piece ID": "QUDRCIVXYTNFZD2JZRGOEXPKCBUEGVSWVWQNQDOFQBO3ZKZB7R5A", "Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "Action": "GET_AUDIT"}
Jan 21 11:28:59 server storagenode[59218]: 2022-01-21T11:28:59.249Z ERROR piecestore download failed {"Piece ID": "QUDRCIVXYTNFZD2JZRGOEXPKCBUEGVSWVWQNQDOFQBO3ZKZB7R5A", "Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "Action": "GET_AUDIT", "error": "untrusted: unable to get signee: trust: rpc: dial tcp: lookup us2.storj.io: Try again", "errorVerbose": "untrusted: unable to get signee: trust: rpc: dial tcp: lookup us2.storj.io: Try again\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).VerifyOrderLimitSignature:140\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).verifyOrderLimit:62\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:497\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:228\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:104\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:60\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:97\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52"}
If the node is new, it’s being vetted. This period is there to ensure your node is “worthy” of the network, or should I say “healthy”.
At this stage, I think it’s actually a good thing if it fails and gets suspended quickly, so the SNO can investigate and figure out a fix before the node joins the network for good.
Nodes usually don’t get suspended after one or two issues. My nodes can fail 10+ audits before being in (serious) trouble. But the younger a node, the more it gets affected by online failures.
Besides, your node did not exactly fail audits, otherwise its Audit score would have dropped; instead, an unknown error occurred during an audit. So it got suspended to let you know you need to fix it (within a limited time window) before it is allowed to go again.
With regards to the precise issue you’re facing (unable to get signee), personally I’m not sure what it means. The only kinda related thread I could find is the following, where @Alexey suggests a possible network issue:
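For what it’s worth, the tail of that error ("lookup us2.storj.io: Try again") is a plain DNS resolution failure on the node’s host (the resolver’s temporary-failure condition), which would fit the network-issue theory. You can reproduce the lookup outside the node with a few lines of Go (or simply nslookup us2.storj.io):

package main

import (
	"fmt"
	"net"
	"os"
)

func main() {
	// Same hostname the node failed to resolve in the log above.
	addrs, err := net.LookupHost("us2.storj.io")
	if err != nil {
		fmt.Fprintln(os.Stderr, "lookup failed:", err)
		os.Exit(1)
	}
	fmt.Println("resolved:", addrs)
}

If this fails intermittently on the node’s host, the problem is local DNS, not the satellite.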
It was 100% before I rebooted the machine because of an update. Would you suggest ditching the node and starting a new one instead? It has only earned a few cents so far: $0.17.
As I said, “the younger a node, the more it gets affected by failures”. For example, the online score is a 30-day moving average for nodes that are 30+ days old, which means that if one of these nodes is offline for 1 day, its online score drops to roughly 96.7% (29 of 30 days online).
However, my understanding is that if a brand-new node runs for 1 day and then goes offline for 1 day, its online score would probably drop immediately to 50% or so, because it would have been offline 50% of the time since it came online. EDIT: Only true for the online score apparently, see @BrightSilence’s answer further down.
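To make the arithmetic behind both examples concrete, here is a rough illustration; it is not the satellite’s exact bookkeeping, just the moving-average idea described above:

package main

import "fmt"

// onlineScore approximates the online score as the fraction of time the
// node was reachable within the scoring window.
func onlineScore(daysOnline, windowDays float64) float64 {
	return daysOnline / windowDays
}

func main() {
	// Mature node (30+ days of history), offline 1 of the last 30 days:
	fmt.Printf("mature node:    %.1f%%\n", 100*onlineScore(29, 30)) // ~96.7%
	// Brand-new node, online 1 day then offline 1 day; its window only
	// covers the 2 days it has existed, so the score halves at once:
	fmt.Printf("brand-new node: %.1f%%\n", 100*onlineScore(1, 2)) // 50.0%
}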
I’m sorry, but I can clearly see on your screenshot that your Audit score is still perfectly fine (100%); your Suspension score did drop, however.
So I think we do agree that your Audit score did not drop. Right?
Not really, not yet at least: nodes are supposed to recover from suspension if the issue is solved. As yours is new, it may stay in this state for a while (a few days), as audits are quite rare at this stage. If a few successful audits hit your node, I believe its suspension score should recover to over 60%, and the node should be reinstated.
If your audit issue was just an unlucky combination of factors because of the reboot, then I guess it will be fine and there is no need to replace it with a new node.
If the issue is still there however, this node’s scores will eventually get worse showing that something needs to be fixed on the system/network before starting a new one.
So either way, I’d keep this one and keep a close eye on it to see how it evolves.
Worst case scenario, this node dies within a couple of weeks and you can replace it then. But hopefully it will not come down to that.
This is a wrong assumption, as that is not how the audit and suspension score formulas work. Dropping to this level can only happen after failing many audits.
I’ve mentioned before that the audit and suspension scores should never be displayed with a % sign, as they aren’t percentages of anything. They are the outcome of a scoring formula that attempts to best represent node reliability. The actual scores are between 0 and 1. The way they’re currently displayed always leads to confusion.
I’ve done a lot of work in the past on researching the audit scoring system and suggesting changes to tune it. Failing 1 out of 2 audits would at most drop your score to 0.95 (95 on the dashboard).
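To illustrate why a single failure cannot produce a reading like 48.72% on a long-healthy node, here is a sketch of the update rule. The parameters (lambda = 0.95, weight = 1, score = alpha / (alpha + beta)) are my reading of the open-source satellite defaults at the time, so treat the exact numbers as illustrative rather than authoritative:

package main

import "fmt"

const (
	lambda = 0.95 // forgetting factor: how quickly old audits fade out
	weight = 1.0  // contribution of a single audit
)

// update folds one audit outcome into the (alpha, beta) counters.
func update(alpha, beta float64, success bool) (float64, float64) {
	v := -1.0
	if success {
		v = 1.0
	}
	return lambda*alpha + weight*(1+v)/2, lambda*beta + weight*(1-v)/2
}

func score(alpha, beta float64) float64 { return alpha / (alpha + beta) }

func main() {
	// A long-healthy node's alpha saturates at weight/(1-lambda) = 20.
	alpha, beta := 20.0, 0.0
	for n := 1; n <= 10; n++ {
		alpha, beta = update(alpha, beta, false)
		fmt.Printf("after %2d consecutive failures: %.4f\n", n, score(alpha, beta))
	}
	// Prints 0.9500 after one failure and ~0.5987 after ten, i.e. it takes
	// on the order of ten straight failures to cross the 60% suspension
	// threshold, consistent with "fail 10+ audits" mentioned above.
}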
@Pac’s comment is true only for the online score, which can rightfully be presented as a percentage, since it approximates the percentage of online time in the past 30 days.
That said, I do agree with the approach, you need to know whether the issue still exists. Starting over isn’t going to help you if it does, it’ll just postpone the inevitable. And you can fix it and recover with the existing node, while still progressing on the other satellites.
Besides, you don’t need to worry about missing out on payouts from this sat… these are my earnings this month so far per satellite.
My lifetime earnings on us2 are a whopping $0.18. You should still fix it if there are problems as they might impact other sats as well, but this satellite is not a reason to kill your node over.
Please, God, tell him that this is exactly why I am concerned. It is obviously not so: mine dropped below 50%. It happened, and the logs confirm it. Whether it was failing more audits during the reboot, I cannot know.
The other problem is that this may change. Even if the satellite is not that active now, it may become more active later.
OK. Final decision: not killing the node, to see what happens.
The audit process on your node may not even have gotten to the point where it is logged. The scores are based on the satellite’s observations, not on your node’s own logs. Keep in mind that there can also be a significant delay between the failures and the score drop: the restart is most likely what triggered updating the score, but the issues may have happened many hours before that. Do you even have all the logs for that period?
Either way, without more info I can’t say what errors happened, but because of how the scoring formula works I can say the satellite must have marked your node with quite a few errors, not just one.
I don’t expect it to. This satellite was created for feature testing on the satellite interface. I participated in some small tests on it a while back. It’s not used by customers and I don’t think it will be used for load testing as Europe north and salt lake are usually used for that. Things can always change I guess, but it doesn’t seem likely that this will ever be a big sat.