Working as designed. This version will allow you to run the storagenode, but it will not receive any traffic.
Perhaps the design is not good, but that's how it is for the moment.
Any lower version of the storagenode will not start either; it will crash with a "too old version" message in the logs and go offline. More than 30 days in an offline state and it will be disqualified.
Can we check with the eu1 satellite what happened in this case?
Node ID is: 12nnNXUthnuz2gq2uotu2xcdouFonhhd54PcgXfuART2w4DQRay
It shouldn't happen so fast for a node that has been perfect for a long time.
I'm afraid the current disqualification method has problems distinguishing failed audits from time spent offline.
It kills a good node too fast in my opinion; the node just had some time offline, IMHO.
Ideal for me would be an option to "appeal" the disqualification (ideally a button in the SNO dashboard) to ask the satellite to perform the last failed audits once again.
Because the audit's function, in my opinion, is to check files, right? But currently the dashboard says something else, as I showed in the screenshot in my first post: "Percentage of successful pings/communication between the node & satellite".
I don't know if that disqualification was right or not; I'm saying it was not, and I would like to appeal, please.
I sent a request to the team. Please do not expect a quick answer. The team is busy, the investigation is time consuming, and the results are predictable: either timeouts or broken pieces, since you do not have failed audits in the logs and the node is disqualified.
The tooltip is not exactly correct, but I cannot come up with a better explanation. If you have one, please suggest it.
Tooltip for Audit score in the single satellite view
Yeah, no problem. Thanks, Alex.
I'm just insisting on taking this topic seriously because a few other people had a similar problem and I saw the topic wasn't completely solved.
My suspicion, other than the node having provided corrupted pieces, is that in some scenario the satellite may still think the node is online (within 3 minutes after last contact, or whatever the window is, as I remember) and send GET_AUDIT requests to the node, which cannot be answered in time because the node is in fact offline. This may happen if the node goes online for a few minutes and the PC then restarts because of some hardware problem. I had some problems with Windows 10 and had a few node starts that lasted only a few minutes; I'm afraid the satellite didn't realize quickly enough that the node was offline again and sent some GET_AUDIT requests. Again, that happened only on eu1, and only on that node; I have other nodes on this PC and they are fine.
The satellite contacts the node, and the node must answer; otherwise the audit is considered offline. This affects the online score.
If the node answered (so it's definitely online), the satellite then requests a random piece and waits up to 5 minutes for the node to provide the requested part of the piece (a few KB) or until timeout. If the node didn't provide the piece, it is placed into containment mode and will be asked for the same piece two more times. If the node is still unable to provide the piece, the audit is considered failed and the node goes out of containment mode.
The second possibility: the node answered the audit request and provided a piece, but the piece is corrupted; in that case the audit is immediately considered failed.
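The outcomes described above can be sketched roughly like this. Everything here (function name, result labels, the three-attempt limit) is my own illustration based on the description, not satellite code:

```python
# Hedged sketch of the audit outcomes described above.
# REVERIFY_LIMIT reflects "asked for the same piece two more times":
# the original attempt plus two re-verifications while contained.
REVERIFY_LIMIT = 3

def audit_outcome(responses):
    """responses: per-attempt results, each one of
    'offline', 'timeout', 'corrupted', 'ok'."""
    if responses[0] == "offline":
        return "offline"          # no answer: affects online score only
    for result in responses[:REVERIFY_LIMIT]:
        if result == "ok":
            return "success"      # piece provided, audit passes
        if result == "corrupted":
            return "failed"       # bad piece fails immediately
        # 'timeout': node goes into (or stays in) containment mode
    return "failed"               # out of re-verification attempts

print(audit_outcome(["timeout", "timeout", "ok"]))       # contained, then passes
print(audit_outcome(["timeout", "timeout", "timeout"]))  # fails after 3 tries
```

So a node that is merely slow gets three chances at the same piece, while a corrupted piece fails on the spot.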
high enough disk latency that it fails audits repeatedly
run a version of the storagenode that is too old.
Since your audit score dropped, we can exclude 2 and 4…
So either you lost data or you had issues with your disk or connection…
It doesn't take much to get DQed on a satellite holding limited data with the current audit model,
and really that's what happens in most cases: some tiny error occurs, and because the node is small the error isn't insignificant relative to that satellite's amount of stored data, and the node eventually dies.
But it doesn't happen without reason: your audit score doesn't drop without failed audits, or at least I'm unaware of any reliable data saying otherwise…
Personally I've been running for 16 months and have millions of successful audits…
I failed a couple because I lost a few files during a migration…
My loss of just a couple of files put my 6-month-old node down to 87% briefly,
so it doesn't take much to get DQed on a satellite with low data…
basically just have to look at it wrong.
I appreciate you linking my research, but the entire point of it was that it does take a lot. Nodes with substantial data loss survive for a really long time. To get disqualified you need to fail 15+% of audits, usually more.
I’ve pointed out elsewhere that your conclusion on what happened here is a statistical impossibility. You’re now using your wrong conclusion to confuse others, so I’m going to have to call that out. Losing just a few files on a large node will in many cases never even lead to a drop in score and when it does, with the current system it would never drop below 95%. A drop to 87% requires at least 3 failed audits in relatively quick succession, which can only happen if your system has lost a quite substantial amount of data or is unable to respond to audits correctly due to some other bottleneck.
Also, 12 days downtime leads to suspension, not disqualification.
Well, you don't know how much data he has on each satellite. It's a well-known issue that low amounts of data lead to more volatility, and thus younger nodes seem much more likely to just randomly get DQed by a statistical outlier, as you called it.
I keep forgetting that, dunno why…
Maybe I mistook the online score for the audit score, but I don't think so, because I thought it was very low and actually went back to check it again…
Sure, maybe I lost more than a couple of files, but a node usually doesn't take in many when it's only online for as long as it takes to run docker logs --tail 200 storagenode --follow, then Ctrl+C and docker stop storagenode.
That doesn't exactly leave a lot of room for things to go wrong, and the other-bottleneck idea is just absurd; I've been running this for 16 months and never seen anything like that… Besides, bottlenecks usually don't lead to audit failures, since the satellite later does double audit checks on a failed audit before it counts…
One time I stalled my pool so badly it couldn't even shut down the system and the storagenode couldn't access it, and still it worked just fine… That was on another node, though, and it never even got a failed audit because of my redundant setup… That was 10 months ago now, and it ran like that for nearly an hour because I was curious to get some good info on how it would behave and how to avoid it. Today I have a hardware watchdog configured to deal with issues such as stalls.