I was mostly complaining that the problem of bad communication is still not solved. I won’t put more effort into complaints until Storj puts more effort into answering complaints.
I can confirm, this issue is not yet resolved.
The disqualification for three versions behind is not enabled as far as I know.
So, if your node doesn't have failed audits in the log, then two other possibilities remain:
Working as designed. This version will allow you to run the storagenode, but it will not receive any traffic.
Perhaps the design is not good, but it's like this for the moment.
Any lower version of storagenode will not start either; it will crash with a "too old version" message in the logs and will be offline. More than 30 days in an offline state and it will be disqualified.
Can we check with satellite eu1 what it was in this case?
Node ID is: 12nnNXUthnuz2gq2uotu2xcdouFonhhd54PcgXfuART2w4DQRay
because it shouldn't happen so fast for a node that was perfect for a long time.
I'm afraid the current disqualification method has some problems distinguishing failed audits from offline time.
It kills a good node too fast in my opinion; the node was just offline for some time, imho.
Ideal for me would be an option to "appeal" the disqualification (ideally a button in the SNO dashboard) to ask the satellite to perform the last failed audits once again.
- Because the audit's function, in my opinion, is to check files, right? But currently the dashboard says something else, as I showed in the screenshot in my first post: "Percentage of successful pings/communication between the node & satellite"
I don't know if that disqualification was right or not; I'm saying it was not, and I want to appeal, please.
The best course of action would be to create a support ticket - Submit a request – Storj
You can submit the node ID along with your information and the Storj team should reply directly to your disqualification query.
I sent a request to the team. Please do not expect a quick answer. The team is busy and investigation is time-consuming, and the results are predictable: either timeouts or broken pieces, since you do not have failed audits in the logs and the node is disqualified.
The tooltip is not exactly correct, but I cannot come up with a better explanation. If you have one, please suggest it.
- Percentage of healthy pieces of data on the node
- Percentage of trust of the satellite to the node
- own (please, specify)
Yeah, no problem, Thx Alex.
I'm just insisting we take this topic seriously, because a few other people had a similar problem and I saw the topic wasn't solved completely.
My suspicion, other than the node providing corrupted pieces, is that in some scenario the satellite may think the node is online (for 3 minutes after last contact, or whatever the window is, as I remember) and send GET_AUDIT requests to a node that cannot answer in time because it is in fact offline. This may happen if the node goes online for a few minutes and then the PC restarts because of some hardware problem; I had some problems with Windows 10, and had a few node starts that lasted only a few minutes. I'm afraid the satellite didn't realize quickly enough that the node was offline again and sent some GET_AUDITs. Again, that happened only on eu1 and only on that node; I have other nodes on this PC and they are fine.
Maybe something like:
A trust score representing the node's response quality on random file access challenges
The satellite contacts the node and the node must answer; otherwise the audit is considered offline. This affects the online score.
If the node answered (so it's definitely online), the satellite requests a random piece and waits up to 5 minutes for the node to provide the requested part of the piece (a few KB), or until timeout. If the node didn't provide the piece, it's placed into containment mode and will be asked for the same piece two more times. If the node is still unable to provide the piece, the audit is considered failed and the node leaves containment mode.
The second possibility: the node answered the audit request and provided a piece, but it's corrupted. In that case the audit is immediately considered failed.
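The flow described above can be restated as a minimal Python sketch. This is not Storj's actual satellite code; `FakeNode` and all function names here are hypothetical stand-ins for illustration only.

```python
# Hypothetical sketch of the audit flow described above.
# All names (FakeNode, fetch_stripe, stripe_is_valid) are invented for
# illustration; the real satellite implementation differs.

AUDIT_TIMEOUT_RETRIES = 2  # two more attempts while in containment mode


class FakeNode:
    """Toy stand-in for a storage node, configurable per scenario."""

    def __init__(self, reachable=True, responds=True, valid=True):
        self.reachable, self.responds, self.valid = reachable, responds, valid

    def is_reachable(self):
        return self.reachable

    def fetch_stripe(self, timeout_minutes):
        # Returns a few KB of piece data, or None on timeout.
        return b"stripe-data" if self.responds else None

    def stripe_is_valid(self, stripe):
        return self.valid


def audit(node):
    # Step 1: contact the node. No answer only hurts the ONLINE score.
    if not node.is_reachable():
        return "offline"

    # Step 2: request part of a random piece, waiting up to 5 minutes.
    stripe = node.fetch_stripe(timeout_minutes=5)
    if stripe is None:
        # Timeout: containment mode -- retry the SAME piece two more times.
        for _ in range(AUDIT_TIMEOUT_RETRIES):
            stripe = node.fetch_stripe(timeout_minutes=5)
            if stripe is not None:
                break
        else:
            return "failed"  # still nothing after containment retries

    # Step 3: data arrived; corrupted data fails the audit immediately.
    return "passed" if node.stripe_is_valid(stripe) else "failed"
```

The key design point from the post: an unreachable node affects only the online score, while a reachable node that times out repeatedly or serves corrupted data fails the audit.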
I would like to eliminate the possibility of broken pieces first; let me know if I lost pieces or it was a timeout. Thx.
there are only a few ways to get DQ.
- lose too much of the storagenode's data
- have more than 12 days of downtime in a month
- disk latency high enough that it fails audits repeatedly
- run a version of the storagenode that is too old.
since your audits dropped we can exclude 2 and 4…
so either you lost data or you had issues with your disk, connection…
it doesn’t take much to get dq for a satellite with limited data… with the current audit model,
and really thats what happens in most cases… some tiny error happens, and because the node is small that error isn’t insignificant enough to that satellites amount of stored data… and the node eventually dies.
but it doesn’t happen without reason, your audit score doesn’t drop without failed audits, or atleast i’m unaware of any reliable data saying otherwise…
and personally, I've been running for 16 months and have millions of successful audits…
failed a couple because I lost a few files during a migration…
my loss of just a couple of files put my 6-month-old node down to 87% briefly
so it doesn't take much to get DQ on a satellite with low data…
basically you just have to look at it wrong.
I appreciate you linking my research, but the entire point of it was that it does take a lot. Nodes with substantial loss of data get to survive for a really long time. To get disqualified you need to fail 15+% of audits, usually more.
I’ve pointed out elsewhere that your conclusion on what happened here is a statistical impossibility. You’re now using your wrong conclusion to confuse others, so I’m going to have to call that out. Losing just a few files on a large node will in many cases never even lead to a drop in score and when it does, with the current system it would never drop below 95%. A drop to 87% requires at least 3 failed audits in relatively quick succession, which can only happen if your system has lost a quite substantial amount of data or is unable to respond to audits correctly due to some other bottleneck.
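The score math behind this claim can be illustrated with a small simulation. This assumes the beta-reputation update rule described in Storj's reputation design, with an assumed forgetting factor λ = 0.95 and weight 1; the satellite's actual constants may differ, so treat the exact numbers as a sketch.

```python
# Beta-reputation audit score sketch (assumed constants, not official):
#   alpha <- lambda*alpha + w*(1+v)/2
#   beta  <- lambda*beta  + w*(1-v)/2
# where v = +1 for a passed audit, -1 for a failed one.
# Score = alpha / (alpha + beta).

LAMBDA, W = 0.95, 1.0  # assumptions; real satellite values may differ


def update(alpha, beta, passed):
    v = 1.0 if passed else -1.0
    return LAMBDA * alpha + W * (1 + v) / 2, LAMBDA * beta + W * (1 - v) / 2


def score(alpha, beta):
    return alpha / (alpha + beta)


# Warm up with a long run of successful audits; the score saturates at 1.0
# and alpha converges toward w / (1 - lambda) = 20.
alpha, beta = 1.0, 0.0
for _ in range(1000):
    alpha, beta = update(alpha, beta, passed=True)

# Now fail audits in quick succession and watch the score drop.
for n in range(1, 4):
    alpha, beta = update(alpha, beta, passed=False)
    print(f"after {n} consecutive failures: {score(alpha, beta):.3f}")
```

Under these assumed constants, one isolated failure on a previously perfect node lands the score around 0.95, two in a row around 0.90, and only a third consecutive failure pushes it near 0.86, which is consistent with the point that a drop to 87% needs at least three failed audits in quick succession.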
Also, 12 days downtime leads to suspension, not disqualification.
well you don’t know how much data he has for each satellite, it’s a well known issue that low amounts of data leads to more volatility and thus younger nodes seems to be much more likely to just randomly getting DQ by the statistical outlier as you called it.
I keep forgetting that, dunno why…
maybe I mistook the online score for the audit score, but I don't think so, because I thought it was very low and actually went back to check it again…
sure, maybe I lost more than a couple of files, but a node usually doesn't take in much when it's only online for as long as it takes to run `docker logs --tail 200 storagenode --follow`, then Ctrl+C and `docker stop storagenode`
that doesn't exactly leave a lot of room for things to go wrong, and the other-bottleneck idea is just absurd; I've been running this for 16 months and never seen anything like that… besides, bottlenecks usually don't lead to audit failures, since a failed audit is double-checked later before it counts…
one time I stalled my pool so badly the system couldn't even shut down and the storagenode couldn't access it… and it still worked just fine… that was on another node, though, and it never even got a failed audit because of my redundant setup… that was 10 months ago now, and it ran like that for nearly an hour because I was curious to get some good info on how it would behave and how to avoid it. today I have a hardware watchdog configured to deal with issues such as stalls.
Could you please search in logs for any GET failed?
Sure, there are some, and some attempts to ping the satellite, like 9 attempts. I will send you a link in a private message to download it.
I was informed that if you have GET_REPAIR failures, they are counted against the audit score.
What kind of errors do you have with GET_REPAIR?
No errors with GET_REPAIR, just with GET.
Lol, I found that I cannot attach a log file (7 MB) in a PM; there is no way to upload a file other than an image.
I sent a log file via the Storj website; the "request (11138) has been received and is being reviewed by our support staff."
That's good info to have. I assume this isn't the case if the failure was a timeout?
Normal GET traffic comes directly from customers, unlike GET_AUDIT and GET_REPAIR, which are both initiated by satellite services. So normal GET traffic can't ever count against your audit score.
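Since only audit- and repair-initiated downloads matter for the audit score, it can help to separate failures by action when searching the logs. Here is a small sketch in Python; the sample lines are illustrative only (the real storagenode log format varies by version), but each download line does include the action name.

```python
# Count "download failed" log lines per action. Only GET_AUDIT and
# GET_REPAIR failures can affect the audit score; plain GET is customer
# traffic. SAMPLE_LOG is made up for illustration -- real storagenode
# log lines differ by version but carry the same action names.

SAMPLE_LOG = """\
2021-05-01T10:00:00Z ERROR piecestore download failed ... "Action": "GET"
2021-05-01T10:05:00Z ERROR piecestore download failed ... "Action": "GET_AUDIT"
2021-05-01T10:06:00Z INFO  piecestore downloaded ... "Action": "GET_REPAIR"
2021-05-01T10:07:00Z ERROR piecestore download failed ... "Action": "GET_REPAIR"
"""


def count_failures(log_text):
    counts = {"GET": 0, "GET_AUDIT": 0, "GET_REPAIR": 0}
    for line in log_text.splitlines():
        if "download failed" not in line:
            continue  # successful downloads don't matter here
        # Check the quoted action name; quotes keep "GET" from
        # matching inside "GET_AUDIT" or "GET_REPAIR".
        for action in ("GET_AUDIT", "GET_REPAIR", "GET"):
            if f'"{action}"' in line:
                counts[action] += 1
                break
    return counts


print(count_failures(SAMPLE_LOG))
# → {'GET': 1, 'GET_AUDIT': 1, 'GET_REPAIR': 1}
```

For a live node, the same idea works by piping `docker logs storagenode` through a filter instead of using an embedded sample.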
I have no details yet.