I've submitted a ticket already, but I was wondering if I'm the only one having this issue.
Just came downstairs wanting to check the dashboard and it showed paused on one node, but it didn't open properly. Then I checked the Synology and realized something was off: almost 24 hours ago RAM usage spiked and it began to swap, plus traffic went down… it took the whole night to fill up until it swapped and the system almost stalled.
I stopped the node even though it was barely responding; after a couple of minutes the swap usage went down, but some remained, so I rebooted the Synology. It came back nicely and now everything is good again. I never had any memory issues before, so it could be the latest update… But I also have nothing else running on the Synology besides Time Machine backups, and none of my computers ran one overnight…
The logfiles also look good, no audit issues or anything else to see.
Anyone else seeing weird behaviour around that time?
As said, I opened a ticket to get unblocked on the US and Asia nodes, but I wanted to see if I'm the only one seeing that behaviour during that time.
As I also replied on the ticket, I think there's a big difference between:
Losing data due to hardware failures or issues like these (actually losing data)
Being offline (which in my case means the node was still running, but due to swapping etc. it became more or less unresponsive)
That's also why my monitoring script didn't send any alarm (the process was still running).
I've now set up alarms for swap usage etc., see the sketch below.
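A minimal sketch of the swap alarm I've set up (the threshold and the mail command are specific to my setup and just an example):

```sh
#!/bin/sh
# alert when more than ~256 MB of swap is in use
THRESHOLD_KB=262144
USED_KB=$(awk '/SwapTotal/ {t=$2} /SwapFree/ {f=$2} END {print t-f}' /proc/meminfo)
if [ "$USED_KB" -gt "$THRESHOLD_KB" ]; then
    # 'mail' assumed to be configured; swap in your own notification method
    echo "Swap usage is ${USED_KB} kB" | mail -s "Swap alarm on Synology" admin@example.com
fi
```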
Also: disqualification is permanent now. I do understand that in general for data loss, but not for being 'offline'. I know this is being discussed in the other threads as well (why disqualify a node if the data is fully there but, for example, the home DSL connection was down for a few hours)…
But I still don’t understand the scores vs the audits.
For example, on the second satellite: 318 of 345 audit checks 'ok', which to me is more like 92%… yet the score shows 0.59 (is that 59%?). I read that below 0.6 a node gets disqualified, OK, got it… but as my process was hanging it wasn't communicating; the data should still be there, so no 'lost data'.
Also, my past understanding was that an audit verifies whether a file is stored correctly, but the descriptions on the web interface say:
Uptime Checks: Uptime checks occur to make sure your node is still online. This is the percentage of uptime checks you’ve passed.
Audit Checks: Percentage of successful pings/communication between the node & satellite.
So neither of these checks has anything to do with file integrity, right?
Sorry guys, but I'm confused. Apologies for posting my own progress here, but after I switched from my Raspberry Pi 3 setup to the Synology I thought I was doing GREAT…
The score is calculated as described here: https://github.com/storj/storj/blob/cb894965698cea63dd64cde98bd6e1f188dc2c2b/docs/blueprints/node-selection.md
It falls fast if the node has lost data.
The check percentage is the percentage of successful checks over the node's lifetime.
So the score is more precise. If your node managed to lose data, its audit score will fall within minutes, while the percentage may not even show a noticeable difference.
For disqualification we use the audit score.
There’s a bit of background information as well as mathematical work required to extract the applicable equation from that document.
The uptime component is currently ignored when calculating the score.
The score listed in the API is the Total Uplink Reputation.
Given these two non-obvious pieces of information, which require one to have followed along in this forum to understand, the applicable equation for manually calculating a node's score from the numbers presented in the API is:
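(Filling in what I extracted from the blueprint's beta-reputation model; the variable names match the alpha and beta fields in the API output:)

```
score = alpha / (alpha + beta)
```

Plugging in the alpha and beta values the API reports for this node should reproduce the 0.59 shown on the dashboard above.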
Unfortunately, this is a disqualifying score. Strangely, if the (in my opinion erroneously calculated) uptime measurement were included in the reputation calculation, the weight of the missed audits would be less than 1.0. Such a situation would help raise the reputation score of this particular node, since the uptime score is more stable than the audit score.
The labels alpha and beta in the API output are extremely confusing, given that there are also alpha and beta releases of the SN software… It's not clear at all that alpha and beta are variables in the score calculation.
Yes, this is a technical document, and you can see there how the score is calculated. To retrieve the underlying information you can query the API and get the calculated value.
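For example (port 14002 is the default dashboard port; the exact paths can differ between versions, these are from my node):

```sh
# overall dashboard data
curl -s http://localhost:14002/api/dashboard
# per-satellite details, including the audit alpha, beta and score
curl -s http://localhost:14002/api/satellite/<satellite-id>
```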
OK, thanks for all the details. As I was using the successrate.sh script to check and never saw any issues in there in terms of failed audits etc., I thought I was running perfectly.
Actually, if anything is going wrong, it would be good to count whatever errors there are and show them in the dashboard; this caught me by BIG surprise.
Anyway, I downloaded the logs just in case, as in my setup they don't stay around forever. But what's your recommendation for looking through the logs? If I grep for errors I only get these types (context cancelled, which I thought was just me being too slow, hence I ignored them).
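For reference, this is roughly what I'm running (assuming the container is named storagenode; the GET_AUDIT filter is my attempt to narrow it down to audit requests specifically):

```sh
# everything logged as an error
docker logs storagenode 2>&1 | grep -i error
# only audit requests that failed
docker logs storagenode 2>&1 | grep GET_AUDIT | grep -i failed
```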
Thanks. I'm really checking my logs from time to time, and there's nothing in there. Standard log level…
It doesn’t spit out anything… Or maybe I’m just stupid at the moment
In successrate.sh I've never seen any errors / failed audits (is that because the script doesn't work properly any more?).
In the uptime / audit checks that the web interface shows, the numbers are 'okayish', no? Except the Asia one, which is lower. How am I supposed to see that something is broken here?!
satellite: uptime checks // audit checks
us-central: 99.2% // 96.3% (now paused)
stefan-benten: 98.5% // 99.0%
asia-east: 98.4% // 92.2% (now paused)
europe-west: 97.6% // 97.3%
I've randomly browsed through the logfiles (like once or twice a week) and never seen any error other than context cancelled, i.e. too slow…
And I've just run dashj.sh again, and what's striking is that the score is now 1.000 (up from 0.841390) for the first satellite and 0.972972 (up from 0.789686) for the last (both ones that are not paused…), in just a couple of hours? I don't get it.
As I said, the audit score reacts extremely fast to any change in the data. If you lost a noticeable amount of data, it will quickly drop below 0.6.
The same happens when someone runs a clone of a node: it will be disqualified within minutes.
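A rough worked example, assuming the blueprint's update rule with a forgetting factor of 0.95 and weight 1 (the values actually deployed may differ):

```
on a successful audit:  alpha = 0.95 * alpha + 1,  beta = 0.95 * beta
on a failed audit:      alpha = 0.95 * alpha,      beta = 0.95 * beta + 1
score = alpha / (alpha + beta)
```

Starting from a long run of successes (alpha ≈ 20, beta ≈ 0), about ten failed audits in a row already push the score from 1.0 down to 0.6, and a run of successes afterwards pulls it back up over the next few dozen audits. That is why the score can both fall and recover within hours.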
By the way, @tankmann, did you stop your Raspberry Pi 3?
On the dashboard you can see a check percentage for the lifetime of the node. This is just the percentage of successful checks out of the total amount. You can see these numbers in the API too: it's successCount / totalCount * 100. It's not the audit score.
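For example (field names as my node's API returns them; they may differ between versions):

```sh
# lifetime audit check percentage for one satellite
curl -s http://localhost:14002/api/satellite/<satellite-id> \
  | jq '.audit.successCount / .audit.totalCount * 100'
```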
I saw suggestions to show the score here:
But it hasn't been implemented yet. Better to vote for such an idea here: https://ideas.storj.io
Of course the Raspberry Pi 3 was stopped, copied over and then 'killed'. But yes, good question; unfortunately that's not the answer.
So in my case, as the process hung and swapped, it probably just didn't respond at all, which then led to the score going below 0.6… As I don't see any error logs, I assume that data-wise it should be fine.
…
Is there still a timeout in place for audits? I can imagine that in the situation @tankmann describes, with the system becoming unresponsive, the audit interaction might start, but the node isn't able to respond in time or to do anything meaningful like writing to the logs. I have noticed before that a Synology can completely hang when running out of memory; I saw this on my old Synology (6 GB) at least twice. The new one luckily has much more RAM (16 GB).
For what it’s worth, my RAM usage has been pretty much flat over the past week. I doubt the excessive RAM use was caused by Storj.
Thanks for the reply @BrightSilence. In the task manager it showed Docker.
But remember that there is also this weird thing where Docker shows used amount X whereas the process shows a low value; see the screenshot below: 905 MB RAM usage vs. 4.51 GB under the container. I remember you mentioned it was a bug in Docker for Synology, but the last Docker update didn't fix it…
Unfortunately the last docker update was a pretty small one and Synology is still pretty far behind on updates. If you want to know how much RAM is actually in use by your node you can look at the specific process by opening the detailed info of the container and going to the process tab.
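Alternatively, over SSH, `docker stats` reports the container's actual memory use straight from Docker, so the Synology UI bug doesn't affect it (assuming the container is named storagenode):

```sh
# one-shot snapshot of the container's CPU and memory usage
docker stats --no-stream storagenode
```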
Yes that’s clear - thanks.
When I checked the RAM I didn't use the method above; I just checked top over SSH.
But coming back to the real topic, my real questions are as above:
Yes, I didn't check RAM/swap usage, and I'll now get notified, OK.
But in the meantime / before: how could I have seen that some audits might be failing?
And to be honest, I still don't understand whether there are real failures in the logs I uploaded; I don't find any errors besides the too slow / context cancelled ones.
Only when I use the dashj.sh script do I now see that it shows, for example, 16374/16203 under audits, so the 171 that make up the difference did fail. But is that really a failure as in missing data, or just no answer because the node was unresponsive?
I'm still not clear on that… also because I don't understand why my new setup should lose any data at all.