Ok, thanks for all the details. Since I was using the "successrate.sh" script to check and never saw any issues in there in terms of failed audits etc., I thought I was running perfectly.
Actually, if anything is going wrong it would be good to count whatever errors there are and show them in the dashboard - this has caught me by BIG surprise.
Anyway - I downloaded the logs just in case, as in my setup they don't stay around forever, but what's your recommendation for looking through the logs? If I grep for errors I only get these types (context cancelled, which I thought was just me being too slow - hence I ignored them).
Thanks - I really do check my logs from time to time, and there's nothing in there. Standard log level…
It doesn’t spit out anything… Or maybe I’m just stupid at the moment
In successrate.sh I've never seen any errors / failed audits (is that because the script doesn't work properly anymore?)
In the uptime / audit checks that the web interface shows, the numbers are 'okayish', no? Except the Asia one, which is lower. How am I supposed to see that something is broken here?!
satellite: uptime checks // audit checks
us-central: 99.2% // 96.3% (now paused)
stefan-benten: 98.5% // 99.0%
asia-east: 98.4% // 92.2% (now paused)
europe-west: 97.6% // 97.3%
I've randomly browsed through the logfiles (like once or twice a week) and never seen any error other than context cancelled, i.e. too slow…
And I've just run dashj.sh again, but what's striking is that the score is now 1.000 (from 0.841390) for the first and 0.972972 (from 0.789686) for the last (both ones that are not paused…), just within a couple of hours? I don't get it.
As I said, the audit score reacts extremely fast to any change in data. If you lost a noticeable amount of data, it will quickly drop below 0.6.
The same happens when someone runs a clone of the node: it will be disqualified within minutes.
By the way, @tankmann, did you stop your Raspberry Pi3?
On the dashboard you can see a check percentage for the lifetime of the node. This is just the percentage of successful checks out of the total amount. You can see them in the API too. It's successCount/totalCount*100. It's not an audit score.
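A minimal sketch of that calculation against the node's local dashboard API (assuming the default port 14002 and the field names above; the exact endpoint path and JSON layout may differ between versions):

# example only - endpoint path and field nesting are assumptions
curl -s http://localhost:14002/api/satellite/<satellite-id> \
  | jq '.data.audit.successCount / .data.audit.totalCount * 100'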
I saw suggestions to show the score here:
But it hasn't been taken into account yet. Better to vote for such an idea there: https://ideas.storj.io
Of course, the Raspberry Pi3 was stopped, copied over and then 'killed'. But yes, good question - unfortunately not the answer.
So in my case, as the process hung / swapped, it probably just didn't respond at all, which then led to the score below 0.6 … as I don't see any errors in the logs, I assume data-wise it should be good.
…
Is there still a timeout in place for audits? I can imagine that in the situation @tankmann describes, with the system becoming unresponsive, the node might start the audit interaction but isn't able to respond in time or do anything meaningful like writing to the logs. I have noticed before that a Synology can completely hang when running out of memory. I noticed this on my old Synology (6GB) at least 2 times. The new one luckily has much more RAM (16GB).
For what it’s worth, my RAM usage has been pretty much flat over the past week. I doubt the excessive RAM use was caused by Storj.
Thanks for the reply @BrightSilence, so in the task manager it showed docker.
But remember that there is also this weird thing with Docker showing used amount X whereas the process shows a low value - see screenshot below: 905 MB RAM usage vs. 4.51 GB under the container. I remember you mentioned it was a bug in Docker for Synology, but the last Docker update didn't fix it…
Unfortunately the last docker update was a pretty small one and Synology is still pretty far behind on updates. If you want to know how much RAM is actually in use by your node you can look at the specific process by opening the detailed info of the container and going to the process tab.
Yes that’s clear - thanks.
When I checked the RAM I didn't use the method above, I just checked top over SSH.
But coming back to the real topic, my real questions are as above:
Yes, I didn't check RAM/swap usage, and now I will get notified, OK.
But in the meantime / before: how could I have seen that some audits might be failing?
And to be honest, I still don't understand whether there are real failures in the logs I uploaded; I don't find any errors besides the too slow / context cancelled ones.
Only when I use the dashj.sh script do I now see that it shows, for example, 16374/16203 under audits (so the 171 that is the difference did fail). But is that really a fail as in missing data, or just no answer due to the node being unresponsive?
I'm still not clear on that … also because I don't understand why my new setup should lose any data at all.
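For reference, this is roughly the filter I use when I say there are no other errors (just a sketch - the exact spelling and casing of the context cancelled lines may differ):

sudo docker logs storagenode 2>&1 | grep -i error | grep -vi "context cancel" | tail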
Well, I can only tell you that if your node is offline, you don’t fail audits. Just uptime checks. Hence why I asked about the timeout. It’s my guess that your node did respond in some way, but wasn’t able to complete the audit download in time. Hopefully someone from Storj can tell you more.
In the meantime, did you see any additional audit failures since you've been up and running again? That may help narrow it down to only when your NAS was unresponsive.
Funny thing is, I’ve never seen ANY errors ever since I moved to Synology.
I always used the successrate.sh script (is that still valid for checking failed audits?). This is also from this morning, so from the last two days (1.5 days):
========== AUDIT =============
Successful: 188
Recoverable failed: 0
Unrecoverable failed: 0
Success Rate Min: 100.000%
Success Rate Max: 100.000%
========== DOWNLOAD ==========
Successful: 10647
Failed: 1
Success Rate: 99.991%
========== UPLOAD ============
Successful: 12566
Rejected: 0
Failed: 202
Acceptance Rate: 100.000%
Success Rate: 98.418%
========== REPAIR DOWNLOAD ===
Successful: 29
Failed: 0
Success Rate: 100.000%
========== REPAIR UPLOAD =====
Successful: 145
Failed: 1
Success Rate: 99.315%
When I searched the logs this morning, this is what I got:
tankmann@Gynology:/ sudo docker logs storagenode 2>&1 | grep AUDIT | grep failed | tail
tankmann@Gynology:/
Yes, my Docker / node (for whatever reason) rotates the logs every 3 to 4 days - I think there's a size limit … once the next update comes I'll redirect the log to a file in the config.yaml.
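For reference, this is roughly what I have in mind, assuming the storagenode config.yaml supports the log.output option (the file path is just an example for my setup):

# example only - file path is an assumption
log.level: info
log.output: "/app/config/node.log"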
I also uploaded the logs from the days of the incident to my Dropbox, but I didn't find anything weird in them - maybe I'm stupid? Dropbox - errorlogs
Images of the incident: what you can clearly see is that disk utilization / IOPS or whatever dies completely during the swapping period.
Since you previously determined the log doesn't show any errors for the initial problem, I don't think it's a good way to check for remaining issues. You can use the dashboard API or the dashj.sh script you used before.
Please check:
Did the amount of audits increase since last checked?
Were there no more failed audits since last checked? (you'll have to subtract success from total and compare to your previous post - see the sketch after this list)
Check 1 will probably not be true for the DQ’ed sats, so focus on the ones that still work. Additionally you could check the scores which should have gone up as well.
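A minimal sketch of that comparison, assuming dashj.sh prints lines in the format "<sat> <auditTotal>/<auditSuccess> <auditScore> <uptimeTotal>/<uptimeSuccess> <uptimeScore>" as in your earlier output:

# example only - paste two snapshots of the same satellite line
echo "118UWp 16298/16127 0.841390 93919/92551 0.991399
118UWp 16404/16233 1.000000 94069/92701 0.998095" \
| awk -F'[ /]' '{ print $1, "failed audits:", $2-$3, "failed uptime checks:", $5-$6 }'

If the failed counts are the same for both snapshots, there were no new failures in between.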
Ok, great approach. For 1 and 2, comparing with the figures from above:
118UWp 16298/16127 0.841390 93919/92551 0.991399
118UWp 16404/16233 1.000000 94069/92701 0.998095
Difference audits: 171 vs. 171 (no more audit failures)
Difference uptime: 1368 vs. 1368 (no more uptime failures)
121RTS 345/318 0.598738 48134/47368 1.000000
121RTS 345/318 0.598738 48178/47412 1.000000
Difference audits: 27 vs. 27 (PAUSED SATELLITE // no change)
Difference uptime: 766 vs. 766 (PAUSED SATELLITE // no change even though some more uptime checks)
12EayR 14187/13669 0.598737 101408/100548 1.000000
12EayR 14187/13669 0.598737 101452/100592 1.000000
Difference audits: 518 vs. 518 (PAUSED SATELLITE // no change)
Difference uptime: 860 vs. 860 (PAUSED SATELLITE // no change even though some more uptime checks)
12L9ZF 20395/19852 0.789686 84714/82712 0.991485
12L9ZF 20694/20151 1.000000 85057/83055 0.999729
Difference audits: 543 vs. 543 (no more audit failures)
Difference uptime: 2002 vs. 2002 (no more uptime failures)
So that clearly shows that since the swap happened / the Synology was rebooted, there has not been a single audit issue.
So that supports that in general there's no issue with my setup, right?
I would say that supports the theory that there is no actual data loss and that your node was just temporarily in a state where it was online but unable to correctly respond to audits. As expected, the audit scores are quickly recovering on the remaining satellites.
In my opinion this warrants reinstating the node on the 2 satellites where it was disqualified as I expect its score would recover quickly. But that call is up to Storjlabs. It would help if you could try to find out what caused the memory issues to begin with and prevent repetition. But it seems fairly evident to me this is not an issue of actual data loss.
I would say it's pretty much unavoidable. If you allow any way out after the node has received the audit request, it can just try to go offline to avoid failing an audit for missing data. I don't see a way to prevent this kind of cheating while also allowing a barely responsive, nearly hung system to still succeed or be allowed not to respond to audits.
If the protocol is designed such that a satellite or customer requests data pieces without ensuring that the node is operational… and it's possible for a node to be "half-operational" … then it's possible for the Satellite or Customer to erroneously cause reputation damage to an SN. This is just as egregious a problem as a node dropping out of the network in an attempt to avoid being caught without data.
Both sides of the system need to be operational in a trustless way; otherwise, one side of the transaction can "cheat" the other.
There are methods available for ensuring that a data piece has been accepted and stored properly. If the protocol allows for one side to make an unverified claim, then the protocol is fatally flawed since participation in the network requires that both sides of the communication channel know that transactions are ACID.
Unpausing nodes is no longer possible - hence my node won't be unpaused.
'Paused' is also still the wrong wording in the web interface; it will be fixed 'soon'.
I wrote back that in my case I didn't lose any data, which means I should have failed on uptime but not on data loss - even though I understand that my node became unresponsive / didn't answer properly / swapped, which means it's on my side.
Nevertheless, I think data validation should be done somehow 'differently', also because I'm holding quite a lot of data for the other satellites which is now not used anymore and just drives increased costs for Storj.
I just hope (but don't think so) that if new satellites come my node won't get any penalties - so far both EU satellites are bringing the majority of the test traffic… Can't wait to see REAL DATA …