Audits failed because of context canceled errors

We have a lot of disqualified nodes because of too many failed audits. It doesn't look like normal behavior.
I have a report from SNO @Krey, who gathered full statistics from the logs.
He wrote a log analyzer, which can even draw some graphs in the terminal!

The report from @Krey

HW: Xeon, 8 cores, 32 GB RAM; two HDDs in RAID0; work mode: idle.

The graph from the storagenode for the most active satellite:


The graph for the same satellite for failed audits:

As you can see, the audits failed during the same period when the audit volume was at its peak.
From the logs there is only one reason: GET_AUDIT - context canceled

If we look at the stats from the same node for another satellite in the same time frame, we see that all audits completed.

Monthly uptime:

The result: node 1spYgC4XsjQiQgvynaAfzigRAUGcVTHcDQLDkGqkp8F1Rt4N7k was disqualified at 2019-07-10 01:10:35.343351+02 because of too many failed audits.

Thanks to SNO zblzgamer for the stats site!
He built a site for v3 stats: http://storjnet.info/


Thanks, Alexey.
My English is bad, but I'll try to add some details.
On this node I have a RAID1 (mirror) pool of two 8 TB HDDs, using ZFS. At the time the DQ happened, I looked at the netdata (system resource monitor) page and noticed abnormal behavior of the storagenode process. It was generating a 100 MB/s read workload on the ZFS L2ARC SSD, while the main pool had no workload at all. It looks like storj was reading the same data in a loop the whole time and not answering audits properly.

ADD:
Another strange thing is that GET_AUDIT actually failed only 48 times in this time frame, which means about 75% of audits completed. That is above the 60% threshold; in other words, it should not have caused disqualification.
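(Rough arithmetic, assuming the 48 failures are the 25% share implied by the 75% completion figure: the window would contain about 48 / 0.25 ≈ 192 GET_AUDIT requests, roughly 144 of them successful. The exact counts are in the analyzer report below.)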

Full analyzer report: https://pastebin.com/raw/QsWVUSDj
Zipped log: https://transfer.sh/15MDzX/2019-07-10-node01.zip
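
For anyone who wants to cross-check the failure count from a raw storagenode log, here is a minimal counting sketch in Go. It only assumes that audit-related lines contain the substring GET_AUDIT and that the failures of interest contain "context canceled"; the file name node.log is a placeholder, and note that it counts log lines mentioning GET_AUDIT, not unique audits (a single audit can log both a start and a finish line).

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

func main() {
	// "node.log" is a placeholder; point it at your own storagenode log file.
	f, err := os.Open("node.log")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	var total, canceled int
	scanner := bufio.NewScanner(f)
	scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024) // allow long log lines
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.Contains(line, "GET_AUDIT") {
			continue
		}
		total++ // any line mentioning GET_AUDIT (started, downloaded, or failed)
		if strings.Contains(line, "context canceled") {
			canceled++ // the failure signature discussed in this thread
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("GET_AUDIT lines: %d, of which context canceled: %d\n", total, canceled)
}
```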

What does the “ID Difficulties” chart signify? Is it a health / reputation metric?

Oh, it's the identity difficulty. Never mind!

Very nice site!

The 60% is misleading. We are using a system called beta reputation. This system works with two numbers, alpha and beta. Alpha keeps track of the audit success messages while beta counts the failures. We can balance the weight of both numbers; for example, we could tweak it to allow only 3 audit failures on top of 1000 audit successes. It is correct that we disqualify storage nodes for having alpha/(alpha+beta) < 60%, but that doesn't mean alpha and beta increase and decrease at the same rate. I can't give you a better answer because the math behind it is complicated and we will have to adjust it.
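
To make the "moving average" behavior concrete, here is a minimal sketch of a beta-reputation update, assuming an exponential forgetting factor lambda and a per-audit weight w; the satellite's actual parameters and exact formula may differ.

```go
package main

import "fmt"

// Reputation holds the two counters described above: alpha tracks audit
// successes, beta tracks failures.
type Reputation struct {
	Alpha, Beta float64
}

// Update applies one audit result. lambda < 1 decays the history, so recent
// audits weigh more than old ones and the score behaves like a moving average.
// (Assumed update rule for illustration; the production formula may differ.)
func (r *Reputation) Update(success bool, lambda, w float64) {
	v := -1.0
	if success {
		v = 1.0
	}
	r.Alpha = lambda*r.Alpha + w*(1+v)/2
	r.Beta = lambda*r.Beta + w*(1-v)/2
}

// Score is the value compared against the disqualification threshold (0.6).
func (r *Reputation) Score() float64 {
	return r.Alpha / (r.Alpha + r.Beta)
}

func main() {
	r := &Reputation{Alpha: 1}
	for i := 0; i < 1000; i++ {
		r.Update(true, 0.95, 1)
	}
	fmt.Printf("after 1000 successes: %.3f\n", r.Score()) // ~1.000
	for i := 0; i < 10; i++ {
		r.Update(false, 0.95, 1)
	}
	fmt.Printf("after 10 failures in a row: %.3f\n", r.Score()) // drops just below 0.6
}
```

With these example numbers, the score after a long run of successes sits near 1.0, yet roughly ten consecutive failures are enough to push it under 0.6 regardless of how long the successful history was, which is exactly the "fail a few in a row and get disqualified" effect described below.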

Are alpha and beta calculated continuously from node start, or are they reset at some point? This node is old and has thousands of successful audits.

And about audits. An SNO can be responsible only for uptime and stored data. We can't be responsible for storagenode process timeouts and its reliability; software always has bugs. DQ should be triggered only when you are 100% sure it is justified. "Context canceled" on audits is something different, where the reason is unknown.

Let's say you fail an audit and beta increases a little bit. Now you have many audit successes. Over time alpha and beta will get back to the optimum values. On the other hand, you can have a very long history of successful audits but suddenly fail a few in a row and get disqualified. It works like a moving average.

That makes no difference from the satellite's point of view. We have to disqualify the storage nodes or we risk losing data.

That is the reason we call it alpha. If I have to choose between finding these issues now, even if that means we might disqualify a few storage nodes for no good reason, or disabling that feature and waiting for the beta launch… the result would be the same but far more destructive and expensive in beta.

We are aware that storage nodes got disqualified because of bugs. @Alexey can help in that situation.

I'm trying to convey that disqualification for an unknown "context canceled" reason should not happen in alpha, beta, or even production.
All SNOs currently run storj just for the idea. How many of them will remain after several unfair DQs, once everyone realizes that the chances of getting the 50% escrow back are no better than beating a casino?

AFAIK @Alexey does not have the ability to restore the node. He can transfer the escrow to another node, which I would have to create again by registering through the general wait list. It's humiliating. I am not going to register yet another mail account like some flawed troll. You built disqualification mechanisms, but you forgot about arbitration and node recovery.

My second node is disqualified too. Of course, all the data was in place before the network wipe.
I think there's a bug in the audit process.

I have checked that already. There is no bug in the audit process, but there are bugs on the storage node side. With the current release most of that is fixed. If you got disqualified on the old release, you can write to support and we will double-check it.

Edit: I am aware that you are part of the support team. The message is more for everyone else in the same situation :slight_smile:


FYI, our support staff does not have the power to transfer held amounts. Support staff can only diagnose the problem, pass the information on to the data science team to analyze how to proceed in each case, and then relay the data science team's decision back to the affected SNO who filed a support ticket. So we did not forget about arbitration and node recovery. When we get feedback approving the restoration of paused nodes for SNOs who filed support tickets, we will notify them. Unfortunately, this process takes longer than any of us would like. Thank you for your patience.
