Tuning audit scoring

Thanks @thepaul for that great and extensive response and for sharing both your code and results!

I took some time to look through it all and play with the Python script you provided. I’ll try to collect my initial thoughts here.

That is an absolutely fair criticism of my suggestion, and I agree that because of this, sticking with v=1 for both failure and success is not just preferable but critical, both for ongoing monitoring and for the resulting number to have any actual meaning. So, yes, no argument from me here. There are other, better ways to achieve our goals anyway.

I have to infer a little here as to why this would be hard to adopt. Is it because some of the scores may be quite low by chance under the current system, and with alpha at 1000 it would take a really long time to recover from that? If so, I guess you could give node reputations a reset. The worst that could happen is that bad nodes stick around slightly longer, but the new parameters would still make quick work of them if they are actually bad.
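
To get a feel for the scale of that, here is a rough sketch. This is not based on your implementation: I’m assuming the usual forgetting-factor form of the beta update (success: alpha ← λ·alpha + 1, beta ← λ·beta; failure: alpha ← λ·alpha, beta ← λ·beta + 1; score = alpha/(alpha+beta)), a purely hypothetical starting score of 0.90, and the lambda of 0.999 I use in the runs further down:

```python
# Rough sketch (not from simrep.py) of how long recovery could take.
# Assumed update rule with weight w = 1:
#   success: alpha = lam*alpha + 1 ; beta = lam*beta
#   failure: alpha = lam*alpha     ; beta = lam*beta + 1
#   score   = alpha / (alpha + beta)
# The starting score of 0.90 is just a hypothetical example.

lam = 0.999
total = 1 / (1 - lam)                      # saturated alpha + beta (= 1000 here)
alpha, beta = 0.90 * total, 0.10 * total   # node currently sitting at score 0.90

audits = 0
while alpha / (alpha + beta) < 0.96:       # climb back above a 96% threshold
    alpha = lam * alpha + 1                # every audit succeeds from now on
    beta = lam * beta
    audits += 1

print(f"{audits} consecutive successful audits needed")  # on the order of 900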

This of course has to be priority 1 for the reputation system. Agreed. This needs to be ensured before we look at other criteria. But I don’t think we should lose sight of those. I’ll get back to that in a bit.

I really appreciate you sharing this! I’m already using your script to run more simulations.

I’m not entirely sure what this is for. By setting the initial score to 1, the beta formula already has a kind of grace period built in, because it assumes a prior history of good audits and only slowly adjusts from there. I’m not against it, but it’s just not entirely clear to me why it is needed.
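
For what it’s worth, here is roughly what I mean, using the same assumed update rule as in my sketch above and the -a 1000 -b 0 prior from the runs further down (the lambda value is again just the one I use later):

```python
# What the built-in grace period looks like under the assumed update rule,
# starting from the -a 1000 -b 0 prior (score starts at exactly 1).

lam = 0.999
alpha, beta = 1000.0, 0.0

for failures in range(1, 11):              # ten failed audits right away
    alpha = lam * alpha                    # failure: alpha only decays
    beta = lam * beta + 1                  # ...and beta absorbs the failure
    print(f"after {failures:2d} failures: score = {alpha / (alpha + beta):.4f}")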

These are great results indeed!

So I think this is great and I’m pleased to see a lot of thought has gone into this subject. I want to briefly get back to the “issues” I listed in my original post and go through them.

  • The scores are extremely volatile and unpredictable
  • **Disqualification is inconsistent**
  • The timing of disqualification is entirely determined by random “luck”
  • A node operator might look at the score and see it increasing and assume issues are solved
  • **Nodes with over 10% of data lost are allowed to survive for very long periods of time (nearly indefinitely)**
  • **After failed audits, the score recovers just as fast, so failures never really stick**

I think the ones in bold are solved by the changes you propose. The last one doesn’t necessarily get a much bigger per-failure impact on the score, but the fact that the score dropped at all would now stay visible for longer.

But I want to address the remaining ones as well.

  • The scores are extremely volatile and unpredictable
  • The timing of disqualification is entirely determined by random “luck”
  • A node operator might look at the score and see it increasing and assume issues are solved

I think I can address all three at once. I ran a quick simulation with your suggested parameters and 4% data loss.


I zoomed in on the relevant part of the graph, because the margin that tells the node operator whether there is an issue, and how bad that issue is, has now been reduced to 89-100%. As you can see, the score is unfortunately still all over the place within that margin, and there are moments (circled) when the node operator might think the problems are solved, even though 4% loss is enough that the node will almost certainly be disqualified eventually. The timing of that disqualification, however, is still very random.

So, what could be done about that? Well, the only way to stabilize the score is to increase the remembering factor lambda. In the end, what you really want to say is: if a node loses more than 4%, it’s out. The only reason we’re setting the threshold lower than 96% is to account for volatility. So I would suggest raising lambda to remove most of the volatility and raising the disqualification threshold as well, because we don’t need as much slack anymore. I ran the same simulation with lambda at 0.999 and the threshold at 95%.
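
To make that reasoning a bit more concrete, here is a rough numerical check (same assumed update rule as in my earlier sketches; the 0.95 lambda is only there as a contrast and isn’t anyone’s proposal). With a constant failure probability d, the expected alpha and beta settle near (1−d)/(1−λ) and d/(1−λ), so the long-run score hovers around 1−d; the gap between 1−d and the threshold only has to absorb random dips, and a higher lambda makes those dips shallower.

```python
import random

# Rough numerical check of the reasoning above (assumed update rule as before).

def lowest_score(d, lam, rounds=20000, burn_in=5000):
    """Lowest score seen after burn-in: a crude measure of the downward wobble."""
    alpha, beta = 1 / (1 - lam), 0.0        # start at a saturated score of 1
    lowest = 1.0
    for i in range(rounds):
        if random.random() < d:             # failed audit
            alpha, beta = lam * alpha, lam * beta + 1
        else:                               # successful audit
            alpha, beta = lam * alpha + 1, lam * beta
        if i >= burn_in:
            lowest = min(lowest, alpha / (alpha + beta))
    return lowest

d = 0.04
for lam in (0.95, 0.999):
    print(f"lambda={lam}: long-run score ~ {1 - d:.2f}, "
          f"lowest seen in simulation {lowest_score(d, lam):.4f}")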


Scores are more stable, disqualification is a little more predictable, and a recovering score that shows more than 1% improvement almost certainly means the node’s situation has actually improved.
I used your script to run a few simulations with these settings:

./simrep.py -d 0.02 -r 10000 -x 3000 -l 0.999 -w 1 -q 0.95 -a 1000 -b 0 -g 50
with 2.00% data loss, 0.00% of runs hit dq (after 0.00 rounds on avg)

./simrep.py -d 0.04 -r 10000 -x 3000 -l 0.999 -w 1 -q 0.95 -a 1000 -b 0 -g 50
with 4.00% data loss, 36.40% of runs hit dq (after 6183.83 rounds on avg)

./simrep.py -d 0.05 -r 10000 -x 3000 -l 0.999 -w 1 -q 0.95 -a 1000 -b 0 -g 50
with 5.00% data loss, 100.00% of runs hit dq (after 3035.49 rounds on avg)

Tune it even tighter by raising the threshold to 96% and you get exactly the result you wanted: no nodes disqualified at 2% data loss, all nodes disqualified at 4% data loss.

./simrep.py -d 0.02 -r 10000 -x 3000 -l 0.999 -w 1 -q 0.96 -a 1000 -b 0 -g 50
with 2.00% data loss, 0.00% of runs hit dq (after 0.00 rounds on avg)

./simrep.py -d 0.03 -r 10000 -x 3000 -l 0.999 -w 1 -q 0.96 -a 1000 -b 0 -g 50
with 3.00% data loss, 21.73% of runs hit dq (after 6275.94 rounds on avg)

./simrep.py -d 0.04 -r 10000 -x 3000 -l 0.999 -w 1 -q 0.96 -a 1000 -b 0 -g 50
with 4.00% data loss, 99.77% of runs hit dq (after 2906.48 rounds on avg)

Ok, I guess 99.77% isn’t all, but I hope you’ll forgive me that last 0.23%. :wink:
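
In case anyone wants to reproduce or extend these runs, here is a tiny wrapper that sweeps the data-loss values through simrep.py. The flags are copied from the commands above; everything else is just scaffolding of my own, so adjust the path and values as needed:

```python
#!/usr/bin/env python3
# Sweep a few data-loss values through simrep.py with the parameters above.

import subprocess

LAMBDA = "0.999"
THRESHOLD = "0.96"

for loss in ("0.02", "0.03", "0.04", "0.05"):
    cmd = ["./simrep.py", "-d", loss, "-r", "10000", "-x", "3000",
           "-l", LAMBDA, "-w", "1", "-q", THRESHOLD,
           "-a", "1000", "-b", "0", "-g", "50"]
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)   # simrep.py prints its own summary line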

Of course, doing this would require addressing the issue you mentioned with implementing this on top of existing scores. Could you perhaps expand on that a little?
Additionally, it takes a few more audits to disqualify a node that lost all data, but as stated below… that may not be a bad thing.

Essentially, the balance between a high lambda and threshold versus a low lambda and threshold is one between consistency and speed to DQ. I don’t know how important the speed to DQ is, but given that this adjustment would already ensure that a lot more of the less reliable nodes don’t survive, I think it’s probably OK to give up a little DQ speed.

There is one last concern I want to address.

What happens when temporary problems cause audit failures?

This may be especially relevant for the suspension score, which I assume would be aligned with the audit score system so as not to give malicious node operators an incentive to redirect known audit failures into unknown audit failures. Intermittent network issues or hanging systems can temporarily cause multiple audit failures in a row that don’t reflect the durability of the data on the node. Recently we’ve actually seen an increasing number of reports on the forum of nodes being suspended because of this.

With your suggested settings, it would take only 9 consecutive failed audits to be disqualified/suspended.


I took a quick look at my logs. That would give me just over 10 minutes to fix a temporary issue on my largest node for Saltlake.

2022-02-09T17:14:45.395Z        INFO    piecestore      download started        {"Piece ID": "YPQSHALIEQZEIRDCI2TAC4QFKAC4BR7X5NCAMYQ63AUZRVL5CPXA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2022-02-09T17:14:45.576Z        INFO    piecestore      downloaded      {"Piece ID": "YPQSHALIEQZEIRDCI2TAC4QFKAC4BR7X5NCAMYQ63AUZRVL5CPXA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2022-02-09T17:17:33.357Z        INFO    piecestore      download started        {"Piece ID": "4HGXC2QJBZZ4OF3XOZ2EOYOKJUF4X2WXW3DOVK3B5L2RQLM7O2KQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2022-02-09T17:17:33.524Z        INFO    piecestore      downloaded      {"Piece ID": "4HGXC2QJBZZ4OF3XOZ2EOYOKJUF4X2WXW3DOVK3B5L2RQLM7O2KQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2022-02-09T17:18:14.264Z        INFO    piecestore      download started        {"Piece ID": "NUP3WV4EHE2TKYRAHOMJNAYVJJLIAR4HDXY2DAXW4EGYO3FNB44A", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2022-02-09T17:18:14.426Z        INFO    piecestore      downloaded      {"Piece ID": "NUP3WV4EHE2TKYRAHOMJNAYVJJLIAR4HDXY2DAXW4EGYO3FNB44A", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2022-02-09T17:18:39.392Z        INFO    piecestore      download started        {"Piece ID": "3HKYXF5CHTH24JLBK6GDXE7C3Z6R6UBHLS6LMQ2E3XJW2E3RZFOQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2022-02-09T17:18:39.557Z        INFO    piecestore      downloaded      {"Piece ID": "3HKYXF5CHTH24JLBK6GDXE7C3Z6R6UBHLS6LMQ2E3XJW2E3RZFOQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2022-02-09T17:18:50.254Z        INFO    piecestore      download started        {"Piece ID": "QUTGWH4E7OLDF5WCYMJMCOVQS6QT5GRZEOKFHKAG4WZJLXPOOLKA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2022-02-09T17:18:50.433Z        INFO    piecestore      downloaded      {"Piece ID": "QUTGWH4E7OLDF5WCYMJMCOVQS6QT5GRZEOKFHKAG4WZJLXPOOLKA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2022-02-09T17:20:36.582Z        INFO    piecestore      download started        {"Piece ID": "MVK7W4PKFYJ7IGOKZ3MDY64I32TOGIF5CZO4IPKIWYLHX6YLHJFA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2022-02-09T17:20:36.762Z        INFO    piecestore      downloaded      {"Piece ID": "MVK7W4PKFYJ7IGOKZ3MDY64I32TOGIF5CZO4IPKIWYLHX6YLHJFA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2022-02-09T17:20:41.201Z        INFO    piecestore      download started        {"Piece ID": "3G5GZIVNQWRCYXS3ZOLGNNX5B5LP3RRG4BK75CLS64RKGSL2JUXA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2022-02-09T17:20:41.357Z        INFO    piecestore      downloaded      {"Piece ID": "3G5GZIVNQWRCYXS3ZOLGNNX5B5LP3RRG4BK75CLS64RKGSL2JUXA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2022-02-09T17:23:41.067Z        INFO    piecestore      download started        {"Piece ID": "4AHKUKGZ2OBELZRHQB6UPDF5K2OJZSAAZDPEJGRJHX22L57FKJKQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2022-02-09T17:23:41.260Z        INFO    piecestore      downloaded      {"Piece ID": "4AHKUKGZ2OBELZRHQB6UPDF5K2OJZSAAZDPEJGRJHX22L57FKJKQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2022-02-09T17:25:06.862Z        INFO    piecestore      download started        {"Piece ID": "MZ4VDL6NLGOVQ27ENNPHPUTNZAREL3OACRGGLTK3MZRM45NKEAYA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2022-02-09T17:25:07.240Z        INFO    piecestore      downloaded      {"Piece ID": "MZ4VDL6NLGOVQ27ENNPHPUTNZAREL3OACRGGLTK3MZRM45NKEAYA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}

Raising lambda would help with this, but I don’t think it solves the core issue. Implementing 0.999 would raise the number of sequential failed audits needed to 52 with a threshold of 95%, or 41 with a threshold of 96%. On my node that would at best give me about an hour to fix things, and depending on when such an issue occurs, I would most likely not be in time to fix it.
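
To show my working on those numbers (same assumed alpha/beta update as in my earlier sketches, starting from a fully saturated score of 1, i.e. alpha = 1/(1−λ), beta = 0): in that state each consecutive failure multiplies the score by exactly λ, so the count is the smallest n with λ^n below the threshold.

```python
import math

# Consecutive failed audits needed to fall below the DQ threshold, assuming
# the update rule above and a node starting at a saturated score of 1.

def failures_to_dq(lam: float, threshold: float) -> int:
    return math.floor(math.log(threshold) / math.log(lam)) + 1

print(failures_to_dq(0.999, 0.95))   # 52
print(failures_to_dq(0.999, 0.96))   # 41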

The problem is not the number of audits required to DQ, but that the limit isn’t based on time instead. 52 audits is perhaps too long for new nodes, which can take weeks to reach that number, but way too short for larger nodes like mine. I don’t yet have a suggestion on how to solve this, and perhaps it’s a separate problem that requires a separate solution. But I wanted to mention it, since your suggested numbers actually seem to make this slightly worse rather than better.

Thanks again for picking this subject up though @thepaul, looking forward to your response!
