Audits stopped. Why?

I have one node that is not upgraded, and I’ll keep it until it dies: two IronWolf drives on a Synology with 1 GB of RAM. It’s slow, but earnings are similar to other nodes.

… attract new customers big and small, or limit the creation of new nodes with some kind of ticketing system. The /24 subnet limitation seems to fail against experienced IT professionals and big investors.

I was referring above all to the drop in NAS performance when services such as the filewalker or the garbage collector start. These small NAS units appear to be undersized.

That’s a pretty low-power CPU too. I’m guessing you’re seeing pretty high I/O wait as well. I’m not surprised your dashboards are loading slowly or not at all. You’re putting that thing to work. :smiley:

No, not at all, I think. @john has mentioned that Storj is thinking about how to achieve exabyte scaling. I believe/hope such plans are made for a reason, so plenty of nodes will probably be required in the future.

Yes, it’s dying under the blows of trash :dizzy_face:

Indeed, ingress should still remain high; it was growing.

We are trying :slight_smile: Onboarding new customers and increasing customer usage is our main focus as a company. We have been onboarding new customers each month and are seeing promising signs!

2 Likes

I agree. There was a lot of good activity by the end of last year.
But now there is also an increase in the number of nodes, from I think 16k to almost 21k.

So there are many more mouths to feed. :grin:

There might be some misunderstandings:

  1. A timed-out response to an audit just puts your node in ‘containment’, wherein we pause audits to your node for a bit and then ask for the same piece again. We keep pausing and asking up to 3 times, or until your node gives a definite response (see the sketch after this list). So an overloaded system should not lose reputation very quickly.
  2. That’s right, a timed-out repair does not impact reputation, so lots of repair requests would not necessarily lead to disqualification.
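
For illustration, a minimal sketch in Go of the containment flow described in point 1. Everything here (function names, the retry pause, the error handling) is hypothetical and simplified, not the actual satellite code:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTimeout stands in for a node that did not answer in time.
var errTimeout = errors.New("audit response timed out")

// auditPiece is a stand-in for requesting one piece from a node and
// verifying it. Here it always times out, simulating an overloaded node.
func auditPiece(nodeID, pieceID string) error {
	return errTimeout
}

// auditWithContainment keeps a node "in containment": it re-asks for
// the same piece up to maxRetries times, pausing between attempts, and
// only gives up after repeated timeouts. A definite success or a
// definite failure ends containment immediately.
func auditWithContainment(nodeID, pieceID string) error {
	const maxRetries = 3
	for attempt := 1; attempt <= maxRetries; attempt++ {
		err := auditPiece(nodeID, pieceID)
		if err == nil || !errors.Is(err, errTimeout) {
			return err // definite answer, good or bad
		}
		fmt.Printf("attempt %d timed out, pausing audits for this node\n", attempt)
		time.Sleep(time.Second) // stand-in for the real containment pause
	}
	return fmt.Errorf("piece %s: no definite answer after %d attempts", pieceID, maxRetries)
}

func main() {
	fmt.Println(auditWithContainment("node-1", "piece-abc"))
}
```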

It’s a distributed system on an untrusted network- it’s all workarounds :smiley:. But fundamentally, if we make it too hard to run a viable storage node, our network collapses and our product fails. So we’re well incentivized to keep it manageable.

We do mark nodes offline if they fail to check in for a period of time. And, as mentioned, timing out on audit responses affects reputation less than failed audits. There is also the planned feature we’ve mentioned a few times (or is it in already? I haven’t checked recently) where the node can shut itself down if disk reads are too slow.

I’m afraid selecting audits by node first would require us to keep a per-node piece map, which would dramatically increase satellite costs. I don’t expect it will happen. For the time being, our trust model requires every block of N bytes to have the same chance to be audited as every other, so the number of audits to a node needs to be linear with the amount of stored data.
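
As a back-of-the-envelope illustration of that linearity (my own toy numbers and function, not the actual audit scheduler): if every stored byte is equally likely to be audited, a node’s expected share of audits equals its share of the network’s data.

```go
package main

import "fmt"

// expectedAudits returns how many of totalAudits a node should expect
// when each stored byte has the same chance of being audited.
func expectedAudits(nodeBytes, networkBytes, totalAudits float64) float64 {
	return totalAudits * nodeBytes / networkBytes
}

func main() {
	const tb = 1e12 // illustrative sizes only
	// A 10 TB node on a 25 PB network, with 100k audits per day network-wide:
	fmt.Printf("%.1f audits/day\n", expectedAudits(10*tb, 25_000*tb, 100_000))
	// Doubling the stored data doubles the expected audits:
	fmt.Printf("%.1f audits/day\n", expectedAudits(20*tb, 25_000*tb, 100_000))
}
```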

5 Likes

No misunderstanding on my part; I was aware of what you mentioned. Containment probably saved my node, as it likely would not have survived this situation without it.

Of course. I’ll keep reporting back if such things occur, but we’ve seen nodes on the brink of failure or beyond recover from such scenarios. There is something to be said for suspension in scenarios where a node goes from working fine to all timeouts in containment.

I do wonder if such features would still work in this unresponsive state. It was hard to tell what did and didn’t work at the time, since I couldn’t connect to the system.

I feared performance would be an issue here; that’s understandable. An alternative would be to scale the impact of each audit on the score by the number of pieces on the node. That could probably be done without significant performance impact.
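
A rough sketch of what that could look like, assuming an exponential-moving-average style audit score; the scaling rule, constants, and names below are my invention for illustration, not an actual Storj proposal:

```go
package main

import (
	"fmt"
	"math"
)

// updateScore nudges the score toward 1 on success and 0 on failure.
// The forgetting factor lambda moves closer to 1 as the node holds
// more pieces, so a single audit shifts a big node's score less.
func updateScore(score float64, success bool, pieceCount int) float64 {
	// Hypothetical scaling: per-audit weight shrinks with the log of piece count.
	lambda := 1 - 0.05/math.Log2(float64(pieceCount)+2)
	result := 0.0
	if success {
		result = 1.0
	}
	return lambda*score + (1-lambda)*result
}

func main() {
	small, big := 1.0, 1.0
	for i := 0; i < 5; i++ { // five failed audits in a row
		small = updateScore(small, false, 1_000)
		big = updateScore(big, false, 10_000_000)
	}
	fmt.Printf("small node: %.4f, big node: %.4f\n", small, big)
}
```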

The storage node filesystem check is missing a timeout. If you don’t notice that problem yourself, your storage node would currently get disqualified. [storagenode] The timeout is missing when we check a storage directory · Issue #4567 · storj/storj · GitHub
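
For reference, a minimal sketch of how such a check could be bounded with a context timeout in Go. The file name, path, and timeout value are assumptions for illustration; the real fix is what the linked issue tracks:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// checkStorageDir verifies the storage directory is reachable, but
// returns an error at the context deadline instead of blocking forever
// on a hung disk.
func checkStorageDir(ctx context.Context, dir string) error {
	done := make(chan error, 1)
	go func() {
		// The actual check: stat a marker file inside the storage dir.
		_, err := os.Stat(filepath.Join(dir, "storage-dir-verification"))
		done <- err
	}()
	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return errors.New("storage directory check timed out")
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := checkStorageDir(ctx, "/mnt/storagenode"); err != nil {
		fmt.Println("check failed:", err) // a real node might stop itself here
	}
}
```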

Edit: thepaul is correct that containment mode will slow down disqualification a lot, but the new audit system will handle containment mode in a different way. It will send you more audits to begin with, and it will not pause regular audits in favor of pieces that are in containment mode. As far as I understand, the new system will deal with both side by side. That should get you disqualified a lot faster.

My plan is to test this on the QA satellite. Let’s see how long the new system needs to get me disqualified. My test is from a different perspective: I want to find out whether a node with a short problem, like a 30-minute outage, would survive. Your situation is a bit out of scope, but if it turns out that storage nodes get disqualified within hours, we should increase the priority of the ticket above.

6 Likes

But there are also situations beyond the SNO’s control that can make a system less responsive or even unresponsive: the terrible filewalker after every update, for example, or just recently when you decided to halt GC.
I don’t think nodes should get disqualified faster for situations they cannot control.
And again, as I have said numerous times, we are ordinary people and cannot monitor nodes 24/7.
So from my view, anything that gets me disqualified even faster is a really bad idea.

And I’m also wondering: if a system already has issues keeping up (for whatever reason, be it GC, the filewalker, big up- or downloads, or the SNO just using the hardware for other purposes), are you saying you put even more stress on it by sending more audit requests?

Is it also expected to get tons of audits in a row now? This seems like a problem waiting to happen. (Luckily I just had my new SSD cache installed, so right now I can be sure I perform well.)

A freeze when that happens would be instant bye bye atm.

Edit: Interestingly, I now see canceled audits as well. Is someone abusing the audit feature for different tests again? :worried:

PS: Sorry about the mixed logs for different nodes. I just saw this happen in the custom log tail I monitor and made a quick screen grab. I can provide full normal logs if needed.

Yes. We have another process running that sends a big batch of audits to nodes, with the intention of quantifying how many pieces are lost. The idea is to scan all data relatively quickly. The results are only captured in a log file and not committed back into the database.

Not sure if we could call it abuse, but yeah, it’s us :smiley:

2 Likes

I’m guessing that means no impact on scores atm?

Well, it’s just that people get worried if they see those fail (or get canceled, for that matter). But I know you’re not looking to suddenly disqualify lots of nodes, and you guys know what you are doing. It was quite curious to see, though, haha. Audits have been so rare for a long time that seeing such a huge batch suddenly piqued my interest.

The canceled audits are not according to plan. I will ask the team if that might be a bug that needs to be fixed.

The score will not be impacted.

3 Likes

Just had 157 GET_AUDIT requests in ~33 seconds on one of my nodes, and it handled them without any problems.

2023-01-14T12:55:27.608+0100    INFO    piecestore      download started        {"Process": "storagenode", "Piece ID": "ZXJ2U5CKKM5SBMS7VXLKROVXG627AW5DZSJZDAVOCTAWP2ZRD3ZA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT"}
2023-01-14T12:55:27.651+0100    INFO    piecestore      downloaded      {"Process": "storagenode", "Piece ID": "ZXJ2U5CKKM5SBMS7VXLKROVXG627AW5DZSJZDAVOCTAWP2ZRD3ZA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT", "Size": 1}
2023-01-14T12:55:27.819+0100    INFO    piecestore      download started        {"Process": "storagenode", "Piece ID": "CGD25C5PFH5WSQDZYCDPK6V5HPE37NNEAILLKNC5FR7PRRKNY5PA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT"}
2023-01-14T12:55:27.959+0100    INFO    piecestore      downloaded      {"Process": "storagenode", "Piece ID": "CGD25C5PFH5WSQDZYCDPK6V5HPE37NNEAILLKNC5FR7PRRKNY5PA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT", "Size": 1}
2023-01-14T12:55:28.240+0100    INFO    piecestore      download started        {"Process": "storagenode", "Piece ID": "7G676VPTRT5SCREPWX6VJHXXL56DBXH6VK2H4X4EMON3F4QA7YPQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT"}
2023-01-14T12:55:28.318+0100    INFO    piecestore      downloaded      {"Process": "storagenode", "Piece ID": "7G676VPTRT5SCREPWX6VJHXXL56DBXH6VK2H4X4EMON3F4QA7YPQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT", "Size": 1}
2023-01-14T12:55:28.548+0100    INFO    piecestore      download started        {"Process": "storagenode", "Piece ID": "CSK2HQRKHDG7MZOYODFZLX4ARZQMEWP6CXS4TOCB6O2TWTLVX62A", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT"}
2023-01-14T12:55:28.604+0100    INFO    piecestore      downloaded      {"Process": "storagenode", "Piece ID": "CSK2HQRKHDG7MZOYODFZLX4ARZQMEWP6CXS4TOCB6O2TWTLVX62A", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT", "Size": 1}
2023-01-14T12:55:29.546+0100    INFO    piecestore      download started        {"Process": "storagenode", "Piece ID": "7FFIQKZFDESLLXRC3AT5QFNPQAMA3QOWRLSCKEBHF664LHUEY5YA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT"}
2023-01-14T12:55:29.612+0100    INFO    piecestore      downloaded      {"Process": "storagenode", "Piece ID": "7FFIQKZFDESLLXRC3AT5QFNPQAMA3QOWRLSCKEBHF664LHUEY5YA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT", "Size": 1}
2023-01-14T12:55:29.917+0100    INFO    piecestore      download started        {"Process": "storagenode", "Piece ID": "35Q5ST34KVM7ID4EHHBOHF2EVJ5MZB3IXV3HGO5VFOKRTFHRL67Q", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT"}
2023-01-14T12:55:30.149+0100    INFO    piecestore      downloaded      {"Process": "storagenode", "Piece ID": "35Q5ST34KVM7ID4EHHBOHF2EVJ5MZB3IXV3HGO5VFOKRTFHRL67Q", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT", "Size": 1}
2023-01-14T12:55:30.513+0100    INFO    piecestore      download started        {"Process": "storagenode", "Piece ID": "3QR7KDOSNQK3ZA5JMXEMR7LURSM27DBXMLYIVEVANN3BJBAUH6CA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT"}
2023-01-14T12:55:30.563+0100    INFO    piecestore      downloaded      {"Process": "storagenode", "Piece ID": "3QR7KDOSNQK3ZA5JMXEMR7LURSM27DBXMLYIVEVANN3BJBAUH6CA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT", "Size": 1}
2023-01-14T12:55:31.028+0100    INFO    piecestore      download started        {"Process": "storagenode", "Piece ID": "PT2P6M2DECQABIVNX5LJXU2EZNS735GIQFFZIQ4APFWUMWQMXYGQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT"}
2023-01-14T12:55:31.059+0100    INFO    piecestore      downloaded      {"Process": "storagenode", "Piece ID": "PT2P6M2DECQABIVNX5LJXU2EZNS735GIQFFZIQ4APFWUMWQMXYGQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT", "Size": 1}
2023-01-14T12:55:31.422+0100    INFO    piecestore      download started        {"Process": "storagenode", "Piece ID": "VPRDXB5METEVBULKMUDDHBNUDXZQWT55HEJN6QCI6RDBFCOACGYA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT"}
2023-01-14T12:55:31.499+0100    INFO    piecestore      downloaded      {"Process": "storagenode", "Piece ID": "VPRDXB5METEVBULKMUDDHBNUDXZQWT55HEJN6QCI6RDBFCOACGYA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT", "Size": 1}
[snip]
2023-01-14T12:55:59.061+0100    INFO    piecestore      download started        {"Process": "storagenode", "Piece ID": "Z46QUKB6CR44VGJ4Q7ZWEFFJ2DCZZGD66JGAOU2DOP2JEDSHCWIA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT"}
2023-01-14T12:55:59.101+0100    INFO    piecestore      downloaded      {"Process": "storagenode", "Piece ID": "Z46QUKB6CR44VGJ4Q7ZWEFFJ2DCZZGD66JGAOU2DOP2JEDSHCWIA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT", "Size": 1}
2023-01-14T12:55:59.165+0100    INFO    piecestore      download started        {"Process": "storagenode", "Piece ID": "VFGY236NV22TIKWUOGQDCO7NW24RJXHVKB4BYQUFM3SGCXYRRWTA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT"}
2023-01-14T12:55:59.208+0100    INFO    piecestore      downloaded      {"Process": "storagenode", "Piece ID": "VFGY236NV22TIKWUOGQDCO7NW24RJXHVKB4BYQUFM3SGCXYRRWTA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT", "Size": 1}
2023-01-14T12:55:59.290+0100    INFO    piecestore      download started        {"Process": "storagenode", "Piece ID": "2MZIDZZCLFEGLN4Y33Y2EBLLH47R6JE24M5OVQGDTVDCQOGHY7TQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT"}
2023-01-14T12:55:59.322+0100    INFO    piecestore      downloaded      {"Process": "storagenode", "Piece ID": "2MZIDZZCLFEGLN4Y33Y2EBLLH47R6JE24M5OVQGDTVDCQOGHY7TQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT", "Size": 1}
2023-01-14T12:55:59.439+0100    INFO    piecestore      download started        {"Process": "storagenode", "Piece ID": "E7GOFMQ6EH2OA2KUCPIOO3JXHR53JXKO36H3IVV5IF3MQ7NKL5VA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT"}
2023-01-14T12:55:59.546+0100    INFO    piecestore      downloaded      {"Process": "storagenode", "Piece ID": "E7GOFMQ6EH2OA2KUCPIOO3JXHR53JXKO36H3IVV5IF3MQ7NKL5VA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT", "Size": 1}
2023-01-14T12:55:59.634+0100    INFO    piecestore      download started        {"Process": "storagenode", "Piece ID": "2QSNZGZVSCVUMB4YWHSA2763FR36EYJ4LOIH3XPWXPK5NBGRJWPA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT"}
2023-01-14T12:55:59.672+0100    INFO    piecestore      downloaded      {"Process": "storagenode", "Piece ID": "2QSNZGZVSCVUMB4YWHSA2763FR36EYJ4LOIH3XPWXPK5NBGRJWPA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT", "Size": 1}
2023-01-14T12:55:59.746+0100    INFO    piecestore      download started        {"Process": "storagenode", "Piece ID": "D46DVU4V3MGBKL465AOD26VDM2ED224PR767246KRYRDTOZOURQQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT"}
2023-01-14T12:55:59.775+0100    INFO    piecestore      downloaded      {"Process": "storagenode", "Piece ID": "D46DVU4V3MGBKL465AOD26VDM2ED224PR767246KRYRDTOZOURQQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT", "Size": 1}
2023-01-14T12:55:59.979+0100    INFO    piecestore      download started        {"Process": "storagenode", "Piece ID": "WNGALPC2CNKCI4LSLJPVKXALW2NHBK4HYAOUCNPNTVOGRGOIYTPA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT"}
2023-01-14T12:56:00.021+0100    INFO    piecestore      downloaded      {"Process": "storagenode", "Piece ID": "WNGALPC2CNKCI4LSLJPVKXALW2NHBK4HYAOUCNPNTVOGRGOIYTPA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT", "Size": 1}
2023-01-14T12:56:00.077+0100    INFO    piecestore      download started        {"Process": "storagenode", "Piece ID": "IIJHZ3RTG2BYWVHWA2TJLPF242LLI53VIX3DHN4OO3IQ6KKRXKXA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT"}
2023-01-14T12:56:00.132+0100    INFO    piecestore      downloaded      {"Process": "storagenode", "Piece ID": "IIJHZ3RTG2BYWVHWA2TJLPF242LLI53VIX3DHN4OO3IQ6KKRXKXA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT", "Size": 1}

Th3van.dk

Is this the reason for the new Exists endpoint? I guessed that this was probably what you wanted it for when I saw the commit some time ago.

Yes, that is the corresponding commit. It is used for the segment-verifier. This all goes back to the server-side copy bug in combination with garbage collection that we had last year.

Worst-case situation:
Let’s say we have a segment with 80 pieces. The repair checker will queue up the segment once it drops below 52 pieces (I hope I have that number correct). The repair worker needs to download at least 29 healthy pieces, so 52 - 29 = 23 lost pieces would still be recoverable. Now let’s assume there is a segment with 24 lost pieces: 80 - 24 = 56. Oh no, that segment would get queued up for repair too late. We might still have enough healthy pieces to repair it now, but the normal repair process is not designed to deal with the fallout of this bug. So we search for these segments with the segment-verifier, and we have also implemented a special segment-repair to fix the issue, hopefully before the worst-case segment hits the normal repair queue.
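
The arithmetic from that worst case, as a tiny helper (the helper and names are mine; the thresholds come from the post):

```go
package main

import "fmt"

// lostPieceMargin: how many of the supposedly healthy pieces can turn
// out to be silently lost while the repair worker can still download
// the minimum it needs.
func lostPieceMargin(repairThreshold, minRequired int) int {
	return repairThreshold - minRequired
}

func main() {
	// Repair is queued below 52 pieces; repair needs at least 29.
	fmt.Println(lostPieceMargin(52, 29)) // 23: a segment at the threshold survives
	// With segments seen at only ~40 healthy pieces during heavy GC,
	// the safe margin shrinks:
	fmt.Println(lostPieceMargin(40, 29)) // 11
}
```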

So far the results are looking promising. I will keep my fingers crossed until the segment verifier finishes.

Now, that was just a simple example. In reality we aim for a lower threshold than 23 lost pieces. Some time ago we stressed the storage node network with bigger GC bloom filters, and we have seen segments on the repair queue with just 40-something healthy pieces. If we plug in that number, we had better target 40 - 29 = 11 lost pieces. Looking at it the other way around, there seem to be many segments with 1, sometimes 2, lost pieces; the normal repair process can handle that just fine, as it was designed for that. We will look at the segment-verifier results and draw a line. If the number of segments with 3 lost pieces is low enough, we can just repair them as well for extra safety.

5 Likes