Audit scores dropping on ap1; Storj is working on a fix, SNOs don't need to do anything.
`ERROR piecedeleter could not send delete piece to trash >> Pieces error: v0pieceinfodb: sql: no rows in result set`
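For context on that log line, below is a minimal Go sketch (an illustration only, not the actual storagenode code) of how a delete path can surface `sql: no rows in result set`: the deleter looks the piece up in a legacy v0 piece-info database before moving it to trash, and if no row was ever recorded there, Go's `database/sql` returns `sql.ErrNoRows`, which ends up in the log wrapped as a `v0pieceinfodb` error. The table name, schema, and SQLite driver here are assumptions.

```go
// Hypothetical sketch of the failure mode behind the log line above;
// not the real storagenode code. The schema and driver are assumed.
package main

import (
	"database/sql"
	"errors"
	"fmt"

	_ "modernc.org/sqlite" // assumption: any database/sql driver would do
)

// trashPiece looks up the piece in a v0-style piece-info table before
// moving it to trash. A missing row surfaces as sql.ErrNoRows.
func trashPiece(db *sql.DB, pieceID string) error {
	var size int64
	err := db.QueryRow(`SELECT piece_size FROM pieceinfo WHERE piece_id = ?`, pieceID).Scan(&size)
	if errors.Is(err, sql.ErrNoRows) {
		// No v0 record for this piece: report it and let the caller log and move on.
		return fmt.Errorf("v0pieceinfodb: %w", err)
	}
	if err != nil {
		return err
	}
	// ... move the on-disk piece file into the trash directory here ...
	return nil
}

func main() {
	db, err := sql.Open("sqlite", ":memory:")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	if _, err := db.Exec(`CREATE TABLE pieceinfo (piece_id TEXT PRIMARY KEY, piece_size INTEGER)`); err != nil {
		panic(err)
	}

	// The piece was never recorded in the v0 table, so the lookup fails and
	// the caller prints a line in the same shape as the one quoted above.
	if err := trashPiece(db, "missing-piece"); err != nil {
		fmt.Println("ERROR piecedeleter could not send delete piece to trash:", err)
	}
}
```

In the sketch the failure is harmless noise (the lookup just finds no record), which matches how the thread initially treated the real log line: as not a big deal.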

I’m not sure why you would be angry either. I’m happy that Storj had already found the problem before the community did. When I first posted about it, it wasn’t really a big deal, just noisy logs. I didn’t realize this could put nodes at risk at the time, and chances are, neither did Storj Labs. By the time we realized it was affecting audits, it was midnight in the US. And it still seemed like an incidental thing. Things moved fast from there and yes, I did contact support and sent PMs on the forum to a few people to point their attention to it. Even though it was a bad time for them, we got a response that the team was working on it within 8 hours. They prevented SNOs being hurt by this and took care of it. It may have felt like a long time because some of us feared for our nodes, but it really wasn’t all that long.

1 Like

I think I want the Team to be part of the community and post here themselves
Hmmm.
I’m a snowflake.

4 Likes

Well, I’m with you there. It would have been nice to get a heads-up. But as I said, at the time they likely didn’t know yet that it could hit node reputation, and the devs wanted to focus on fixing it instead. I don’t really blame them, but it’s worth mentioning that a heads-up to the community on anything that could hit audits would be nice, as a lot of us try to take the best care of our nodes and see any audit failure as an absolute no-no.

2 Likes

The US has at least 3 time zones that I am aware of.

Because telling us would have avoided a lot of angst in the SNO community that cares about their nodes - just read @SGC’s reaction to what was happening to his nodes and you might have some idea why a heads-up would have been valuable.

1 Like

@BrightSilence guessed how it was processed, and he is 100% right.
I published the information as soon as I got it.
I could not move faster, sorry.
The work had to be done first to avoid mass DQ. I believe you will agree on that.
My part was to update the thread with information, without guessing and speculation.

1 Like

I initially reported that my nodes were starting to fail audits on the 21st. I know people insist that these are two unrelated issues on us2 and ap1, but I’ve yet to see audit errors randomly appear and then two satellites begin producing audit errors within 24 hours of one another.

I would very much be interested in knowing what actually happened… from the propagation I can only assume that the satellite code was flawed, and after being pushed to two satellites the proverbial manure hit the fan.

And writing this off as not a big deal is also underestimating it: my worst-hit ap1 node dropped from 100% to 75% in audit score in somewhere between 6 and 12 hours.
It stopped there, but for a while it was dropping just about every time I refreshed the dashboard to check on it…
If this was the same across many nodes, if the issue had gone unnoticed, and if it really did keep getting worse over time, as it seemed to from my perspective, then the network was only a couple of days away from DQing most nodes on ap1, at least without direct intervention.
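To give a feel for how quickly a run of failed audits erodes the score described above, here is a rough Go sketch of an alpha/beta reputation model of the kind Storj has described for audit scoring. The forgetting factor, audit weight, and disqualification threshold below are illustrative assumptions, not the satellite's actual configuration.

```go
// Illustrative alpha/beta audit-score model; the parameters are assumptions,
// not the satellite's real configuration.
package main

import "fmt"

const (
	lambda      = 0.95 // assumed forgetting factor: how quickly old audits fade
	weight      = 1.0  // assumed weight of each new audit result
	dqThreshold = 0.6  // assumed disqualification threshold
)

func main() {
	// A long-standing healthy node starts with a score of ~1.0.
	alpha, beta := 20.0, 0.0

	for i := 1; i <= 30; i++ {
		// Every audit in this window fails, e.g. because the audited pieces
		// belong to data that is being deleted at the same time.
		alpha = lambda * alpha
		beta = lambda*beta + weight

		score := alpha / (alpha + beta)
		fmt.Printf("failed audit %2d: score %.3f\n", i, score)
		if score < dqThreshold {
			fmt.Println("a node would be disqualified around here")
			break
		}
	}
}
```

With these assumed parameters the score falls from 1.00 to about 0.74 after six consecutive failures and crosses the assumed DQ threshold after roughly ten, so a sustained stream of failed audits really can take a node from healthy to disqualified within days or less.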

1 Like

And again, I’m confirming that. These are two different, unrelated issues. They are hardly reproducible in real life under normal operations.

They are independent, and it is good that they were discovered before any serious impact.
Speculation and guessing are not needed, thank you. When I have this information, I will update the thread. I have no details, so I will not post my own guesses and speculation either.

1 Like

So why write this then, Alexey? That particular post can be taken no other way than as a failure to communicate.

Right, we’re being pedantic. OK, I’ll bite. Yes, it was 10pm through midnight in the US. For Storj Labs HQ in Atlanta it was midnight.

Yeah, cause they are. If you’re going to start your argument with a false premise, we’re not going to be able to have a fruitful conversation.

Again, starting with a false premise. Initially it may have been seen as not a big deal. I mean, I literally said myself that I thought it wasn’t, back when we thought it was just some delete errors in the log. But from what I could tell, work was being done over the weekend to ensure nodes wouldn’t get disqualified. That doesn’t feel like it was being downplayed by Storj Labs. And I can’t fault them for underestimating the issue at first; I literally did that myself in this topic.

I don’t see much use in playing the blame game other than to clearly state that SNOs were anxious about this and to ask for a heads up in the future if any issue might hit audits. I’m sure we’ll hear more when more is known about the root cause.

And @SGC, you seem to have been unlucky with the one node that dropped to 75%; I’ve not seen others report numbers that low. I’ve also kept an eye on the number of disqualified nodes during that day. If there were additional DQs because of this, it wasn’t an amount that was easily distinguished from normal DQ levels. And I’m sure Storj Labs has since corrected those.

Because at the time it wasn’t clear that this was anything more than a small issue with some errors in the log. They probably hadn’t seen it hitting audits, just like we didn’t at first in this thread.

2 Likes

Which means that the decision was then made not to watch for the fault appearing in SNOs’ dashboards “until Monday”. I stand by my comment then.

Storj hasn’t communicated how they work internally, to my knowledge. They have staff all over the world in many different time zones. I’d also make the point that their customers are all around the world too. I’m curious - does the Storj support terms of service say you will only get support 8am-5pm Central/Eastern/whatever time zone suits? Yes, individuals can only work so many hours, and work-life balance is incredibly important. Having worked project life myself, I know the toll it can take on your health. On my last project, the top manager on the Owner’s team had to retire permanently due to a stroke from the stress.

It doesn’t mean that at all since they prevented scores from dropping over the weekend. And we were told before then that our nodes would be taken care of and no action was required from our end.

How in the hell are those two events linked?

I don’t even know what two events you are referring to, but you’re clearly not going to be satisfied, so I’m ending this conversation on my end. But do know that if I can’t make sense of what you would have liked them to do differently, they probably can’t either.

2 Likes

Because the Community asked when it was discovered

Yes, it’s my failure, I accept that. Thank you for the reminder.
I decided not to fuel guessing and speculation, and also to avoid panic like

before I had all the information and before the problem had been investigated, so that the issue could be fixed and nodes would not be disqualified.

You should be on the Storj payroll, that’s for sure.

  1. When the events were noted (per Alexey’s message, they knew before the first post to the forum), some extra alerts could have been put in place so that if the problem started to cause issues, they would know as well.
  2. Once those alerts were triggered, the SNO community could have been informed of the status. It is the difference between Storj being reactive and Storj being proactive.

It appears at the moment that @Alexey is a critical linchpin for Storj and they are highly dependent on him.

It did not affect audits at first. I’ll stop this discussion and speculation for now.
Again, it’s the weekend, and nothing more will be added for now.

Could that not all have been avoided if you had simply made an early post to say “we have a possible issue with xyz, we are looking at it, don’t worry”, and then added the same as you did before about no action being needed from the SNO side, etc.?

I can only cite myself

1 Like

We haven’t found the root cause yet, but I’ll share what we have:
• on AP1 we started deletion of a bucket with a massive number of segments
• during that time we found out that the audit failure rate was going up very fast for AP1
• after some research we discovered that the pieces causing the failures were from the data set we were deleting at that moment
• we stopped the deletion
• we stopped audits and started to work on restoring SN reputation
• at the same time we decided to continue deleting this large bucket to the end
• when the bucket was deleted and we had deployed an additional fix to avoid failing audits in some edge cases, we re-enabled audits on AP1
• so far the situation looks OK

We tried to figure out the cause of this situation, and we found two issues related to the bucket deletion code, but we cannot find an exact link to the issue with audits. Because of that, we cannot say the problem is fixed, and we will still be investigating the issue.
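To make the race described above concrete, here is a hypothetical Go sketch (not the fix Storj actually deployed; those details were not shared) of how an audit can collide with a bucket deletion, and one possible shape of an edge-case guard: re-checking whether the segment still exists before counting a failed audit against the node. Every type and function name here is invented for illustration.

```go
// Hypothetical illustration, not Storj's actual code or fix: a segment is
// picked for audit, the bucket deletion removes it (and tells nodes to trash
// their pieces) before the verifier gets an answer, and the node can no
// longer serve the piece. Re-checking the metainfo before recording a
// failure turns that race into a no-op.
package main

import (
	"errors"
	"fmt"
)

var errSegmentNotFound = errors.New("segment not found")

// Metainfo is a stand-in for the satellite's segment metadata store.
type Metainfo map[string]bool

func (m Metainfo) Exists(segmentID string) bool { return m[segmentID] }

// auditSegment pretends to ask a node for the audited stripe; here it always
// fails, as it would if the node had already moved the piece to trash.
func auditSegment(segmentID string) error {
	return fmt.Errorf("piece for %s: %w", segmentID, errSegmentNotFound)
}

// verify records an audit failure only if the segment still exists when the
// audit result comes back; segments deleted mid-audit are skipped.
func verify(meta Metainfo, segmentID string) (penalize bool) {
	if err := auditSegment(segmentID); err != nil {
		if !meta.Exists(segmentID) {
			fmt.Printf("segment %s was deleted mid-audit; not counted\n", segmentID)
			return false
		}
		fmt.Printf("segment %s genuinely missing on node; counted as failure\n", segmentID)
		return true
	}
	return false
}

func main() {
	meta := Metainfo{"seg-1": true} // seg-2 was removed by the bucket deletion
	verify(meta, "seg-2")           // race with deletion: no penalty
	verify(meta, "seg-1")           // real failure: penalty
}
```

The point of the sketch is only the ordering: a segment chosen for audit can be deleted, and its pieces trashed on nodes, before the verifier gets an answer, so without some guard a node ends up penalized for correctly following a delete instruction.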

15 Likes