Disqualified after 2 hours of [edit] failed audits?

Hacker · July 1, 2020, 12:49pm

Seems I’ve been disqualified after two hours of downtime (hard drive disconnected for some reason) on all satellites except two. There goes my node (1QDccTJSqxPa5skjxoVYugqr79f7JAveF255y688QpZEBhT5Dt) of 8 TB after 9 months, just before payout of what was held back. Seems really harsh for 2 hours downtime to lose the payouts that have been held back during their respective periods over the last 9 months.

Any reason why the node wasn’t suspended but immediately disqualified? It’s not like I didn’t correct the problem as soon as I found out. However, if the first thing I see is that I’ve been disqualified, it’s quite demotivating.

EDIT: On Windows 10 using the native installer (no Docker).

anon27637763 · July 1, 2020, 1:22pm

Node running but hard drive disconnected is not “downtime” …

During those 2 hours, the data pieces your node could not serve were being repaired using other nodes. That’s where some of your held-back funds went…

It might be useful to describe what went wrong so that other SNOs can learn from your experience.

Was the drive connected via USB?
Did an automatic Windows Update disconnect the drive?

Hacker · July 1, 2020, 1:47pm

OK, what do I then call the situation when the node is online but the data is not available?

It is an external Seagate USB drive. I had this happen with a WD drive on a different PC before: sometimes it simply disconnects (Windows does not see it anymore, as if the USB cable had been pulled out). A restart of the PC does not help; one has to physically unplug and replug the drive’s power cable, and then it works normally again. It was the first time it happened with the Seagate drive, but based on my experience with the WD drive I knew how to fix it (once I became aware of the problem). Too late for my node though.

I am not complaining about losing “some” of the held-back funds, but losing half of my total funds because of two hours… well, as I said, it’s demotivating.

EDIT: This basically means that 8 TB of data, which is available again, now has to be rebuilt from different sources. Does not seem like the optimal solution.

anon27637763 · July 1, 2020, 1:52pm

This is called “failed audits” within the Storj network. And once a node’s audit score falls below 60, the node is disqualified.

This is likely a hardware issue with the USB controller and/or drivers.

Hacker · July 1, 2020, 1:57pm

Well, it’s a randomly occurring problem which I have no way of fixing permanently, but once I am aware of it I can fix it in under a minute. Some kind of “audit failed, please check your node” warning would have been great before the DQ. What happened to suspending a node? Is that still a thing?

anon27637763 · July 1, 2020, 2:06pm

Problem One:

Your node was running in such a way that the Storj network considered all 9 months of your data pieces completely gone. Every audit check failed during those 2 hours. The DQ is very quick in order to protect the network and ensure the paying customers are getting the level of service they pay for…
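To illustrate why an unbroken run of failed audits can end in disqualification so quickly, here is a toy model in Python. It is only a sketch: the update rule (an exponentially weighted score) and the forgetting factor of 0.95 are assumptions chosen for illustration, not values stated in this thread or confirmed Storj parameters. Only the 0.6 threshold is taken from the discussion here.

```python
def audits_until_disqualified(forgetting_factor=0.95, threshold=0.6, start_score=1.0):
    """Count consecutive failed audits until the score drops below the threshold.

    Toy model (an assumption, not the confirmed Storj formula):
        score <- forgetting_factor * score + (1 - forgetting_factor) * result
    where result is 1.0 for a passed audit and 0.0 for a failed one.
    """
    if not 0.0 < forgetting_factor < 1.0:
        raise ValueError("forgetting_factor must be between 0 and 1")
    if threshold <= 0.0:
        raise ValueError("threshold must be positive")

    score = start_score
    failures = 0
    while score >= threshold:
        # A failed audit contributes 0, so only the decayed old score remains.
        score = forgetting_factor * score
        failures += 1
    return failures, score


failures, score = audits_until_disqualified()
print(f"{failures} consecutive failures -> score {score:.4f}")
# prints "10 consecutive failures -> score 0.5987"
```

Under these assumed numbers, a node with a perfect score needs only ten failed audits in a row to fall below the threshold, and passed audits in between are the only thing that slows the decline. How many audits actually arrive in two hours depends on the node and the satellites, which this sketch does not model.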

A node which completely fails all audits becomes untrustworthy from the network’s perspective. The only method to restore full trust would be a full audit of all the data stored on the node. Neither the SNO nor the Satellite would benefit financially from a full node audit.

Therefore, the DQ is necessary and is also necessarily permanent.

There are numerous other threads related to the DQ vs. Suspend issue and the full node audit financial analysis.

Problem Two:

Using USB connected devices with a known unstable USB port may not be advisable for SNOs.

Hacker · July 1, 2020, 2:15pm

I understand that reasoning, however, I don’t think it’s the optimal solution to a problem such as presented.

Well, if we were talking about a known problem with a USB port, you’d be right. However, since it was a first-time failure where we both can only guess at the reason, there is not much that other SNOs can learn from my experience, except perhaps to build a mechanism that independently checks the health of your node, because you might get DQ’d without warning.

EDIT: When talking about a randomly occurring problem, I should probably make clear that I am talking about the WD drive attached to my other PC, where such an issue has already occurred a few times. On my Storj node PC it is the first time it happened, so it wasn’t predictable in any way.

anon27637763 · July 1, 2020, 2:21pm

MS Windows seems to have some issues related to stability of USB ports, as per some randomly found Windows forum post

In a prior post, you indicated that your USB port showed some stability issues. So, I would suggest to other SNOs that if a USB port has stability issues, it might not be a good idea to run a node using that port.

I’m not sure how to do that on MS Windows. However, on GNU/Linux one could build a USB port checker using some creative udev rule

Hacker · July 1, 2020, 2:32pm

Please see my edit, I was referring to my other PC with the other (WD) drive.

Well, one could periodically check if the drive (or Storj folder) was available and if not, use some tool to send a notification email.
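A minimal Python sketch of such a periodic check might look like the following. Everything named here is hypothetical: the function names, the `D:\storagenode` path in the usage comment, and the `notify` callback, which stands in for whatever tool actually sends the email. It only detects that the folder has become unreachable; it does not stop the node.

```python
import os
import time


def storage_available(path):
    """Return True if `path` is a directory whose contents can be listed."""
    try:
        # scandir raises OSError if the drive or folder has disappeared.
        with os.scandir(path) as entries:
            next(entries, None)  # touch the first entry only, if any
    except OSError:
        return False
    return True


def watch(path, notify, interval_seconds=60, max_checks=None, sleep=time.sleep):
    """Poll `path` and call notify(message) each time it becomes unavailable.

    The notification fires once per outage (on the available -> unavailable
    transition), not on every failed check. `max_checks=None` polls forever.
    """
    was_available = True
    checks = 0
    while max_checks is None or checks < max_checks:
        available = storage_available(path)
        if was_available and not available:
            notify(f"Storage path not accessible: {path}")
        was_available = available
        checks += 1
        sleep(interval_seconds)


# Example usage (hypothetical path; replace print with an email sender):
# watch(r"D:\storagenode", notify=print, interval_seconds=60)
```

The check lists the directory instead of only testing that it exists, since a stale mount point or drive letter can still appear to exist after the device is gone. Polling once a minute is an arbitrary choice; a shorter interval narrows the window in which audits can fail unnoticed.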

Anyways, since I cannot really do anything about my node anymore, I’d at least really like to see some warning mechanism built into Storj before a node gets DQ’d. Again, what happened to suspending a node? Is that still a thing?

anon27637763 · July 1, 2020, 2:54pm

As far as I know, Suspension is only used in situations where the Satellite still has some level of visibility into the problem. The error message on the Satellite side determines the type of failure. In your case, all audits would come back with a message saying the file was missing. Furthermore, no new data could be written to the node because the storage space was disconnected.

Hacker · July 1, 2020, 3:08pm

Yeah well, OK, I wish such a situation would be taken into account. For my selfish sake, yes, but also so that SNOs are given a chance to remediate the situation before the network (satellites?) decides all the data has to be rebuilt from other sources.

Anyways, will have to build my own health checks, I guess, until some health check mechanism is provided by Storj.

BrightSilence · July 1, 2020, 3:09pm

Suspension is basically for anything other than missing, inaccessible or corrupt files. This would include things like timeouts, database errors, etc. Storj calls these “unknown audit failures” and there is a related hidden “unknown audit score” (which is a horrible name) that is used for suspension. This is completely separate from the hard failures that impact the displayed audit score on the dashboard.

SGC · July 1, 2020, 3:19pm

I think you should take it up with the Storjlings… if you got DQ’d after only 2 hours of issues, that is unacceptable when you can get the node operational again with no real data loss or corruption…

Of course you will lose some data, because the satellites start repairing while your node is inaccessible…

nerdatwork · July 1, 2020, 3:22pm

What would you have called it?

You can file a support ticket and give your node id and other details to get the reason behind audit failures.

Hacker · July 1, 2020, 3:26pm

Appreciated, but I know the reason - the USB hard drive disconnected from the PC for a reason unknown to me, so the node was up but the data was not accessible. I wish I had received some kind of warning, since it was an easy fix (under a minute) once I knew there was a problem.

xyphos10 · July 1, 2020, 3:31pm

@Hacker you can use the PowerShell script below to monitor for USB disconnects. When it detects a removal event for the drive, it stops the storagenode service. All you have to do is replace the 'D:' in $driveLetter -eq 'D:' with the letter of your USB drive.

I know it is not a solution, but it helps prevent those audit errors. If you combine this with UptimeRobot monitoring, you should get notified when the storagenode goes offline.

# Subscribe to WMI volume change events (drive arrival/removal).
Register-WmiEvent -Class Win32_VolumeChangeEvent -SourceIdentifier volumeChange
Write-Host (Get-Date -Format s) " Beginning script..."
try {
    do {
        # Block until the next volume change event arrives.
        $newEvent = Wait-Event -SourceIdentifier volumeChange
        $eventType = $newEvent.SourceEventArgs.NewEvent.EventType
        $eventTypeName = switch ($eventType)
        {
            1 {"Configuration changed"}
            2 {"Device arrival"}
            3 {"Device removal"}
            4 {"docking"}
        }
        Write-Host (Get-Date -Format s) " Event detected = " $eventTypeName
        if ($eventType -eq 3)
        {
            $driveLetter = $newEvent.SourceEventArgs.NewEvent.DriveName
            #$driveLabel = ([wmi]"Win32_LogicalDisk='$driveLetter'").VolumeName
            Write-Host (Get-Date -Format s) " Drive name = " $driveLetter
            #Write-Host (Get-Date -Format s) " Drive label = " $driveLabel
            # Stop the node only if the removed drive is the storage drive.
            if ($driveLetter -eq 'D:')
            {
                Write-Host (Get-Date -Format s) " Starting task in 3 seconds..."
                Start-Sleep -Seconds 3
                Stop-Service storagenode
            }
        }
        # Clear the handled event from the queue before waiting again.
        Remove-Event -SourceIdentifier volumeChange
    } while ($true) # Loop until the script is interrupted
}
finally {
    # Runs when the script is interrupted (e.g. Ctrl+C), so the subscription does not linger.
    Unregister-Event -SourceIdentifier volumeChange
}

BrightSilence · July 1, 2020, 3:41pm

Trying to figure that out right now. I’m currently updating the earnings calculator to include it and am looking for a clear, short name for it.

I’m considering:

suspension (score)
audit suspension (score)

Maybe calling the other score “audit disqualification score”. I’ve also considered changing the score display altogether and displaying how close you are to disqualification or suspension. Something like (AuditDQ:0.0% AuditSusp:0.0%). Calculation would be 1-((score-0.6)/0.4). The upside is that this would not require the end user to know that 0.6 (or 60 or 600) is the threshold; 100% simply means disqualified/suspended. The downside is that it’s completely different from what Storj does on the web dashboard.
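As a worked example of that calculation, here is a small Python sketch. The function names are made up for illustration and are not taken from the earnings calculator; the 0.4 in the formula is written as 1 - threshold so that the threshold appears only once. Clamping the result to the 0–100% range is an addition of this sketch, not part of the formula above.

```python
def closeness_to_threshold(score, threshold=0.6):
    """Return how close a score is to the threshold, from 0.0 (perfect) to 1.0.

    Implements 1 - ((score - 0.6) / 0.4) for scores on a 0..1 scale.
    """
    closeness = 1.0 - (score - threshold) / (1.0 - threshold)
    # Clamp so scores below the threshold still read as 100% (sketch-only choice).
    return max(0.0, min(1.0, closeness))


def format_scores(audit_score, suspension_score):
    """Format both scores in the proposed (AuditDQ:x% AuditSusp:y%) style."""
    dq = closeness_to_threshold(audit_score) * 100
    susp = closeness_to_threshold(suspension_score) * 100
    return f"(AuditDQ:{dq:.1f}% AuditSusp:{susp:.1f}%)"


print(format_scores(1.0, 1.0))  # prints "(AuditDQ:0.0% AuditSusp:0.0%)"
print(format_scores(0.8, 0.6))  # prints "(AuditDQ:50.0% AuditSusp:100.0%)"
```

A score of 0.8 sits halfway between a perfect 1.0 and the 0.6 threshold, so it shows as 50%. Scores reported on a 0..100 or 0..1000 scale would need to be divided down to 0..1 first.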

So yeah, I’m torn. But unknown audit score sounds like a score that is unknown and doesn’t imply that it would cause suspension.

Hacker · July 1, 2020, 3:43pm

Thank you very much. It would be useful to have such simple monitoring built in.

nerdatwork · July 1, 2020, 3:57pm

You could also save your identity and data on the same USB drive, so that when the identity cannot be loaded the node won’t come online.

nerdatwork · July 1, 2020, 4:00pm

I would keep it simple and call it ^

Natalie Imbruglia knows your pain