Disqualified after 2 hours of failed audits?

This is a nice and simple solution. I usually over-complicate everything.

All credit goes to @Alexey

Welcome to the club and yes, there is a club for everything 🙂

1 Like

It’s a nice solution and I use it, but I don’t think it will help if the node is already running. By that time the identity is in memory and it probably doesn’t matter that the files are not reachable. I could be wrong though.

2 Likes

It depends on how the software is written. I haven’t looked at it too deeply. If it were a web server, this would be true… however, it could be that each connection (or a connection after some timeout limit) requires the local node to read the private key from the drive.

It looks to me that, from the SNO’s perspective, the moment the storage node reports an audit failure it should go offline to prevent more failed audits? Or maybe even drop the connection in the hope that the satellite will assume it’s not a failed audit, but a lack of connectivity.

To be honest I wonder why the storage node even knows that a given download request is an audit.

1 Like

That would only help if the identity file is being checked / loaded often.

That’s actually a good idea, especially for external drives and such. Even though I do not use an external drive for this, shutting down the node after, say, two audit failures in a row may be useful to figure out the reason for those failures before restarting the node.

That looks like cheating.

I’m going to bake monitoring of errors and the audit score into my storagenode log export script for docker. It will take a bit though… I’m not that proficient in bash scripting yet.
I’ve got some pretty good ideas and options for testing to make sure it would work in 99% of all cases…

At least for linux / docker SNOs.

In my opinion (especially if you use USB-connected nodes) it would be best to just close the port forwarding on the router after 3 failed audits per day.
When USB drives disconnect, it often leaves the node in a waiting state with high iowait, but it might still react to incoming requests, as in this case. And you might not be able to kill it (at least on docker I was unable to do so). So closing the port would effectively bring the node offline instead of failing audits, resulting only in suspension (once downtime tracking is active) instead of DQ.
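
For linux SNOs who can’t script the router itself, a local firewall rule achieves the same effect. A minimal sketch, assuming a docker node named storagenode, the default node port 28967, and that failed audits show up in the log as lines containing GET_AUDIT and failed (the container name, log pattern, and threshold are all assumptions, not tested advice):

#!/bin/bash
# Count today's failed audits in the container log (storagenode log lines carry ISO dates).
FAILED=$(docker logs storagenode 2>&1 | grep "$(date +%Y-%m-%d)" | grep -c 'GET_AUDIT.*failed')
if [ "$FAILED" -ge 3 ]; then
    # Block the node port so the satellite sees the node as offline rather than failing audits.
    # Undo later with: iptables -D INPUT -p tcp --dport 28967 -j DROP
    iptables -I INPUT -p tcp --dport 28967 -j DROP
fi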

1 Like

Indeed, it does to me too. But given that SNOs aren’t assumed to be altruistic, this case has to be taken into account by the satellite as, well, expected.

It’s only cheating if you are trying to mask real data loss. If you just try to prevent your node from getting wrongfully DQed because you have an unreliable HDD mount, it’s not cheating anymore.

1 Like

I don’t think the cheating label is very useful in discussions, but I would consider going offline just before an audit that would otherwise be reported as “file does not exist” at least a misrepresentation. It would be different if a node goes offline after a failed audit to prevent more of them.

Though even if the node misrepresents missing data as offline, it would quickly get suspended and after a while disqualified. That last step isn’t in place yet, but I’m sure it will be, to make sure that such misrepresentations aren’t a way to get away with missing data.

I’ve hacked together a short AutoHotkey script that monitors the storage folder every 60 seconds and, if it is unreachable, stops the Storj service:

#Persistent

StorjStoragePath := "D:\Storj\"

; Check if we run as Admin, if not, restart script requesting Admin privileges
; (to be able to stop Storj service)
FullCommandLine := DllCall("GetCommandLine", "str")
If NOT (A_IsAdmin OR RegExMatch(FullCommandLine, " /restart(?!\S)"))
{
	Try
	{
		If A_IsCompiled
			Run *RunAs "%A_ScriptFullPath%" /restart
		Else
			Run *RunAs "%A_AhkPath%" /restart "%A_ScriptFullPath%"
	}
	ExitApp
}

; Set a timer to check the storage path every 60 seconds
SetTimer, CheckStoragePath, 60000

CheckStoragePath:
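	; FileExist returns an empty string when no files match the wildcard,
	; i.e. when the drive has vanished or the path is unreachable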
	If !FileExist(StorjStoragePath . "*.*")
		Run, net stop "Storj V3 Storage Node", , Hide
Return

A while ago I wrote a one-line solution for linux that simply stops the node after a single audit failure.

I don’t use this myself, but I technically could. I’ve never had a single audit failure, so I’m not too worried about it. As I wrote there, this would be overly aggressive as it basically has 0 tolerance of any audit failures. So don’t use it unless you know what you’re doing.
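
For reference, a sketch of what such a watcher could look like, assuming a docker node named storagenode and the GET_AUDIT … failed log pattern; this is not the exact one-liner from the linked post:

# grep -q exits on the first matching line, which ends the pipeline
# and lets the stop command run; -t 300 gives the node time to shut down cleanly.
docker logs --tail 0 --follow storagenode 2>&1 \
  | grep -q 'GET_AUDIT.*failed' \
  && docker stop -t 300 storagenode

Note that docker logs only notices the broken pipe on its next write, so a node that has gone completely silent would keep the pipeline waiting.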

1 Like

I had a few due to the locked database.

Right now, if the difference between totalCount and successCount in the API changes, zabbix sends me an SMS. Though I guess that may be too slow; I should change it to use the log instead of the API.
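
A sketch of that kind of check, assuming the node dashboard API is on the default port 14002; the exact endpoint path and JSON layout are assumptions (they have changed between versions), and SAT_ID is a placeholder:

SAT_ID="<satellite-id>"  # placeholder, one entry per satellite
# Difference between attempted and passed audits for that satellite.
DIFF=$(curl -s "http://localhost:14002/api/sno/satellite/$SAT_ID" \
  | jq '.audit.totalCount - .audit.successCount')
[ "$DIFF" -gt 0 ] && echo "audit failures detected: $DIFF"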

Yeah, the suggestion I posted uses a tail on the log, so it’s basically instant. The downside is that if the log is not available or updated for some reason, it would stop working, so keep that in mind. In that case you probably don’t want the log on the same HDD as the data, as that would defeat the purpose.

The current DQ implementation feels like a ticking time bomb. I don’t think it’s set up this way because it’s the optimal way to handle trust, as some people suggest. I think Storj just hasn’t gotten around to programming a more reasonable system yet. The fact that users are independently augmenting their nodes with scripts to prevent this aggressive DQ also suggests that there is room for improvement.

I am going to look into implementing one of the previously mentioned commands/scripts to knock my node offline in case of audit failure. It would be more convenient if the satellite would just suspend me, as so many have already suggested, but until then this will have to do. I don’t want years’ worth of customer data to get DQed in 2 hours because a USB cable was bumped.

2 Likes

It probably should be a suspension and not a DQ. Node fails too many audits - suspended. If the node wants to get un-suspended, it has to pass most of those same audits it failed the first time. This would protect against an unplugged cable etc.

5 Likes

@Hacker are your data files in the root of the USB drive? I seem to remember reading elsewhere that if the data is in a subdir and the drive goes offline, the node shuts down with an error. So it is recommended not to put the node data in the root of a drive, USB or otherwise.
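
For a docker node that means pointing the bind mounts at a subfolder rather than the drive root, along these lines (paths and image tag are just examples, other flags omitted):

# Bind a subfolder, not the drive root: if the drive isn't mounted, the
# subfolder doesn't exist, so the node fails instead of serving an empty mount point.
docker run -d --name storagenode ... \
  --mount type=bind,source=/mnt/usbdrive/storj/identity,destination=/app/identity \
  --mount type=bind,source=/mnt/usbdrive/storj/data,destination=/app/config \
  storjlabs/storagenode:<tag>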

1 Like

It depends on which part you’re talking about. If a node has actually lost files, I think it is purely a trust issue. You need to keep in mind that nodes are completely untrusted entities, and data loss is pretty much the worst thing that can happen on the network, so it needs to be prevented at all costs. A node permanently losing data is a big no-no.

Now the situation where the HDD is unavailable is a little different. If the node can later prove it actually still has all the pieces, that may be a situation worth allowing it to recover from. But it’s a little challenging to implement something like that, as it would mean the satellite has to keep information about all failed audits. I can imagine that that is a matter of priority in engineering effort. Honestly, I probably wouldn’t put it high on the list, because despite the data being eventually recoverable, you’re still doing extra work to keep nodes on the network that appear to have a chance of losing access to data.

Keep in mind that even for the network, it is better for a node to be offline than to lose access to its data. Offline nodes are excluded from node selection and other nodes will be selected for download or upload instead, while nodes that are online but don’t have access to the data could be selected for download, and if enough of them don’t actually deliver the piece, that download fails.

I think the best solution for the data HDD being unavailable is simply for the node itself to crash with a fatal error when that happens. That would just take it offline; SNOs who use uptime robot are immediately informed, even without Storj sending an email about the node being offline. And they can fix the issue without losing their node.

This helps, but it really only helps when you start the node and it doesn’t find the path. It’s also more of a linux thing, as the mount point would still be a valid path if the HDD weren’t mounted. On windows the drive letter would not be available and, even if you use the root folder, the path won’t be valid at that point.
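
On linux you could add a runtime check on top of that, asking whether the path is actually a mount point rather than just a valid directory (node name and mount point are assumptions):

# An unmounted mount point is still a valid, empty directory on linux;
# mountpoint(1) asks the kernel whether a filesystem is really mounted there.
mountpoint -q /mnt/usbdrive || docker stop -t 300 storagenode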

1 Like