I would keep it simple and call it ^
Natalie Imbruglia knows your pain
This is a nice and simple solution. I usually over-complicate everything.
It’s a nice solution and I use it, but I don’t think it will help if the node is already running. By that time the identity is in memory and it probably doesn’t matter that the files are not reachable. I could be wrong though.
It depends on how the software is written; I haven’t looked at it too deeply. If it were a web server, this would be true… however, it could be that each connection (or a connection after some timeout) requires the local node to read the private key from the drive.
It looks to me like, from the SNO’s perspective, the moment the storage node reports an audit failure it should go offline to prevent more failed audits? Or maybe even drop the connection in the hope that the satellite will assume it’s not a failed audit but a lack of connectivity.
To be honest I wonder why the storage node even knows that a given download request is an audit.
That would only help if the identity file is being checked/loaded often.
That’s actually a good idea, especially for external drives and such. Even though I don’t use an external drive for this, shutting the node down after, say, two audit failures in a row would be useful for figuring out the reason for those failed audits before restarting the node.
That looks like cheating.
I’m going to bake monitoring of errors and the audit score into my storagenode log export script for docker. It will take a bit though… I’m not that proficient in bash scripting yet.
I’ve got some pretty good ideas and options for testing to make sure it would work in 99% of all problem cases…
At least for Linux/docker SNOs.
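Something along these lines might be a starting point; a rough sketch, assuming the node’s dashboard API is reachable on localhost:14002 and jq is installed (the field names are assumptions too, so verify them against your node’s actual API output):

#!/bin/bash
# Rough sketch: count recent audit errors in the docker logs and read the
# per-satellite audit scores from the dashboard API.
# Assumptions: dashboard on localhost:14002, jq installed, field names as
# below -- verify against your node version's actual output.

NODE=storagenode

# audit-related errors in the last hour of logs (2>&1 in case the node logs to stderr)
errors=$(docker logs --since 1h "$NODE" 2>&1 | grep -cE '(ERROR|failed).*GET_AUDIT')
echo "audit errors in last hour: $errors"

# audit score per satellite from the dashboard API
curl -s http://localhost:14002/api/sno/satellites |
  jq -r '.audits[] | "\(.satelliteName): audit score \(.auditScore)"'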
In my opinion (especially if you use USB-connected nodes) it would be best to just close the port forwarding on the router after 3 failed audits per day.
When a USB drive disconnects, it often leaves the node in a waiting state with high iowait, but the node might still react to incoming requests, as in this case. You might not be able to kill it either (at least on docker I was unable to). So closing the port would effectively bring the node offline instead of letting it fail audits, resulting only in suspension (once downtime tracking is active) instead of DQ.
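Most routers can’t easily be scripted, but a local firewall rule has the same effect; a rough sketch, assuming iptables, the default node port 28967, and docker logs (all of these are assumptions, adapt them to your setup):

#!/bin/bash
# Sketch: if too many audits failed in the last 24h, drop traffic on the
# node port so the satellite sees the node as offline (suspension) rather
# than failing further audits (DQ). Needs root for iptables.
# Assumptions: default port 28967, docker logs, iptables available.

NODE=storagenode
PORT=28967
LIMIT=3

failed=$(docker logs --since 24h "$NODE" 2>&1 | grep -cE '(ERROR|failed).*GET_AUDIT')

if [ "$failed" -ge "$LIMIT" ]; then
  iptables -I INPUT -p tcp --dport "$PORT" -j DROP
fi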
Indeed, it does to me too. But given that SNOs aren’t assumed to be altruistic, the satellite has to take this case into account as, well, expected behavior.
It’s only cheating if you are trying to mask real data loss. If you just try to prevent your node from getting wrongfully DQed because you have an unreliable HDD mount, it’s not cheating anymore.
I don’t think the cheating label is very useful in discussions, but I would consider going offline before an audit can be reported as “file does not exist” to be at least a misrepresentation. It would be different if a node goes offline after a failed audit to prevent more of them.
Though even if the node misrepresents missing data as being offline, it would quickly get suspended and after a while disqualified. That last step isn’t in place yet, but I’m sure it’s planned to make sure such misrepresentation isn’t a way to get away with missing data.
I’ve hacked together a short AutoHotkey script that monitors the storage folder every 60 seconds and if it is unreachable it stops the Storj service:
#Persistent
StorjStoragePath := "D:\Storj\"

; Check if we run as Admin; if not, restart the script requesting Admin
; privileges (to be able to stop the Storj service)
FullCommandLine := DllCall("GetCommandLine", "str")
If NOT (A_IsAdmin OR RegExMatch(FullCommandLine, " /restart(?!\S)"))
{
    Try
    {
        If A_IsCompiled
            Run *RunAs "%A_ScriptFullPath%" /restart
        Else
            Run *RunAs "%A_AhkPath%" /restart "%A_ScriptFullPath%"
    }
    ExitApp
}

; Set a timer to check the storage path every 60 seconds; execution then
; falls through to the label below for an immediate first check
SetTimer, CheckStoragePath, 60000

CheckStoragePath:
    ; If nothing under the storage path is reachable, stop the node service
    If !FileExist(StorjStoragePath . "*.*")
        Run, net stop "Storj V3 Storage Node", , Hide
Return
A while ago I wrote a one-line solution for Linux that simply stops the node after a single audit failure.
For logs written to a file:

tail -f /volume1/storj/v3/data/node.log | awk '/(ERROR|canceled|failed).*GET_AUDIT/ {system ("docker stop -t 300 storagenode")}'

For logs in docker:

docker logs -f --tail 20 storagenode | awk '/(ERROR|canceled|failed).*GET_AUDIT/ {system ("docker stop -t 300 storagenode")}'

Please be aware I haven’t tested the docker version, but it should work. This stops your node if it encounters a single audit failure, which might be overly aggressive as it would also kill your …
I don’t use this myself, but I technically could. I’ve never had a single audit failure, so I’m not too worried about it. As I wrote there, this would be overly aggressive as it basically has zero tolerance for audit failures. So don’t use it unless you know what you’re doing.
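If you want a bit more tolerance, the same awk approach can count matches and only stop the node after a threshold; a rough sketch (the threshold of 2 is an arbitrary assumption, and 2>&1 is added in case the node logs to stderr):

docker logs -f --tail 20 storagenode 2>&1 | awk '/(ERROR|canceled|failed).*GET_AUDIT/ { if (++n >= 2) system("docker stop -t 300 storagenode") }'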
I’ve never had a single audit failure, so I’m not too worried about it.
I had a few due to the locked database.
Right now, if the difference between totalCount and successCount in the API changes, Zabbix sends me an SMS. Though I guess that may be too slow; I should change it to use the log instead of the API.
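For reference, the check itself only takes a couple of lines; a rough sketch, assuming the dashboard API on localhost:14002 and jq (endpoint paths and field layout differ between node versions, so verify them against your node):

#!/bin/bash
# Sketch: print failed audits (totalCount - successCount) per satellite,
# usable as a Zabbix item. Endpoint paths and field names are assumptions;
# verify against your node version's API.

API=http://localhost:14002

for sat in $(curl -s "$API/api/sno" | jq -r '.satellites[].id'); do
  curl -s "$API/api/sno/satellite/$sat" |
    jq -r --arg sat "$sat" '"\($sat): \(.audit.totalCount - .audit.successCount) failed audits"'
done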
Yeah, the suggestion I posted uses a tail on the log, so it’s basically instant. The downside is that if the log is not available or updated for some reason, it would stop working. So keep that in mind. You probably won’t want the log on the same HDD as the data in that case as that would defeat the purpose.
The current DQ implementation feels like a ticking time bomb. I don’t think it’s set up this way because it’s the optimal way to handle trust, as some people suggest. I think Storj just hasn’t gotten around to programming a more reasonable system yet. The fact that users are independently augmenting their nodes with scripts to prevent this aggressive DQ also suggests that there is room for improvement.
I am going to look into implementing one of the previously mentioned commands/scripts to knock my node offline in case of an audit failure. It would be more convenient if the satellite would just suspend me, as so many have already suggested, but until then this will have to do. I don’t want years’ worth of customer data to get DQed in 2 hours because a USB cable was bumped.
It probably should be a suspension and not a DQ. Node fails too many audits - suspended. If the node wants to get un-suspended, it has to pass most of those same audits it failed the first time. This would protect against an unplugged cable, etc.
@Hacker, are your data files in the root of the USB drive? I seem to remember reading elsewhere that if the data is in a subdirectory and the drive goes offline, the node shuts down with an error. So it is recommended not to put the node data in the root of a drive, USB or otherwise.