Disqualified after 2 hours of [edit] failed audits?

Generally nodes don’t get reinstated unless there is a specific software bug. Despite what others may say, this is not a software bug: your hardware failed. And while I agree that it would be great to implement a feature that stops the node if the storage location is not available, that would be a new feature that changes the node design. Currently the node is working as designed. The node failing in your example wasn’t a bug in the node software, but a failure of your hardware. You can still create a support ticket and give it a try, but I don’t expect reinstating your node is on the table.

3 Likes

There’s a really simple solution that could be used…

Place an empty file in the storage data directory. It could be called anything, but let’s call it:

is_connected

Run a script on your OS that checks if the file exists every 15 minutes or so. If the file doesn’t exist, the script shuts down the node.
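The point being that the file lives on the mounted drive itself, so it disappears from the path the moment the mount is gone. Creating it could look something like this (the mount point is just a placeholder):

touch /mnt/storj/storagenode/is_connected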

1 Like

Does such a script exist?

if [ ! -f /path/to/test/file ];
then
  docker stop -t 300 storagenode
fi

Simple enough. Just put that in a script you call from crontab. But really the node software should do this itself.

I think this one-liner should also work. You could put it straight into crontab without having to call a script.

[ ! -f /path/to/test/file ] && docker stop -t 300 storagenode

Edit: This is just for Linux of course.
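As a crontab entry checking every 15 minutes, that might look something like this (the file path is a placeholder):

*/15 * * * * [ ! -f /path/to/test/file ] && docker stop -t 300 storagenode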

3 Likes

Beat me to it!

In any case… I’ll post what I was writing anyway…


A bash script to check if a file exists on GNU/Linux systems with bash installed:

#!/bin/bash

# Check for the sentinel file; set the path to match your own directory layout.
if [ ! -f is_connected ]; then
   echo -e "WARNING! Storage is disconnected.\nShutting down Node...\n"
#   iptables -I INPUT -p tcp --dport 28967 -j REJECT
   docker stop -t 300 storagenode
   exit 0
fi
echo -e "Storage is connected.\n"
# Re-open the port if it was closed earlier (iptables -C exits 0 when the rule exists).
#if iptables -C INPUT -p tcp --dport 28967 -j REJECT 2>/dev/null; then
#  iptables -D INPUT -p tcp --dport 28967 -j REJECT
#fi

Notes:

  • I have not debugged the script, and the path for the empty file needs to be set properly for whatever system and directory setup applies.

  • The is_connected file should be located in the configuration directory of your storage node. This directory will need to be accessible by whichever user is running the script. In the case of a docker install, the simplest method is probably to create a cron job for root that runs the script every 15 minutes. However, it could be set to run every minute without any system problems.

To run every minute via cron:

As root:

  • crontab -e
  • * * * * * connected.sh

Tests

Run script without the file created:

$ ./connected.sh
WARNING! Storage is disconnected.
Shutting down Node...

Create an empty file and run the script:

$ touch is_connected
$ ./connected.sh
Storage is connected.

Delete the file and run the script:

$ rm is_connected
$ ./connected.sh
WARNING! Storage is disconnected.
Shutting down Node...

EDIT

Added closing the port via iptables to the script. To have it close and re-open the port, uncomment the iptables lines (remove the #).

4 Likes

You were a lot more complete! :wink: So definitely worth posting as well.

There might still be the issue of the storagenode “freezing” so that it just won’t shut down and won’t get killed. I’m not sure whether it would keep answering audits in that case, but it would still be running to some extent, stuck in iowait.

Any set of commands can be placed in the script. It’s possible to forcibly close the port if the concern is that the node won’t shut down properly.
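For example, something along these lines could go in the disconnect branch, reusing the port and container name from the script above (just a sketch, not tested):

iptables -I INPUT -p tcp --dport 28967 -j REJECT
docker kill storagenode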

1 Like

I don’t think you really understand just how unreliable hard drives are. Storing data on a single drive long term is a failure waiting to happen. Sure, it might work for stuff that is rarely accessed, but that’s not the case here… this is 24/7 availability.
In recent months I’ve been juggling my storagenode plus most of my own data around, multiple times, which means I’ve moved close to 100 TB… you wouldn’t believe how often errors pop up.

Sure, it might not happen; it’s very much a matter of luck, and I don’t know exactly how bad it would be for a storagenode. But from what I’ve been able to learn and experience copying around these vast amounts of data, there will generally be an error about every 10 TB or so, even when everything is functioning normally. Whether that is mostly related to writing, reading or both I don’t know, but it’s a factor. And if you have a drive that is going bad, it will start producing tons of errors until the system can figure out which sectors are bad… if it can keep the drive working at all.

Remember, HDDs are basically vinyl LPs, just made of metal and using magnetism instead… it’s amazing they can even be as reliable as they are.

I don’t think running storagenodes on single hard drives can be done for multiple years. It’s a great way to start and makes it a lot more affordable, but one should also expect random errors to affect it, so the storagenode software will need to be robust enough that errors don’t matter much in most cases, like an OS or other mature software is today.

You may think redundancy in data storage is an extra expense, but I would argue that the time you end up spending (or wasting) dealing with random issues caused by errors will cost you more than the redundancy ever will… at least long term.

I’m running a variation of your older one-line script; currently the only trigger is “error”. xD
So far it’s been going for like 12 hours without an error, lol.
I really like how my node is performing these days.

I wanted to verify that it will shut down the node :smiley: I never did get the docker one to work, but it’s been a good while since I looked at it, and that was mainly because I couldn’t make it run as part of my already existing docker log setup.

I would add a check for failed audits as well. You could imagine cases where the USB drive disconnects and reconnects but gets a new device ID (whatever that whole /dev/sdX deal is called), and then the storagenode keeps using the old /dev/sdX name and fails audits without the HDD counting as disconnected.
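A rough sketch of such an audit check, assuming failed audits show up in the docker logs as lines containing GET_AUDIT and failed (that wording is an assumption):

docker logs --since 15m storagenode 2>&1 | grep GET_AUDIT | grep -qi failed && docker stop -t 300 storagenode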

Or let’s say an SMR drive stalls out while running on something like ZFS: the is_connected file would be in the ARC and get served from there, and thus be verified as existing on the drive, because ZFS knows it’s there even if the latency is insane… and yet everything else, like audits, might still be failing.

Anyway, I’m sure there are many such examples. Checking access to the drive is of course a much less demanding way to keep track of it, but sadly it isn’t a bulletproof solution.

Anyway, those are my thoughts about it.

I might steal that iptables thing and make a bit of an amalgamation of your and Bright’s scripts.
It looks really useful. I was thinking of using ifdown, since I have the node on a dedicated NIC, but I really like your solution; it will apply to so many more people.

This should be managed via the filesystem UUID in fstab.
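For example, a hypothetical /etc/fstab entry (the UUID, mount point and filesystem are placeholders; blkid shows the real UUID):

UUID=0a1b2c3d-1111-2222-3333-444455556666  /mnt/storj  ext4  defaults  0  2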

3 Likes

@beast
Well, that may be, but if one doesn’t use the very mechanism that actually kills the node to detect when it’s failing, how will it ever get even close to being 100% effective?

I think we’re talking about different things. I am saying that I am not willing to subsidize Storj by buying additional hard drives to create a RAID cluster or some other type of redundancy.

ZFS would probably need some special configuration.

I don’t know enough about ZFS options to provide hints on a solution… perhaps there’s a per-file setting that allows one to indicate that a given file should not be cached. However, it’s rather unlikely that anyone is going to be using ZFS over a USB-connected array of many drives.
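For what it’s worth, ZFS does have a per-dataset (not per-file) property that keeps file data out of the ARC; a sketch, with the pool/dataset name assumed:

zfs set primarycache=metadata tank/storj

The check script would then also have to actually read the file (e.g. cat it) rather than just test for its existence, since metadata can still be served from cache.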

I’m saying I’m not convinced there is a choice, if one expects a node to hold as much data as it would accumulate in, say, 2 years.

ZFS pools pause all I/O on errors, or might simply not mount correctly during boot… and the check might also work on ZFS for all I know; I’m just not 100% sure about it, more like 50/50. I’m leaning towards the file being cached, but ZFS may still verify that it’s on the drive, just a bit slower… no matter.

The point I was trying to make is that there are maybe a thousand different setups, and monitoring the audits might be the only way to cover them all. But your method is so elegant I would of course keep it; it doesn’t take any meaningful server load to run anyway, so why not… it may catch an issue audits don’t, or even save the node from a few failed audits.

@BrightSilence is correct, this is a hardware issue. We have added a feature request to prevent the storagenode from starting if the mountpoint is missing or has disappeared.
There is no ETA and no direct or final solution yet.
There are workarounds:

  1. Use a subfolder on your disk for the node’s data, as suggested in the documentation (see the sketch after this list). This only helps for the docker version, though.
  2. Move the identity to the same disk and configure your storagenode to use this moved identity’s folder as an identity location: Step 5. Create an Identity - Storj Docs
  3. Do not use USB/network connected drives if possible.
  4. Configure a watch script, similar to the examples posted above.
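For workarounds 1 and 2, a sketch of the relevant docker run mount options when both the data subfolder and the identity live on the disk (host paths are placeholders; the container paths are the ones the standard docker run command uses):

--mount type=bind,source=/mnt/storj/storagenode,destination=/app/config
--mount type=bind,source=/mnt/storj/identity/storagenode,destination=/app/identity

With --mount type=bind, docker refuses to start the container if the source folder is missing, which is what protects the node when the disk is not mounted.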

2 Likes

Well, it could also be an OS, driver or firmware issue.

That helps only at startup. If you have an uptime of months, you’d have to restart the storage node every 30 minutes to catch, e.g., a USB drive disconnecting.

This is the only workable solution at the moment, but it would make much more sense to build this into the node software than to have each SNO run such a script.

1 Like

Yes, it could be. However, it’s not under our software’s control.

Not only at startup. In some cases the container can fail and be restarted by docker; in that case it will fail to start, fail again, and end up in cyclic restarts.
But that is still much better than being disqualified at that point.
So the best workaround at the moment is a watch script appropriate for your OS.