Storjlabs/watchtower went crazy

ddon · February 28, 2020, 8:55pm

Hello, I had a very healthy node that worked for some time without any issues. Then I started to notice strange errors in the logs, looked around, and found that watchtower had created more than 1000 docker instances. After that I noticed a message on my node: “Your node has been disqualified on XXXX, If you have any questions regarding this please check our Node Operators thread on Storj forum”.

So I killed all the docker instances and started a clean beta storj instance. It is running fine now, but my node is still disqualified. What is the procedure, and what should I do now?

heunland · February 28, 2020, 9:46pm

You can keep running the node on the satellites where it is not yet disqualified. If it is disqualified on all satellites, request a new auth token and start a new node from scratch. You can then delete the old node and all associated data, which by that time is already being repaired to other nodes anyway.

ddon · February 29, 2020, 1:37pm

OK, I will request a new auth token… It just seems like a strange system: the node was down because of the update bug in the watcher, and as a result my node gets disqualified.

Alexey · February 29, 2020, 2:25pm

The node can be disqualified only for failed audits at the moment.
To fail audits, your node must have lost access to its data.
This can happen if you use the -v option instead of --mount in your docker run command, if you deleted data for some reason, or unexpectedly during a drive failure or power loss.
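
For context on the -v point: per Docker's documentation, -v silently creates a missing host directory, while --mount type=bind fails with an error instead. So if the data disk is not mounted when the container starts, -v can hand the node an empty directory. A minimal sketch of a pre-start check along those lines follows; data_dir_ready and the DATA_DIR path are hypothetical names used for illustration, not part of any Storj tooling.

```shell
#!/bin/sh
# Sketch: check that the node's data directory exists and is not empty
# before starting the container. Names and paths here are assumptions.

# Return 0 only if the directory $1 exists and has at least one entry.
data_dir_ready() {
  [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Hypothetical path; use the source= value from your own docker run command.
DATA_DIR="${DATA_DIR:-/mnt/storj/storagenode}"

if data_dir_ready "$DATA_DIR"; then
  echo "ok: $DATA_DIR exists and is not empty"
else
  echo "warning: $DATA_DIR is missing or empty, do not start the node" >&2
fi
```

In a real start script you would probably exit with a non-zero status in the else branch rather than only printing a warning.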

ddon · February 29, 2020, 2:41pm

Docker stopped working correctly when the watcher created more than a thousand copies, the server in general stopped working normally, and that is probably why it failed an audit.

I use --mount for sure, I haven’t deleted any data, and I haven’t had a drive failure…

ddon · February 29, 2020, 2:43pm

Anyway, if nothing can be done, and the data I have was marked as lost and is no longer needed, I will delete it and start over, no problem…

I requested a new auth token, but nothing came by email (I used the same email)… I will wait, maybe it will take some time, I don’t know.

heunland · February 29, 2020, 3:52pm

Please be sure to use only the Chrome or Firefox browser, with all adblockers disabled, when requesting an auth token. If you have not received it by now, you have to make a new request. If you already requested another auth token with the same email address, you won’t get a new one until you have claimed the previous auth token (signed an identity with it). Also, no more than 1 request per 24 hours.

fmoledina · March 3, 2020, 2:57am

I had the exact same issue happen to me late last week. Were you running on ARM? My x86 nodes with v2tec/watchtower are not facing the same issue. Haven’t tested storjlabs/watchtower on x86. On the ARM node (Helios4 device), I decided to use storjlabs/watchtower since I was only using that node for Storj.

I’m not 100% sure if storjlabs/watchtower is the culprit, but I can definitely say that I had upwards of 100 containerd-shim instances and a load average of 80+. In my case, the node just went offline, so fortunately I didn’t fail any audits. I was able to move the identity + data to another machine and start up again.
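
A rough way to check for the same symptoms, as a sketch only: count_image is a hypothetical helper, and the script assumes a Linux host where docker, pgrep (from procps) and /proc/loadavg are available.

```shell
#!/bin/sh
# Sketch: look for runaway watchtower containers and containerd-shim processes.

# Count the lines on stdin that contain the string $1.
# grep -c prints 0 and exits 1 when nothing matches, so normalize the status.
count_image() {
  grep -c -- "$1" || true
}

if command -v docker >/dev/null 2>&1; then
  # List the image of every container (running or stopped) and count matches.
  echo "watchtower containers: $(docker ps -a --format '{{.Image}}' | count_image watchtower)"
fi

# pgrep -c prints the number of matching processes.
echo "containerd-shim processes: $(pgrep -c containerd-shim || true)"

# The first three fields are the 1, 5 and 15 minute load averages.
if [ -r /proc/loadavg ]; then
  echo "load average: $(cut -d' ' -f1-3 /proc/loadavg)"
fi
```

On a healthy single-node setup you would expect one watchtower container; hundreds of them, or of shim processes, would match what is described in this thread.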

ddon · March 3, 2020, 8:50am

Why is there no built-in ability for a node to start over? Why do I need to get a new identity? I requested one, but never got an answer.

baker · March 3, 2020, 2:11pm

The reason you must start over is that this prevents bad actors from harming the network. In a (nearly) trustless system, you don’t want nodes that have proven to be untrustworthy back on the network.

Make sure you use Chrome or Firefox with adblockers disabled. Also, if you happen to have a previously unclaimed auth token attached to the same email address, the system won’t give you a new one until the previous one has been claimed (this may not apply to you).

dstave · March 9, 2020, 7:19pm

Hi, I unfortunately did this a couple of days ago on my Ubuntu 16.04 machine. For a long time my node had been running fine and auto-upgrading via watchtower, so I had not paid attention to the new parameters. I wanted to add the new parameters for the web console (totally unnecessary for the Linux CLI, but I did it). I opened my old start command (with the -v option) and modified it, and now I’m unable to stop and remove the storagenode container. I have probably been disqualified and lost my 2.5 TB of data…

Alexey · March 9, 2020, 8:43pm

You can uninstall Docker and then install it again, as a simple solution.

The more complicated way (the wildcard has to be expanded by root, hence sh -c):

sudo service docker stop
sudo sh -c 'rm -rf /var/lib/docker/containers/*'
sudo service docker start

Then start the storagenode and watchtower containers again.