Disqualified on one satellite - Europe North - I have no errors in the log

I’ve been running Storj V3 since April 2019 without any issues. I’ve updated for each new release and am currently on 1.15.3. My node ID: 1PhmxtYEeaYrH2sYYyqvvukmWJeCpMFH6gDMYVbB9duid9P4Ek

Today I noticed I’ve been disqualified at the Europe North satellite. At all the other satellites I got 100%/100%. The message I see is ‘Your node has been disqualified on 12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB . If you have any questions regarding this please check our Node Operators thread on Storj forum.’

I’ve checked the docker log and there are no errors. I searched for ERROR, FAIL and AUDIT. Only AUDIT returned anything, and those are just the normal INFO messages saying ‘download started’ and ‘downloaded’.
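The search was essentially this, with storagenode as the container name:

docker logs storagenode 2>&1 | grep -E "ERROR|FAIL|AUDIT"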

The only thing that has happened is about 4-5 hours of downtime about a week ago while moving the equipment to a new building (docker storagenode was stopped correctly, of course). And yesterday there was an electrical issue in the building that they had to fix, so that was another hour of downtime (docker storagenode was stopped correctly this time too).

But if this is the cause of the disqualification, why only at one satellite and 100% on all the others?

How can I get online with the Europe North satellite again?


Disqualification is only possible if your node fails audits, so it must have been online and answering audit requests, but did not provide the requested piece, either because it’s lost or inaccessible.
In case of a timeout (5 minutes), the node is placed into containment mode and asked for the same piece three more times (with a 5 minute timeout each). If the storagenode is still unable to provide the piece, the audit is considered failed.
In case of “file not found”, the audit is counted as failed immediately.
Too many failed audits in a short period of time (a few hours) and the node will be disqualified.
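If you still have logs covering the window where the score dropped, you can search them for those failure signatures. A rough sketch, assuming your container is named storagenode (the exact message wording can differ between versions):

# Failed audit transfers (audits are logged with the GET_AUDIT action):
docker logs storagenode 2>&1 | grep GET_AUDIT | grep -i failed

# The "file not found" case, which fails immediately:
docker logs storagenode 2>&1 | grep -i "file does not exist"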

The disqualification is permanent and not reversible. If you do not have logs for the period between the 100% audit score and 59%, then you can’t see the error. When you remove the container, its logs are removed with it too.

Ok, so delete the node and start over?

Why is that?
You still have the other satellites, and customers of those satellites will still pay you for the service.
Of course, you can call a graceful exit on the remaining satellites (if your node is eligible to do so, i.e. it’s older than 6 months on those satellites) and start over if you wish.

No, I’m out. Maybe I’ll come back in a couple of years when the error handling has matured.

In the satellite logs I can see a lot of audit timeouts between 2020-10-27 21:14:50.435 CET and 2020-10-28 08:23:17.384 CET

An example would be this one:

error: "context deadline exceeded; piecestore: (Node ID: 1PhmxtYEeaYrH2sYYyqvvukmWJeCpMFH6gDMYVbB9duid9P4Ek, Piece ID: 246UTQ6X2ZKNS7NYMUCEA46I5U3YHRVF7ZZ6GDYRTASIQFOTGKQA): context deadline exceeded"

That satellite is one of the bigger satellites. You might just have been lucky that none of the other satellites sent you enough audits in that timeframe to notice it.

The easiest way would be to spin up a new node. You could even reduce the allocated space of the old node to below what it already stores. That way the old node will stop receiving any new data and the new node will slowly take over everything. At some point call graceful exit on the old node and that should be it.
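For example, shrinking the old node’s allocation is just a matter of re-creating its container with a smaller STORAGE value than the space it already holds. A sketch with placeholder values; reuse the real parameters from your original docker run command:

docker stop -t 300 storagenode
docker rm storagenode
# Same parameters as before, but STORAGE set below the currently used space,
# so the node stops accepting new uploads:
docker run -d --restart unless-stopped --stop-timeout 300 \
    -p 28967:28967 \
    -e WALLET="0x..." -e EMAIL="you@example.com" \
    -e ADDRESS="your.external.address:28967" \
    -e STORAGE="500GB" \
    --mount type=bind,source=/mnt/storagenode/identity,destination=/app/identity \
    --mount type=bind,source=/mnt/storagenode/data,destination=/app/config \
    --name storagenode storjlabs/storagenode:latest

The new node then needs its own identity, its own storage location and its own ports (e.g. -p 28968:28967 with a matching ADDRESS and port forward), but it can share the external IP.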

That is strange; that was not the time/day we moved. Then it can only be an ISP issue (a routing issue between Europe North and us?), as this server is one of my company’s servers, used 24/7, and if it is down my customers will make sure I know about it very, very quickly :slight_smile:

Ok, your reply made me feel a little more positive :+1: I will let it be for a while and try to find info about running two nodes on the same external IP.

For this specific failure your node must have been online and must have accepted the audit request. It then has 5 minutes to deliver a 1 KB stripe.

Do you have netdata or something similar running? Maybe that could give us a hint if something was going on in that timeframe.

Unfortunately, we don’t have netdata. We are working on setting up Zabbix at all our sites, but it will be about two months until that project is finished. If this happens again, we should then have all the needed server data available. For now, I will save the docker/storagenode logs every 30 minutes to make sure we keep the logs even if the docker container is restarted.
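Something like this cron entry should be enough as a safety net (the log path is just an example; the 30-minute boundaries may duplicate or miss a line or two, which is fine for this purpose):

*/30 * * * * /usr/bin/docker logs --since 30m storagenode >> /var/log/storagenode-docker.log 2>&1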

I have a second node up and running now; I will wait until it’s vetted before I start changing things. Thanks :+1:


An easier way would be to redirect the logs to an external file. Following these instructions will place the log file in the same directory as your node data. Personally, I mounted a separate volume in the container and redirected the logs there. You’ll also want to set up logrotate for this file.
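For example (paths are placeholders, adjust them to your mounts): add one line to the config.yaml in the node’s storage location, then restart the container so it takes effect:

log.output: "/app/config/node.log"

docker restart -t 300 storagenode

A logrotate entry such as /etc/logrotate.d/storagenode then keeps the file from growing unbounded:

/mnt/storagenode/data/node.log {
    weekly
    rotate 8
    compress
    missingok
    notifempty
    copytruncate
}

copytruncate matters here because the storagenode keeps the log file open while writing.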
