Node disqualified on all satellites

I was away last week and got an email that my node was disqualified. I don't really understand why. My assumption is that something went wrong during the update?

Anyway, my node never went into suspension mode, which is also a bit weird, isn't it?

My stats are all fine:

========== AUDIT ==============
Critically failed:     0
Critical Fail Rate:    0.000%
Recoverable failed:    0
Recoverable Fail Rate: 0.000%
Successful:            383
Success Rate:          100.000%
========== DOWNLOAD ===========
Failed:                9
Fail Rate:             0.054%
Canceled:              61
Cancel Rate:           0.365%
Successful:            16656
Success Rate:          99.582%
========== UPLOAD =============
Rejected:              0
Acceptance Rate:       100.000%
---------- accepted -----------
Failed:                3
Fail Rate:             0.012%
Canceled:              78
Cancel Rate:           0.307%
Successful:            25365
Success Rate:          99.682%
========== REPAIR DOWNLOAD ====
Failed:                0
Fail Rate:             0.000%
Canceled:              0
Cancel Rate:           0.000%
Successful:            8377
Success Rate:          100.000%
========== REPAIR UPLOAD ======
Failed:                0
Fail Rate:             0.000%
Canceled:              10
Cancel Rate:           0.119%
Successful:            8359
Success Rate:          99.880%
========== DELETE =============
Failed:                0
Fail Rate:             0.000%
Successful:            1813
Success Rate:          100.000%

The only error I see in my log is:

ERROR contact:service ping satellite failed {"Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "attempts": 7, "error": "ping satellite error: rpc: dial tcp 78.94.240.189:7777: connect: connection refused", "errorVerbose": "ping satellite error: rpc: dial tcp 78.94.240.189:7777: connect: connection refused\n\tstorj.io/common/rpc.Dialer.dialTransport:211\n\tstorj.io/common/rpc.Dialer.dial:188\n\tstorj.io/common/rpc.Dialer.DialNodeURL:148\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:124\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:95\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:152\n\tstorj.io/common/sync2.(*Cycle).Start.func1:71\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}

If I load my dashboard I see my node went offline on the 6th and never came back online. So probably something went wrong during the update.

Probably nothing can be done anymore, but maybe it's of interest to the Storj team?

Do you redirect your logs to a file? Perhaps an upgrade happened between when it was DQd and when you checked the stats. Nodes won’t necessarily go into suspension mode before being disqualified.
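If you don't redirect them yet, the usual way is to set log.output in config.yaml and restart the node, so the log survives container restarts. A rough sketch for a docker setup (the host path and container name are just examples):

# append the setting to the node's config.yaml (host path is an example), then restart
echo 'log.output: "/app/config/node.log"' >> /mnt/storj/storagenode/config.yaml
docker restart storagenode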

Unfortunately not…
But I see no ingress or egress since Oct 6, so I am probably disqualified because of downtime…

Disqualification for downtime is not enabled yet.
So it shouldn’t be for that reason.

I think I might know why: from what I understood in different sections of the forum, 118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW is down.
It's known as satellite.stefan-benten.de:7777 and it is apparently down for good.

This is not related to the disqualification on all satellites.
To be disqualified, your node has to:

  1. Answer audit requests (i.e. be online)
  2. Be unable to provide a piece for audit four times, with a 5-minute timeout each time.
  3. Do this for several audit requests within a short period of time (a few hours).

The reason for the inability to provide a piece is the main point.
If the node responds with "file not found", the audit is marked as failed immediately.
If the node is unable to upload the few KB of the piece needed for the audit within 5 minutes, it is placed into containment mode and asked for the same piece three more times. If it is still unable to provide it, the audit is considered failed.
Too many failed audits -> quick disqualification.
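You can check your current log for such failures yourself, for example (a rough sketch for a docker node; the container name is an example):

docker logs storagenode 2>&1 | grep GET_AUDIT | grep -iE "failed|error"

If that prints nothing, there are no failed audits in the part of the log you still have.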

But as you can see, I never had any failed audits. So I have no idea what happened. It must have crashed somewhere after the update.

You haven't had an audit failure since the last reboot; that is not the same as "never".
The script only shows data from the current log; to check all audits over the whole life of the node you need to use the API…

For docker, something like this (in a terminal):

for sat in $(wget -qO - 192.168.1.15:14003/api/sno | jq -r '.satellites[].id'); do wget -qO - 192.168.1.15:14003/api/sno/satellite/$sat | jq '.id, .audit'; done

(Adjust the IP and port to what you have set for your node.)
If you are on Windows you may need to use a different command (ask @Alexey).

Fragment of the result:

{
  "totalCount": 826,
  "successCount": 825,
  "alpha": 19.99999,
  "beta": 1e-05,
  "unknownAlpha": 19.99999,
  "unknownBeta": 0,
  "score": 0.9999995,
  "unknownScore": 1
}
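For context: the score shown here is derived from alpha and beta as alpha / (alpha + beta), and as far as I know a node gets disqualified on a satellite once that score drops below the threshold (0.6 at the time of writing). A quick check with the numbers from the fragment above:

echo "scale=7; 19.99999 / (19.99999 + 0.00001)" | bc
# prints .9999995, i.e. the "score" field above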

Yes, I get that. But I did not reboot before I ran the script. As I said, I was away when I received the emails; a few days later I was able to run the script.

This is the output:

"118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW"
{
  "totalCount": 93708,
  "successCount": 91688,
  "alpha": 19.99999999999995,
  "beta": 0,
  "unknownAlpha": 19.99999999999995,
  "unknownBeta": 1.5455275790487003e-14,
  "score": 1,
  "unknownScore": 0.9999999999999993
}
"1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"
{
  "totalCount": 9902,
  "successCount": 9578,
  "alpha": 11.97474,
  "beta": 8.02526,
  "unknownAlpha": 20,
  "unknownBeta": 0,
  "score": 0.5987370000000001,
  "unknownScore": 1
}
"121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"
{
  "totalCount": 65898,
  "successCount": 65081,
  "alpha": 11.97474,
  "beta": 8.02526,
  "unknownAlpha": 20,
  "unknownBeta": 0,
  "score": 0.5987370000000001,
  "unknownScore": 1
}
"12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"
{
  "totalCount": 51334,
  "successCount": 50338,
  "alpha": 11.97474,
  "beta": 8.02526,
  "unknownAlpha": 20,
  "unknownBeta": 0,
  "score": 0.5987370000000001,
  "unknownScore": 1
}
"12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"
{
  "totalCount": 118235,
  "successCount": 116476,
  "alpha": 11.97474,
  "beta": 8.02526,
  "unknownAlpha": 20,
  "unknownBeta": 0,
  "score": 0.5987370000000001,
  "unknownScore": 1
}
"12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB"
{
  "totalCount": 35340,
  "successCount": 34178,
  "alpha": 11.97474,
  "beta": 8.02526,
  "unknownAlpha": 20,
  "unknownBeta": 0,
  "score": 0.5987370000000001,
  "unknownScore": 1
}

If the node was updated in the meantime, it restarted itself. It's also possible that the node was in so much trouble it wasn't even able to write to the logs. This can happen when disk issues are the cause of the problem; it would likely also leave your node barely able to run at all, and therefore unable to respond to audits.
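If you want to rule out disk trouble, a quick look at the kernel log and the drive's SMART data usually tells you enough. A sketch (assumes smartmontools is installed; the device name is an example):

dmesg | grep -iE "i/o error|ata[0-9]|ext4-fs error"
sudo smartctl -a /dev/sda | grep -iE "overall-health|reallocated|pending"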

I do not have disk issues though… Well, I'll gracefully exit this node and start a new one.
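(For anyone finding this later: graceful exit is started per satellite; for a docker node the documented way was roughly the command below. Note that, as far as I know, it is no longer possible on satellites where the node has already been disqualified.)

docker exec -it storagenode /app/storagenode exit-satellite --config-dir /app/config --identity-dir /app/identity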