Suddenly no more ingress

What could be the reason? The node is not suspended, the disk has space available, and the node has space available.
But the logs only show egress.
After a stop-and-remove cycle the node immediately sees ingress again.
What is wrong?

Did you have “ping satellite failed” errors before re-creating the container?
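You can check for that with something along these lines (a sketch, assuming the default container name and that the node logs to the container):

docker logs storagenode 2>&1 | grep -i "ping satellite failed" | tail -n 5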

No ping errors.
Here is what I see:

Node has no ingress.

docker inspect storagenode | grep STORAGE
                "STORAGE=6.5TB"

Node dashboard: (screenshot)

Multinode dashboard: (screenshot)

116 GB of overusage would explain why there is no ingress. But why is there a discrepancy between the node dashboard and the multinode dashboard?

I am sure that when I restart the node, it will immediately get ingress again.

Did you have a message “Disk space is less than requested”?
Or errors related to databases?
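Both can be checked with something like this (a sketch, assuming the default container name):

docker logs storagenode 2>&1 | grep -iE "disk space is less than requested|database" | tail -n 20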

However, even if the node thought there was an overusage, your node still has enough free space.

I think this message only shows up after startup, and it has already been truncated from the logs.
The nodes that I have already restarted get ingress again and they don’t show that message, but that is how it should be.

Did you have missing audits?

No, audits are at 100% for each satellite.

One weird thing is that the node had uploads:

docker logs storagenode | grep uploaded | wc -l
183874
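(Failed uploads could be counted the same way, e.g. with something like the following, assuming the usual "upload failed" log message:)

docker logs storagenode 2>&1 | grep "upload failed" | wc -l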

If I relied only on the node dashboard, it would not indicate any issue that could cause ingress to halt. This is not good.

Yes, audits might be at 100%, but I asked about the numbers: did you have missing ones?

Actually, you would see a drop in ingress and then a flat line.

Do you want me to run the commands from the linked post?

Yes, please. They are for bash and PowerShell.
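(For reference, the bash variant is roughly of this shape; a sketch that assumes the default dashboard port 14002, jq installed, and that the per-satellite endpoint exposes the audit windows under .auditHistory.windows; the exact command in the linked post may differ:)

for sat in $(curl -s localhost:14002/api/sno/ | jq -r '.satellites[].id'); do
  curl -s "localhost:14002/api/sno/satellite/$sat" | jq '{id: .id, auditHistory: .auditHistory.windows}'
done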

I had restarted the node in the meantime so I don’t know if the results are still meaningful:

{
  "id": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE",
  "auditHistory": []
}
{
  "id": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6",
  "auditHistory": []
}
{
  "id": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S",
  "auditHistory": []
}
{
  "id": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs",
  "auditHistory": [
    {
      "windowStart": "2024-01-24T00:00:00Z",
      "totalCount": 306,
      "onlineCount": 304
    },
    {
      "windowStart": "2024-01-30T00:00:00Z",
      "totalCount": 195,
      "onlineCount": 191
    }
  ]
}

Perhaps that satellite had cached your node as offline.
The restart forced a check-in, so it updated its cache.

But why do the dashboards show different values too?

Did you check the databases? The multinode dashboard takes its data via the storagenode API, while the single-node dashboard reads directly from the databases and config.yaml.
So I guess either the API provided wrong information or the multinode dashboard has a bug somewhere.
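(One way to compare the two sources is to query the same node API that the multinode dashboard uses; a sketch assuming the default dashboard port 14002 and that the space figures are reported under .diskSpace:)

curl -s localhost:14002/api/sno/ | jq '.diskSpace'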
However, I cannot reproduce that.
By the way, did you have errors in the multinode console or logs?

I have a “list node trusted satellites” internal error.

Databases are ok. It is very strange.

Could you please copy it in full?

ERROR   console:endpoint        list node trusted satellites internal error       {"error": "nodes: context canceled", "errorVerbose": "nodes: context canceled\n\tstorj.io/storj/multinode/nodes.(*Service).trustedSatellites:357\n\tstorj.io/storj/multinode/nodes.(*Service).TrustedSatellites:318\n\tstorj.io/storj/multinode/console/controllers.(*Nodes).TrustedSatellites:243\n\tnet/http.HandlerFunc.ServeHTTP:2047\n\tgithub.com/gorilla/mux.(*Router).ServeHTTP:210\n\tnet/http.serverHandler.ServeHTTP:2879\n\tnet/http.(*conn).serve:1930"}

So, one of the nodes has a similar issue in its logs?

No, that is from the multinode console.
The node did not have errors.

The multinode dashboard doesn’t check the satellites list itself; only the node does, so a similar error should appear in the node’s logs at around the same time.