Suddenly no more ingress

What could be the reason? The node is not suspended, the disk has space available, and the node has space available.
But the logs only show egress.
After a stop-and-remove cycle the node immediately sees ingress again.
What is wrong?

Did you have “ping satellite failed” errors before re-creating the container?
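You can check for that with something along these lines (a sketch, assuming the default container name and that the node logs to the container):

docker logs storagenode 2>&1 | grep -i "ping satellite failed" | tail -n 5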

No ping errors.
Here is what I see:

Node has no ingress.

docker inspect storagenode | grep STORAGE
                "STORAGE=6.5TB"

Node dashboard: (screenshot)

Multinode dashboard: (screenshot)

116 GB of overusage would explain why there is no ingress. But why is there a discrepancy between the node dashboard and the multinode dashboard?

I am sure that when I restart the node, it will immediately get ingress again.

Did you have a message “Disk space is less than requested”?
Or errors related to databases?
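Both can be checked with something like this (a sketch, assuming the default container name):

docker logs storagenode 2>&1 | grep -iE "disk space is less than requested|database" | tail -n 20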

However, even if the node thought there was an overusage, your node still has enough free space.

I think this message only shows up after startup, and it has already been truncated from the logs.
The nodes that I have already restarted get ingress again and they don’t show that message, but that is how it should be.

Did you have missing audits?

No, audits are at 100% for each satellite.

One weird thing is that the node had uploads:

docker logs storagenode | grep uploaded | wc -l
183874
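(Failed uploads could be counted the same way, e.g. with something like the following, assuming the usual "upload failed" log message:)

docker logs storagenode 2>&1 | grep "upload failed" | wc -l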

If I relied only on the node dashboard, it would not indicate any issue that could cause ingress to halt. This is not good.

Yes, audits might be at 100%, but I asked about the numbers: did you have missing ones?

Actually, you would see a drop in ingress and then a flat line.

Do you want me to run the commands from the linked post?

Yes, please. They are for bash and PowerShell.
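(For reference, the bash variant is roughly of this shape; a sketch that assumes the default dashboard port 14002, jq installed, and that the per-satellite endpoint exposes the audit windows under .auditHistory.windows; the exact command in the linked post may differ:)

for sat in $(curl -s localhost:14002/api/sno/ | jq -r '.satellites[].id'); do
  curl -s "localhost:14002/api/sno/satellite/$sat" | jq '{id: .id, auditHistory: .auditHistory.windows}'
done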

I had restarted the node in the meantime so I don’t know if the results are still meaningful:

{
  "id": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE",
  "auditHistory": []
}
{
  "id": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6",
  "auditHistory": []
}
{
  "id": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S",
  "auditHistory": []
}
{
  "id": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs",
  "auditHistory": [
    {
      "windowStart": "2024-01-24T00:00:00Z",
      "totalCount": 306,
      "onlineCount": 304
    },
    {
      "windowStart": "2024-01-30T00:00:00Z",
      "totalCount": 195,
      "onlineCount": 191
    }
  ]
}

Perhaps that satellite had cached your node as offline.
The restart forced a check-in, so it updated its cache.

But why do the dashboards show different values too?

Did you check the databases? The multinode dashboard takes its data via the storagenode API, while the single-node dashboard reads directly from the databases and config.yaml.
So I guess either the API provided wrong information or the multinode dashboard has a bug somewhere.
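(One way to compare the two sources is to query the same node API that the multinode dashboard uses; a sketch assuming the default dashboard port 14002 and that the space figures are reported under .diskSpace:)

curl -s localhost:14002/api/sno/ | jq '.diskSpace'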
However, I cannot reproduce that.
By the way, did you have errors in the multinode console or logs?

I have a “list node trusted satellites” internal error.

Databases are ok. It is very strange.

Could you please copy it in full?

ERROR   console:endpoint        list node trusted satellites internal error       {"error": "nodes: context canceled", "errorVerbose": "nodes: context canceled\n\tstorj.io/storj/multinode/nodes.(*Service).trustedSatellites:357\n\tstorj.io/storj/multinode/nodes.(*Service).TrustedSatellites:318\n\tstorj.io/storj/multinode/console/controllers.(*Nodes).TrustedSatellites:243\n\tnet/http.HandlerFunc.ServeHTTP:2047\n\tgithub.com/gorilla/mux.(*Router).ServeHTTP:210\n\tnet/http.serverHandler.ServeHTTP:2879\n\tnet/http.(*conn).serve:1930"}

So, one of the nodes has a similar issue in its logs?

No, that is from the multinode console.
The node did not have errors.

The multinode dashboard doesn’t check the satellites list itself; only the node does, so a similar error should appear in the node’s logs at around the same time.