Suddenly no more ingress

Alexey · April 2, 2024, 7:28am

I think it should resume the handling uploads as soon as it updates its stats on the satellites.
However, I’m not sure why it may happen unless you have had less than 500MB free (by the information in the databases), because in a such even the node will notify the satellites and start to reject uploads.

Do you have warning or info messages with rejecting uploads or complaining on a remained free space?

jammerdan · April 2, 2024, 7:49am

That’s another reason why I have not restarted yet, because I hoped it would maybe resume when it check in with the satellites. But it did not happen yet and as you can see last upload is from 30th of March.

I have a warning Disk space is less than requested. But that was on the 29th. There have been uploads after that message. And I don’t have log messages with reject in it.

Alexey · April 2, 2024, 8:04am

Perhaps that’s the root cause. This check seems happening only on start.
How much the difference between requested and the allocated space?
How much is really used?

jammerdan · April 2, 2024, 8:29am

I have this message:

Disk space is less than requested. Allocated space is   {"process": "storagenode", "bytes": 6857720640512}

But from docker inspect: "Env": ["STORAGE=7.5TB",..... and df -H way above 1TB back then if I remember correctly. Currently it is still over 500GB free physically on host.

Alexey · April 2, 2024, 8:34am

Seems it’s sure that you have only 6.8TB (with 6.05TB used), however, it’s still much more than a minimum 500MB free.
Interesting. Passed to the team, I do not have any ideas right now (and never saw on my nodes).

To get a free space in SI units you need to use --si instead of -H option.

jammerdan · April 2, 2024, 8:59am

Thanks.
So I can restart it now and see if it resumes uploads?

jammerdan · April 3, 2024, 1:45am

Just for the record: Still no ingress on that node.

Can I restart it now?

jammerdan · April 4, 2024, 2:38am

After a full restart the node is receiving ingress immediately again.
As expected.
So whatever it is, there is an issue that prevents nodes from being properly reported to the satellites as available for uploads.
Basically that means that we have to check every node on a regular basis if it is still receiving uploads which is not good.

pangolin · April 4, 2024, 3:17am

I have noticed this problem several times. A restart always solved it so I did not care about the cause. Next time I will check the logs.

JWvdV · April 4, 2024, 3:44pm

I actually found out, to have the same issue on just one node today. That apparently stopped receiving ingress for over an hour.

12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Remote Address": "79.127.205.227:57060", "Size": 14848}
2024-04-04T08:45:34Z    INFO    reputation:service      node scores updated                                            {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Total Audits": 0, "Successful Audits": 0, "Audit Score": 1, "Online Score": 1, "Suspension Score": 1, "Audit Score Delta": 0, "Online Score Delta": 0, "Suspension Score Delta": 0}
2024-04-04T08:45:35Z    INFO    reputation:service      node scores updated                                            {"Process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Total Audits": 0, "Successful Audits": 0, "Audit Score": 1, "Online Score": 1, "Suspension Score": 1, "Audit Score Delta": 0, "Online Score Delta": 0, "Suspension Score Delta": 0}
2024-04-04T08:45:35Z    INFO    reputation:service      node scores updated                                            {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Total Audits": 79, "Successful Audits": 59, "Audit Score": 1, "Online Score": 0.7509803921568627, "Suspension Score": 1, "Audit Score Delta": 0, "Online Score Delta": 0, "Suspension Score Delta": 0}
2024-04-04T08:45:36Z    INFO    reputation:service      node scores updated                                            {"Process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Total Audits": 85, "Successful Audits": 57, "Audit Score": 1, "Online Score": 0.6782051282051282, "Suspension Score": 1, "Audit Score Delta": 0, "Online Score Delta": 0, "Suspension Score Delta": 0}
2024-04-04T08:55:45Z    INFO    Downloading versions.   {"Process": "storagenode-updater", "Server Address": "https://version.storj.io"}
2024-04-04T08:55:46Z    INFO    Current binary version  {"Process": "storagenode-updater", "Service": "storagenode", "Version": "v1.99.3"}
2024-04-04T08:55:46Z    INFO    Version is up to date   {"Process": "storagenode-updater", "Service": "storagenode"}
2024-04-04T08:55:46Z    INFO    Current binary version  {"Process": "storagenode-updater", "Service": "storagenode-updater", "Version": "v1.99.3"}
2024-04-04T08:55:46Z    INFO    Version is up to date   {"Process": "storagenode-updater", "Service": "storagenode-updater"}
2024-04-04T09:10:45Z    INFO    Downloading versions.   {"Process": "storagenode-updater", "Server Address": "https://version.storj.io"}
2024-04-04T09:10:46Z    INFO    Current binary version  {"Process": "storagenode-updater", "Service": "storagenode", "Version": "v1.99.3"}
2024-04-04T09:10:46Z    INFO    Version is up to date   {"Process": "storagenode-updater", "Service": "storagenode"}
2024-04-04T09:10:46Z    INFO    Current binary version  {"Process": "storagenode-updater", "Service": "storagenode-updater", "Version": "v1.99.3"}
2024-04-04T09:10:46Z    INFO    Version is up to date   {"Process": "storagenode-updater", "Service": "storagenode-updater"}
2024-04-04T09:19:03Z    INFO    piecestore      download started        {"Process": "storagenode", "Piece ID": "UICGR4O2ZY4MWNK6J734MC6PZFJMLUKVGPBHJ3LES7XMCVUAFBMQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET", "Offset": 0, "Size": 5376, "Remote Address": "79.127.226.97:40450"}
2024-04-04T09:19:03Z    INFO    piecestore      downloaded      {"Process": "storagenode", "Piece ID": "UICGR4O2ZY4MWNK6J734MC6PZFJMLUKVGPBHJ3LES7XMCVUAFBMQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET", "Offset": 0, "Size": 5376, "Remote Address": "79.127.226.97:40450"}
2024-04-04T09:25:45Z    INFO    Downloading versions.   {"Process": "storagenode-updater", "Server Address": "https://version.storj.io"}
2024-04-04T09:25:46Z    INFO    Current binary version  {"Process": "storagenode-updater", "Service": "storagenode", "Version": "v1.99.3"}
2024-04-04T09:25:46Z    INFO    Version is up to date   {"Process": "storagenode-updater", "Service": "storagenode"}
2024-04-04T09:25:46Z    INFO    Current binary version  {"Process": "storagenode-updater", "Service": "storagenode-updater", "Version": "v1.99.3"}
2024-04-04T09:25:46Z    INFO    Version is up to date   {"Process": "storagenode-updater", "Service": "storagenode-updater"}
2024-04-04T09:40:45Z    INFO    Downloading versions.   {"Process": "storagenode-updater", "Server Address": "https://version.storj.io"}
2024-04-04T09:40:46Z    INFO    Current binary version  {"Process": "storagenode-updater", "Service": "storagenode", "Version": "v1.99.3"}
2024-04-04T09:40:46Z    INFO    Version is up to date   {"Process": "storagenode-updater", "Service": "storagenode"}
2024-04-04T09:40:46Z    INFO    Current binary version  {"Process": "storagenode-updater", "Service": "storagenode-updater", "Version": "v1.99.3"}
2024-04-04T09:40:46Z    INFO    Version is up to date   {"Process": "storagenode-updater", "Service": "storagenode-updater"}
2024-04-04T09:40:47Z    INFO    bandwidth       Performing bandwidth usage rollups                                     {"Process": "storagenode"}
2024-04-04T09:40:47Z    INFO    collector       deleted expired piece   {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "AFQHJLBUROZVLIOFBAK4RZXFAHAHNDQPRYYAGTCUUO2RLF4NW3YA"}
2024-04-04T09:40:47Z    INFO    collector       deleted expired piece   {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "66URVCCIIRLHVAT6JCWFJTXRHNKK35BUJSJYFRXZRKOIFFNPME3A"}
2024-04-04T09:40:47Z    INFO    collector       deleted expired piece   {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "CXLT7TGFPSJEKDGXWZ7CUUBIYF32RSZJUC2GQD7PYWCTWMFPYAOA"}
2024-04-04T09:40:47Z    INFO    collector       deleted expired piece   {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "4CIXFTIBCFXDSAZ3LCDUOKIRFIRCMG2FYV6KMSUF5ACQQPFBNFRQ"}
2024-04-04T09:40:47Z    INFO    collector       deleted expired piece   {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "TFIRXGETVGUBH6GM4YZN3KVG4JJ33C6OKLH7YAAPH6X6NJPHN5UA"}
2024-04-04T09:40:47Z    INFO    collector       deleted expired piece   {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "6PN5B5XO5OMU5HZ7FHGGGO6MEOXGRWPEZSKPTRWUWNYDXH4HTYUA"}
2024-04-04T09:40:47Z    INFO    collector       deleted expired piece   {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "4QSMWHDPWGIRW5A3RKGFLGJYO7WHZYTEROWU77ECDIU36OUKASGA"}
2024-04-04T09:40:47Z    INFO    collector       deleted expired piece   {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "JHPLMWCNUXGTSKHDGGAYYRYPIN7HDCDL2NXG24JETQJZRLA7ITTA"}
2024-04-04T09:40:47Z    INFO    collector       deleted expired piece   {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "M6WPOEA7KIH7ZMHPD3GA3LW7S7ZN6QHV4RWB4FWPTROM7VR3SBRQ"}
2024-04-04T09:40:47Z    INFO    collector       deleted expired piece   {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "ZMJ5WWRFUZ56AOUOFOXL45G72N4LRYZ3P4FG3DL3AEYCRMN7NB6Q"}
2024-04-04T09:40:47Z    INFO    collector       deleted expired piece   {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "3LELAYTNSMO23ZCDYYLAI6V37PQM4QC4RQ6DDGGTIBMB56BXEFMA"}
2024-04-04T09:40:47Z    INFO    collector       collect {"Process": "storagenode", "count": 11}
2024-04-04T09:41:14Z    INFO    orders.12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs                             sending  {"Process": "storagenode", "count": 1}
2024-04-04T09:41:14Z    INFO    orders.121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6                             sending  {"Process": "storagenode", "count": 1}
2024-04-04T09:41:14Z    INFO    orders.12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs                             finished {"Process": "storagenode"}
2024-04-04T09:41:15Z    INFO    orders.121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6                             finished {"Process": "storagenode"}

Just a restart solved the issue.

pangolin · April 5, 2024, 1:46am

It just happened again. This was the last finished ingress I found in the log:

2024-04-04T19:12:08+02:00	INFO	piecestore	upload started	{"Piece ID": "WS6ES6HWXAAZUMXQONMYHYMAJ6QDOOUJJGU4GIOXCFAXRPK5ENFQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Available Space": 649848045568, "Remote Address": "79.127.223.129:42570"}
2024-04-04T22:06:43+02:00	INFO	piecestore	uploaded	{"Piece ID": "WS6ES6HWXAAZUMXQONMYHYMAJ6QDOOUJJGU4GIOXCFAXRPK5ENFQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Size": 2319360, "Remote Address": "79.127.223.129:42570"}

Please note the long delay. This doesn’t make sense to me.

Could the root cause for the no ingress problem just be to many pending uploads because there is no timeout at node side if the satellite doesn’t cancel the upload? This could explain why a restart always solves the problem for some time.

JWvdV · April 5, 2024, 6:12am

In my case, is also the slowest disk I have attached to the node in which this occurred. What disk has been attached to your node?

pangolin · April 5, 2024, 1:23pm

SSD (Samsung QVO 8TB)

JWvdV · April 5, 2024, 7:24pm

Doesn’t sound like very slow actually

Alexey · April 15, 2024, 7:07am

The satellite cannot cancel the upload - it’s not proxying the traffic. The customers’ uplink uploads directly to your node. So it’s possible, that some customer didn’t cancel uploads, but the connection will be terminated anyway (up to 48h as far as I know).

So far I have several ideas:

for some reason the node could report to the satellite as full;
the public IP has changed and DDNS didn’t update it in time, but in this case the node will be rendered as offline;
there was a temporary connection issue and the node was removed from the hot cache on the satellite and suggested to the customers not so often as before, until it again will not be added to the hot cache;

Could you please check - did it have audits requests between these two dates?
Also please check for missing audits:

Toyoo · April 16, 2024, 10:42pm

I’ve seen this on my nodes, with even longer intervals. In my case these caused in some weird cases running out of resources for sockets. I’ve decided to make use of the --storage2.min-upload-speed switch set to a ridiculously small, but non-zero value (in my case this is 1kB/s), which kills these very slow uploads after few minutes.

Alexey · April 17, 2024, 7:44am

wow, it’s interesting! And quite possible: Storagenode creates way too many TCP Sockets.
And a nice workaround by the way. This connection should be terminated by the customer anyway due to a long tail cut.

How did you figure out this case?

Toyoo · April 19, 2024, 6:43pm

I recall seeing some pretty obvious entries in the kernel log. I don’t remember the exact messages used there, sorry.

Alexey · April 20, 2024, 3:52am

I never saw such a problem, but I guess that it also should throw exceptions to the storagenode’s logs too.

jammerdan · May 6, 2024, 7:45am

Happened today again on several nodes.
No ingress since today.
All of them restarted without issue and received ingress immediately again.