Your Storage node is suspended

XNAC · December 5, 2023, 8:55am

On Saturday 2 December 2023 I received a message by email stating that
“your storage node on AP1 satellite has been suspended for being offline for too long”
The node is active 24/7 for at least 1 year except for a 2 hour outage 2 weeks ago.
I have 4 satellites, and only one of them gives me this error.
I collected the errors with this command:
docker logs storagenode 2>&1 | grep -E "GET_AUDIT|GET_REPAIR" | grep failed
The result is below. The satellite that suspended the node, “121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6”, does not appear in it.
2023-12-04T15:31:57Z ERROR piecestore download failed {"process": "storagenode", "Piece ID": "OPUHQZTHACY5U6IHBUCOLPGH263NB7FKTMVCP6OZOMKAZCIUYCFA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_REPAIR", "Offset": 0, "Size": 1218560, "Remote Address": "157.90.26.38:50830", "error": "manager closed: read tcp 10.0.3.2:28967->157.90.26.38:50830: read: connection timed out", "errorVerbose": "manager closed: read tcp 10.0.3.2:28967->157.90.26.38:50830: read: connection timed out\n\tstorj.io/drpc/drpcmanager.(*Manager).manageReader:231"}
2023-12-04T16:54:33Z ERROR piecestore download failed {"process": "storagenode", "Piece ID": "HKBYI5JFLHVT35EHBQ4AUWWBFJBBBDBC5EHI2TYOMVBHECGBGLYQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_REPAIR", "Offset": 0, "Size": 1218560, "Remote Address": "195.201.216.62:59692", "error": "manager closed: read tcp 10.0.3.2:28967->195.201.216.62:59692: read: connection timed out", "errorVerbose": "manager closed: read tcp 10.0.3.2:28967->195.201.216.62:59692: read: connection timed out\n\tstorj.io/drpc/drpcmanager.(*Manager).manageReader:231"}
2023-12-04T17:45:03Z ERROR piecestore download failed {"process": "storagenode", "Piece ID": "6MJV5E7WWHOFB3SFA4R3WZ462CRZJMMKPVDZL5JPLN3TKE6VGX4A", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR", "Offset": 0, "Size": 1218560, "Remote Address": "199.102.71.25:54724", "error": "manager closed: read tcp 10.0.3.2:28967->199.102.71.25:54724: read: connection timed out", "errorVerbose": "manager closed: read tcp 10.0.3.2:28967->199.102.71.25:54724: read: connection timed out\n\tstorj.io/drpc/drpcmanager.(*Manager).manageReader:231"}
2023-12-04T19:09:49Z ERROR piecestore download failed {"process": "storagenode", "Piece ID": "SJ652IA2RACQ5MWQM4NMAFUSAIFKHCJJNQHYXQU7E4Q2PD3COTZA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_REPAIR", "Offset": 0, "Size": 580096, "Remote Address": "159.69.36.227:40188", "error": "manager closed: read tcp 10.0.3.2:28967->159.69.36.227:40188: read: connection timed out", "errorVerbose": "manager closed: read tcp 10.0.3.2:28967->159.69.36.227:40188: read: connection timed out\n\tstorj.io/drpc/drpcmanager.(*Manager).manageReader:231"}
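
To see at a glance which satellites the failed transfers belong to, the same filter can be extended to count failures per satellite. This is only a minimal sketch, not an official Storj tool: the function name `count_failures_by_satellite` is made up for illustration, and it assumes the raw log uses plain ASCII double quotes around field names and values.

```shell
# Hypothetical helper: reads storagenode log lines on stdin and prints
# "<count> <satellite id>" for failed audit/repair downloads, busiest first.
count_failures_by_satellite() {
  grep -E 'GET_AUDIT|GET_REPAIR' \
    | grep 'failed' \
    | sed -n 's/.*"Satellite ID": *"\([^"]*\)".*/\1/p' \
    | sort \
    | uniq -c \
    | sort -rn
}

# Example (assumes the container is named "storagenode", as in the command above):
#   docker logs storagenode 2>&1 | count_failures_by_satellite
```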

I attach screenshots of what I got yesterday and what I get today. During these two days, the node has not been paid; I only restarted it after updating Docker, in case that was the cause.
Two satellites show a lower online score; the other two are still at the same percentage.
Yesterday

Today

I have read these entries in the forum and support, but they don’t clarify anything and the online % keeps going down.

The ports in the firewall are open and working; otherwise the online % would also drop on the two satellites where it is holding steady.

What else can I check?
Regards

elek · December 5, 2023, 11:34am

Looks to be some kind of connectivity issue. Might be a problem related to your ISP, or firewall / router or your host OS.

Different satellites use different locations to audit / repair pieces.

Our logs show that your node was often unavailable to our repair service (TCP connection timeouts most of the time).

That’s why you couldn’t see it in your log: no connection was ever established. The attempts failed before reaching the node.

This seems to have started on the 25th of November and ended yesterday morning. I am not sure if it’s fixed or if the activity only stopped because of the suspension.

I would recommend checking the OS-level metrics / stats. If it’s a Linux server, I would check the maximum number of allowed TCP connections / open files, as well as the logs / dmesg. If you have a router, it may also block requests…

I checked the connectivity manually from one of these servers and it looks good (at least the DRPC port is open). Sometimes it’s slower, but still fine. That’s why I think it’s more likely a TCP limit issue…
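
The checks suggested above could look roughly like this on a generic Linux host. Treat it as a sketch under assumptions: the helper names are made up, the `/proc` paths only exist when the kernel's connection-tracking module is loaded, and a NAS vendor's firmware may expose things differently.

```shell
# Hypothetical helpers for a Linux host; names are illustrative only.

# conntrack_usage [COUNT_FILE] [MAX_FILE]
# Prints how full the connection-tracking table is, as "count/max (NN%)".
conntrack_usage() {
  count_file=${1:-/proc/sys/net/netfilter/nf_conntrack_count}
  max_file=${2:-/proc/sys/net/netfilter/nf_conntrack_max}
  if [ ! -r "$count_file" ] || [ ! -r "$max_file" ]; then
    echo "conntrack: not available"
    return 0
  fi
  count=$(cat "$count_file")
  max=$(cat "$max_file")
  echo "conntrack: $count/$max ($((count * 100 / max))%)"
}

# scan_kernel_log: reads kernel log text on stdin and keeps only lines
# that hint at an exhausted limit (conntrack table, SYN backlog, file handles, memory).
scan_kernel_log() {
  grep -Ei 'nf_conntrack: table full|SYN flooding|file-max limit|out of memory' || true
}

# Example:
#   conntrack_usage
#   dmesg | scan_kernel_log
#   ulimit -n                  # open-file limit of the current shell
#   cat /proc/sys/fs/file-nr   # allocated, unused, and maximum file handles
```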

XNAC · December 5, 2023, 3:45pm

Hello,
We have a Fortigate firewall with two Internet uplinks, each with a fixed IP. SD-WAN is configured to balance the load and, if one uplink fails, to send traffic out through the other.
The firewall has been reconfigured so that it does not block any incoming or outgoing traffic.
Regarding TCP connections: I run Storj in Docker on the QNAP NAS, following the website’s instructions for Docker on Linux. (I don’t use the QNAP app mentioned on the website; I tried it and couldn’t get it to work, so I opted for Docker.) At the time, the configuration gave me a lot of problems, so I opened a ticket with Storj and, with help, got it working. It hasn’t been touched at all in more than 1 year.
The TCP configuration of Docker is the default, and it worked perfectly without any change for more than 1 year.
How can I check if Docker has limitations?
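
One way to look at the limits a container actually runs with is to read the limits of its main process. The sketch below is an illustration, not an official procedure: `open_file_limits` is a made-up helper name, and the container name `storagenode` is assumed from the earlier `docker logs` command.

```shell
# Hypothetical helper: reads the text of a /proc/<pid>/limits file on stdin
# and prints the soft and hard "Max open files" limits.
open_file_limits() {
  awk '/^Max open files/ { print "soft=" $4, "hard=" $5 }'
}

# Example (assumes the container is named "storagenode"):
#   docker exec storagenode cat /proc/1/limits | open_file_limits
#   docker inspect --format '{{json .HostConfig.Ulimits}}' storagenode
```

If `docker inspect` prints `null`, no ulimits were set explicitly for the container and it runs with the Docker daemon's defaults.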

Roxor · December 5, 2023, 5:24pm

That’s… quite the firewall

Knowledge · December 5, 2023, 6:27pm

If it is slow normally but works, it may be that something on the NAS increased the latency and for some Satellites it timed out. Sounds like the extra load on the NAS has been lifted but your performance is still slow. So, it could happen again.

Alternatively, you might have intrusion protection on in your Firewall and it is turning on at times. We’ve seen that before. You may also be having an ISP issue blocking the IP / Port.

It is difficult to say with all of the variables what it could be.

XNAC · December 5, 2023, 6:48pm

Hello, I’m using a translator; my English is bad.
The NAS is used only for the Storj Docker container, without any other kind of load. The NAS firmware has not even been updated since Storj was installed.
Regarding the firewall, absolutely everything is disabled for the NAS LAN IP, so there should be no blocking. The firewall logs have been checked and nothing is blocked for the LAN IP.
The ISP part may be a possibility, I’m going to schedule a monitor for the IP and port every 5 minutes to see if it is blocked. At the moment, and after 30 minutes, the monitor is UP.
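
A periodic probe like the one described here could be sketched as follows. The function names are hypothetical, the script needs bash (for `/dev/tcp`) and coreutils `timeout`, and it is assumed to run from a machine outside the network, since a probe from inside the LAN to the WAN IP would not exercise the ISP path.

```shell
# Hypothetical helpers; require bash (for /dev/tcp) and coreutils `timeout`.

# check_port HOST PORT [TIMEOUT_SECONDS]
# Succeeds if a TCP connection to HOST:PORT opens within the timeout.
check_port() {
  timeout "${3:-5}" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# log_port_status HOST PORT [TIMEOUT_SECONDS]
# Prints one timestamped line ending in UP or DOWN.
log_port_status() {
  if check_port "$1" "$2" "${3:-5}"; then status=UP; else status=DOWN; fi
  printf '%s %s:%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" "$status"
}

# Example: probe every 5 minutes and append the result to a log file.
#   while true; do log_port_status XXX.XXX.XXX.XXX 28967 >> port-monitor.log; sleep 300; done
```

An UP result only shows that the TCP port accepts connections from the probing location; it does not prove that every satellite can reach the node.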

In the .yaml configuration file, I have a public IP
-e ADDRESS="XXX.XXX.XXX.XXX:28967", which is one of the two WAN IPs of the firewall, although it is not the one the NAS currently uses for outgoing Internet traffic. Does that have any influence, or does it not matter?

Knowledge · December 5, 2023, 6:53pm

If it is reachable and the node responds, it should be fine. Although receiving on one IP and sending on another seems to be unnecessarily complicated. Can you just isolate the node to one specific IP/WAN?

XNAC · December 5, 2023, 7:33pm

Yes, I can, but if I don’t, can storj stop working?

Knowledge · December 5, 2023, 7:52pm

Well it seems to be working right now, so if that remains the case I guess you are good. Does it switch to the other IP at times and do you manage the DNS for that? Or is it always receiving on one and sending on another?

XNAC · December 5, 2023, 8:15pm

I have an SD-WAN connected to two ISPs with fixed IPs; I don’t have NLB for DNS.
The SD-WAN balances the workload across the WAN IPs based on various parameters, such as packet loss, loss of connectivity, etc.

If, as you say, everything is OK, should my online % improve tomorrow?

Knowledge · December 5, 2023, 8:21pm

See how it looks in the morning if the numbers are going up. Your situation is unusual because it appears some Sats are not having the issue that others are. And that would typically mean some kind of filtering is happening. It could be performance related, LAN related, or ISP related. Something is causing just those Sats to have an issue.

But maybe it has resolved itself. See how it looks tomorrow and we can go from there.

daki82 · December 5, 2023, 8:30pm

In this case, every node should be fixed to one IP. No wonder there are errors if the load balancer switches IPs continuously.

XNAC · December 5, 2023, 8:31pm

It has been working like this for more than 1 year.
Is it possible to assign two IPs to the node?

Alexey · December 6, 2023, 2:48am

Using a (D)DNS name - yes, but it will be used in a round-robin manner: each lookup will return a different IP, and if the port on that IP is not forwarded to the node at that moment, the node will appear offline.
So you need to forward the same port on both IPs to that node and make sure that it responds to the source over the same route.

XNAC · December 6, 2023, 8:24am

Ok, thanks, that would be no problem.

XNAC · December 7, 2023, 8:22am

Hi,
Today, the online ratio % is the same, is this normal?

Another question, I have a node with 80 TB. After more than 1 year, I only have 10 TB occupied on the node, is this normal?

Regards

Alexey · December 7, 2023, 8:25am

Yes. It depends on your node, not on something global.
And usage depends on customers, so again - yes - it’s normal, because customers’ behavior is not predictable.

XNAC · December 7, 2023, 9:56am

Last Saturday I got a suspension warning. Could my node be disabled if it continues at these rates?

daki82 · December 7, 2023, 10:54am

Only the corresponding satellite will stop working.
Shouldn’t the online score recover faster? @Alexey

Knowledge · December 7, 2023, 2:24pm

Your node won’t get disabled but it could become disqualified on the satellite at issue. This cannot be undone, so hopefully that isn’t the case. Are you seeing any unusual errors in your logs?