All my nodes were suspended in February, but I updated things less than a few weeks ago and they were online working with 100% stats

What happened? Server uptime is 19 days. I checked the nodes when the system booted, and they were all online with 100% or 99.x% stats. I checked them today and they have been offline for 4 hours, and according to the alert in the top right everything was suspended months ago.

  1. Am I suspended? What happened?

  2. If I’m not suspended, as my stats all show I’m still at 100% or damn near minus these 4 hours, how do I get back online?

Sampling of the logs looks good, I see nothing in the last tail of each node that suggests failure. Bandwidth and utilization are still good. It still shows it is doing things with a current timestamp even though it is offline. Everything looks good until 4 hours ago when I noticed it went down and saw the suspension note.

Dug deeper, found this, but still am unsure if this is the cause:

Failed to dial storage node, and 2021-05-10T03:26:10.779Z WARN contact:service Your node is still considered to be online but encountered an error.

Edit: Sorry for all the edits, trying to add in as much info before someone responds. Digging deeper here…All of the nodes are offline, some report suspension alerts, others have no alerts. The ones that have suspension alerts have a log and information similar to what is shown above, nothing obvious stands out, and they are still “working”.

Going to bed, checked the tail of each node log just now, they all look like the above image, current timestamps, no issues reported, downloads started, gets, etc…Everything is still “Offline.” For something “Suspended” in February, it is still reporting Current Month earnings that seem reasonable, and I have payouts for April and March. What is going on here?

Hi, I’m confused, where exactly are you seeing “offline”? If it’s in the browser, can you do a CTRL+F5?

Node Dashboard, did ctrl+f5. Loaded in a private tab, same. Shows as offline, and shows the 4 hour thing (with traffic)…still.

Things to check:

  1. ddclient runs hourly, IP in log matches the result of dig +short myip.opendns.com @resolver1.opendns.com, which also matches my google domain (which loads without issue on https).

  2. port table is still there, all storj ports/ips are noted, along with my other services (https, etc…) (unchanged since early 2020)

  3. All local devices are static assigned IP (unchanged for years)

  4. I see nothing in the logs, no issues with ports being opened or closed. According to https://www.yougetsignal.com/tools/open-ports/, my IP/port for storj nodes is open.

  5. Correct as of posting last night, PC runs on UTC, I was posting at midnight NYC time, screenshots above show 0400 hours. That seems right.

  6. 3 successful pings, all less than 22ms.
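
Check 1 above (the public IP matching the dynamic DNS record) can be scripted. This is a minimal sketch, assuming `dig` is installed; the helper name `ddns_matches` and the hostname `node.example.com` are placeholders made up for illustration, not anything from the Storj tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the DDNS record resolves to the
# address the outside world currently sees for this machine.
ddns_matches() {
    local public_ip="$1" resolved_ip="$2"
    # An empty value means one of the lookups failed, so treat it as a mismatch.
    [ -n "$public_ip" ] && [ "$public_ip" = "$resolved_ip" ]
}

# Usage (needs network access; node.example.com is a placeholder):
#   public_ip="$(dig +short myip.opendns.com @resolver1.opendns.com)"
#   resolved_ip="$(dig +short node.example.com | tail -n 1)"
#   ddns_matches "$public_ip" "$resolved_ip" && echo "DDNS ok" || echo "DDNS stale"
```

A match only shows that the DNS record is current. It says nothing about whether the forwarded port is reachable, which is what check 4 covers.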

Edit: I still see clean docker logs with traffic, but it still reads, “Offline” on the dash. This log snippet is the matching pair of the dashboard image in this post. Timestamps are within seconds of this edit.

Still very confused.

Edit Edit: Looked back at the log, saw the satellite id, tracked that down on the node dashboard and that still shows as, “Offline”, scroll down, Suspension/Audit are both 100%.

Could you please check your firewall? It should not block incoming traffic to your node’s port and should not block any outgoing traffic.
Please also check your identity: Identity - Node Operator
If you moved the identity from the default path, please use the new path in the checking commands.

Incoming traffic is not blocked. I show that the port is open online and can ping it successfully from yougetsignal.com (pm for domain/IP). Also, the docker logs show that it is downloading happily with “INFO piecestore downloaded” notes every few minutes

The Identity - Node Operator link you sent returns back the results of 2 and 3 respectively.

Then something is blocking the traffic, because the satellite is unable to reach your node. It sends a message, not an ICMP ping, and if it gets no response, it considers your node offline.
Please note that each satellite checks your node independently.

Please try opening your dashboard in Chrome in Incognito mode, and make sure that you are looking at the dashboard of the node in question.

Unsure what you mean in your first paragraph. As for chrome, I installed it new, ran an “incognito” tab, looks the exact same as firefox.

It still shows, “4 hours ago”, which seems strange, all satellites show 100/100 on suspension/audit for that node. Why is it pinned at “4 hours ago” for the last 16 hours?

PM and you can probe the server all you like. I can’t figure out what changed.

I just ran docker stop [all the nodes], then docker start [all the nodes]. The uptime that previously was shown as 16h 6m, is now…4 hours. It should be zero. …it’s still, “Offline”

Would this prevent me from gracefully exiting the network?

Last contact is the timestamp of when your node last successfully contacted all satellites.
The satellite does not use an ICMP ping to determine whether your node is available; instead it sends a message to the node and expects a response.
The last successful response was 4 hours ago.
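
Since a successful ICMP ping does not prove the node port is reachable, a TCP connection test is a closer approximation. This is only a sketch: the helper name `tcp_open` is made up for illustration, it relies on bash's built-in `/dev/tcp` redirection and the coreutils `timeout` command, and a successful connect still only shows that something accepts connections on that port, not that the node answers the satellite's request.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if a TCP connection to host:port can be
# opened within 3 seconds. Uses bash's built-in /dev/tcp redirection.
tcp_open() {
    local host="$1" port="$2"
    timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Usage, run from a machine outside your network. The hostname is a
# placeholder; use the external port you actually forwarded to the node:
#   tcp_open node.example.com 28967 && echo "reachable" || echo "not reachable"
```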

If your node keeps showing the old uptime even after a restart, that doesn’t look normal.
Please don’t only docker stop the containers, but docker rm them too and then run them again.

I issued:
docker stop [nodes]
docker rm [nodes]

Brought them up one by one, they all came up without error. Returning to the dashboard, they are still “Offline”, and the times still read 4hr and 4hr. If the 4 hour was accurate, it should have drifted by now…seeing as how it is a few hours later if that was really the “last contact”.

Can I still perform a graceful exit?

It’s dangerous in that case. If your node is considered offline, the Graceful Exit will fail after a few days offline and the node will be disqualified.

curl http://localhost:14002/api/sno | jq '.lastPinged'

Replace the port with your own.

Ay, 4 hour is the UTC offset. Weird, I’ve never seen this before, on a restart it would go to zero while maintaining UTC on the system.

curl -L http://localhost:14002/api/sno | jq '.lastPinged'

# curl -L http://localhost:14002/api/sno | jq '.lastPinged'
% Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed
100 44 100 44 0 0 44000 0 --:--:-- --:--:-- --:--:-- 44000
100 1416 100 1416 0 0 460k 0 --:--:-- --:--:-- --:--:-- 460k
"2021-05-10T18:18:38.593081582Z"
# date
Mon 10 May 2021 06:18:43 PM UTC
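
To turn the two values above into a number, the age of `lastPinged` can be computed directly. A minimal sketch, assuming GNU `date` (for the `-d` option); the helper names `to_epoch` and `ping_age_seconds` are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: seconds between a lastPinged value and a reference time.
# Both arguments are UTC timestamps like 2021-05-10T18:18:38.593081582Z.
# Requires GNU date (the -d option).
to_epoch() {
    local ts="${1%Z}"     # drop the trailing Z
    ts="${ts%%.*}"        # drop fractional seconds
    date -u -d "${ts/T/ } UTC" +%s
}

ping_age_seconds() {
    echo $(( $(to_epoch "$2") - $(to_epoch "$1") ))
}

# The two values from the output above, 5 seconds apart:
ping_age_seconds "2021-05-10T18:18:38.593081582Z" "2021-05-10T18:18:43Z"   # prints 5
```

Against a live node the same check would look like `ping_age_seconds "$(curl -sL http://localhost:14002/api/sno | jq -r '.lastPinged')" "$(date -u +%Y-%m-%dT%H:%M:%SZ)"`, with the port replaced by your own. An age of a few seconds means the node itself was contacted just now, whatever the web dashboard displays.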

…still shows as “Offline”, last contact is +4 hr, online for +4:23hr…which I read as “recently contacted ‘just now’, and online for 23 minutes”. Still don’t get the “Offline” thing though.

Maybe it calculates the Offline status from the local time, and that differs by 4 hours? :thinking:
Could you please run the CLI dashboard?

docker exec -it storagenode ./dashboard.sh

I believe we’ve seen this before if the local machine has a different time zone. The web dashboard seems to compare to local time. So this is especially relevant if you’re opening the web dashboard on a remote machine.

Even if the time looks correct, make sure the local machine is also set to the correct time zone. Otherwise, if the timezone is wrong but the time is correct, the local system would derive UTC wrong.
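
The "time correct, timezone wrong" case can be illustrated with plain arithmetic. This is a sketch assuming GNU `date`; the helper name `utc_error_seconds`, the wall-clock reading, and the `-0400` offset (New York daylight time) are illustrative values, not measurements from any machine in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: how far off the derived UTC is when a wall-clock
# reading that is really in zone $2 gets interpreted as UTC instead.
# $1: wall-clock reading, e.g. "2021-05-10 14:18:43"
# $2: the zone the reading is really in, as a numeric offset, e.g. "-0400"
utc_error_seconds() {
    local true_epoch assumed_epoch
    true_epoch="$(date -u -d "$1 $2" +%s)"       # reading in its real zone
    assumed_epoch="$(date -u -d "$1 UTC" +%s)"   # same reading mislabeled as UTC
    echo $(( true_epoch - assumed_epoch ))
}

utc_error_seconds "2021-05-10 14:18:43" "-0400"   # prints 14400, i.e. 4 hours
```

A machine in that state would see a fresh `lastPinged` value as being 4 hours away from its own idea of "now", which is the kind of gap that could make a dashboard comparing against local time report a node as offline.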

Well now…that command shows status ONLINE up time of 2h2m11s

The LAN PC I am using to check serverip:14002 is also set to UTC, shows current time as 8:03PM (It’s 4:03 PM NYC).

Sounds like things are working though based on the command Alexey had me run, and your comment. Safe to ignore and keep an eye on it?

Yes I’m pretty sure your node is working fine. Still curious where this discrepancy is happening though.

I will gladly supply any logs if it helps with stomping out a potential bug…or will keep trying things if you folks like.

So, just a quick note. I switched the LAN PC (what is being used to check the dashboard) to local time (EST) instead of the matching server UTC.

Dashboard shows everything is “Online”.

Server/Node (UTC) + LAN PC (UTC) + Location (EST) = Offline

Server/Node (UTC) + LAN PC (EST) + Location (EST) = Online
