Offline since 0.22.1 update. What am I missing?

It looks like I’ve been offline since the 0.22.1 update (or since some other point in the last few days). I’ve spent the last hour troubleshooting and can’t find anything wrong with my setup. I’ve verified that the port is open and the server is connectable via the yougetsignal.com port checker. The docker logs look fine: they just repeat ‘INFO version running on version v0.22.1’ every 15 minutes.

It is normal that it takes a while before your node comes back online after this release, please check the changelog for more details.

Yet another terrible design decision :frowning:

Thanks for the tip.

Yet another user who didn’t read the changelog and, more importantly, doesn’t know the reasons behind that decision. :slight_smile:

Yet another user that didn’t read the changelog

Yet another changelog that wasn’t tagged to announcements. I don’t have time to go diving into the depths of this forum every release and read up and figure out how said changes will affect the user experience. I support this project obviously, but the communication around significant UX changes has so far been far from adequate. If this is ever to be a legitimate service there is so much work around UX and patching that needs to be done.

You can blame the user if you want, but the multiple threads about offline nodes in the last few days indicate to anyone not drinking the Kool-Aid that this was not communicated well.

You received an email about this update, with a link to the changelog as well.

https://forum.storj.io/tags/official-changelog

On the top right you will find a button to subscribe. You will get different kinds of notifications every time I post a new changelog.

It was very clear in the changelog. What is adequate for you? Maybe a personal phone call?

You are welcome to submit your ideas and PRs. Which events do you want to use for online / offline detection? (Hint: It is a trap. Better don’t answer this question. I will hunt you down the rabbit hole until you realize that we had no better option at the moment.)

There is always at least one user who will not read the changelog. Scroll up in this thread and you will find the friendly answer we give them. Only one user decided to call that friendly answer a “terrible design decision”. That was the moment I joined this thread to hold up a mirror.
If you are friendly to me, I will also be friendly to you. Very simple rules. So how about we forget this discussion and try again in the next thread?

Uptime Robot told me the node was offline. It looked like my DDNS at asuscomm.com was down, so I switched to another one, verified that I can reach the storj port with an online tool, and restarted my node with the new DDNS. Still offline, no error in the log. It had been running fine continuously since the last update.

Storage Node Dashboard ( Node Version: v0.22.1 )

======================

ID 1q9p9xjbxwsBPFnKsQ85VwZtBfT22unLAs9TqyU3VXdyYZtuD3
Last Contact OFFLINE
Uptime 21m3s

               Available        Used      Egress     Ingress
 Bandwidth      300.0 TB     14.5 GB     13.4 GB      1.1 GB (since Oct 1)
      Disk        2.7 GB      1.0 TB

Bootstrap
Internal 127.0.0.1:7778
External grich.myftp.org:28967

The online status updated less often:

Really? This is an easy UX fix. Your changelog states: “OFFLINE after a restart for a few minutes because we designed the checkin methode in a way that storage nodes will not send a message on every restart”

So the node knows its own restart status, change the initial start up status from OFFLINE to something like CONTACTING SATELLITE or WAITING FOR CHECKIN or even just STARTING. Not only does this give more info to an end user, you get more info in bug reports/tech support threads, as this instantly lets you know if they have not been able to contact a satellite at all since restart or something happened after ONLINE to cause it to go OFFLINE.
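To make the suggestion concrete, here is a minimal sketch of the three-state idea. It is not taken from the storage node code: `NodeStatus` and `dashboard_status` are hypothetical names, and it assumes the node can tell whether a check-in has been attempted since the last restart.

```python
from enum import Enum
from typing import Optional


class NodeStatus(Enum):
    """Hypothetical dashboard states, as proposed in this post."""
    STARTING = "STARTING"  # restarted, no check-in attempted yet
    ONLINE = "ONLINE"      # most recent check-in succeeded
    OFFLINE = "OFFLINE"    # most recent check-in was attempted and failed


def dashboard_status(last_checkin_ok: Optional[bool]) -> NodeStatus:
    """Map the latest check-in result since restart to a dashboard status.

    None means no check-in has been attempted since the node started.
    """
    if last_checkin_ok is None:
        return NodeStatus.STARTING
    return NodeStatus.ONLINE if last_checkin_ok else NodeStatus.OFFLINE
```

With something like this, OFFLINE would only ever appear after a check-in has actually failed, which is also the distinction that helps in support threads.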

Good UX means I should not have to read the changelog unless I have an interest, it should be informative and expounding, not required.

So it seems you agree reading the changelog should not be required. I don’t know why you would take this as a personal affront. If multiple users end up here confused, it was a bad design decision by definition. You can own that and try to learn more about UX or instead misinterpret feedback as unfriendly and thus miss out on the point and a learning experience. There is yet another person experiencing this same confusion that posted above in this very thread. At what point does the feedback sink in?

The email delivery reliability of whatever service is being used is also terrible. It’s super hit or miss: I get some of the changelog emails, and even got a nice “Hey your node is offline” email, but nothing about this specific changelog, not even in the Spam folder. It’s always been this way; witness the invitation email debacle, where people were getting multiple invites, even weeks apart. Dropping the changelog in the container and on GitHub would reach a lot more people.

I have some other fixes in mind to get a better balance between protecting the satellite against DDoS and allowing the storage nodes to change port without up to 1h of downtime. The UI is not the real problem here.

You started by blaming, and I wouldn’t call that good feedback. There is nothing I could learn from it. Now you have had one idea that might be useful (if the other fixes are not enough). Again, I have to point out that I prefer a better mood.

So let me try to get better communication going here by taking the first step. I would love to work with you on a solution. I kindly request that we follow the rules of constructive criticism.

I’ll jump in the fire here… Why not?

I think there are two separate things in this paragraph…

  1. Practical network improvements from an architectural standpoint.
  2. Notification to the node operator of “What is going on”.

It seems from your post, that you are concerned with point 1… However, node operators are less concerned with point 1 and more concerned about point 2. It’s more important for a node operator to be informed of the present condition and future expected condition of one’s node rather than seeking to improve the overall Storj architectural response.

If, instead of a simple “Offline” or “Online” notification, the indicator showed a status message based on the error messages in the node’s own log, that would help node operators immensely. And such messages would not impact the satellites at all. It’s mostly parsing the log and translating the message stream into something meaningful to a non-expert in the inner workings of the SNO software.

So, in other troubleshooting posts, there have been errors such as “Failed to ping satellites” …

The Dashboard will show “Offline” … but that message is not helpful in figuring out why the node is offline.
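As a rough illustration of that log-translation idea, something like the following could work. This is only a sketch: `HINTS` and `status_from_log` are hypothetical names, the only log text taken from this thread is “Failed to ping satellites”, and the hint wording is invented for the example.

```python
from typing import Iterable

# Hypothetical mapping from log substrings to operator-facing hints.
# Only "Failed to ping satellites" is quoted from this thread;
# the hint text itself is an assumption made for illustration.
HINTS = {
    "Failed to ping satellites": (
        "OFFLINE: could not reach any satellite; "
        "check port forwarding and DDNS"
    ),
}

DEFAULT_HINT = "No known problem found in the log"


def status_from_log(lines: Iterable[str]) -> str:
    """Return the hint for the most recent log line matching a known pattern.

    Falls back to DEFAULT_HINT when no line matches.
    """
    result = DEFAULT_HINT
    for line in lines:
        for needle, hint in HINTS.items():
            if needle in line:
                result = hint  # later matches overwrite earlier ones
    return result
```

All of this runs locally against the node’s own log, so it would not add any load on the satellites.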

I totally agree with expanding the Last Contact section. To avoid DDoS’ing the satellite, on initial startup the node could sleep for a random time, say up to a minute, with status “STARTING”, then try initial contact, at which point the status changes to “ONLINE”. Not leaving it at “OFFLINE” for an unknown period of time after startup would prevent these excessive forum posts.
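A minimal sketch of that random startup delay, assuming a uniform distribution and the “up to a minute” figure from this post. `MAX_STARTUP_DELAY_S` and `startup_delay` are hypothetical names, not anything in the storage node software.

```python
import random

# "Say up to a minute", as suggested in this post (an assumption, not a real setting).
MAX_STARTUP_DELAY_S = 60.0


def startup_delay(rng: random.Random) -> float:
    """Pick a uniformly random delay in [0, MAX_STARTUP_DELAY_S] seconds."""
    return rng.uniform(0.0, MAX_STARTUP_DELAY_S)
```

A node would show “STARTING”, call something like `time.sleep(startup_delay(random.Random()))`, and only then attempt its first contact, so nodes restarted at the same moment do not all hit the satellite at once.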

Let me answer with a simple example. You start your node for the first time and it sends the checkin message to the satellite. Now you want to change the port, and you have to restart the node for it. The satellite doesn’t know about that port change. For the next 59 minutes you will be offline. We can change the wording from offline to something else, but you will still be offline for 59 minutes. I would like to find a solution for the 59 minutes. As soon as we have that fix in place we can take a look at the frontend.

I would even go one step further. Is there a reason for having a 1 minute delay? Maybe I can convince the developer team to send the first request on startup. At that point we can work with OFFLINE/ONLINE just fine. Yes, this comes with a new problem: we have to find a better way to deal with the DDoS problem. Do not update all storage nodes at the same time. Even if they update at the same time, pick a random time for the second checkin. Let’s see at which point the developer team accepts the ticket.
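To illustrate the schedule described here (first checkin at startup, second at a random time, regular interval afterwards), a small sketch follows. It is not the actual implementation: `CHECKIN_INTERVAL_S` is assumed to be one hour based on the “up to 1h” figure in this thread, and `checkin_times` is a hypothetical helper.

```python
import random
from typing import List

# Assumed from the "up to 1h downtime" figure in this thread.
CHECKIN_INTERVAL_S = 3600.0


def checkin_times(rng: random.Random, count: int) -> List[float]:
    """Return the first `count` checkin times, in seconds since startup.

    First checkin: immediately at startup.
    Second checkin: at a random point within one interval, to spread load.
    Later checkins: one full interval after the previous one.
    """
    times: List[float] = []
    if count >= 1:
        times.append(0.0)
    if count >= 2:
        times.append(rng.uniform(0.0, CHECKIN_INTERVAL_S))
    while len(times) < count:
        times.append(times[-1] + CHECKIN_INTERVAL_S)
    return times
```

Even if every node restarts at the same moment after an update, only the first checkin is synchronized; from the second one on, the requests are spread across the whole interval.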

We will not get this into release-alpha23. At the moment the priority is to remove all the Kademlia code, including bootstrap. Maybe we can get it into release-alpha24 in 2 weeks, or as a cherry-pick in between.

Good news. My Jira ticket was accepted without any issues. The fix is incoming and should get into the next release.
