The SLA is not met :)

Unless you backed up to two satellites intentionally to have a fallback if one satellite is down.
Seems a little wasteful though.

We definitely messed up a bit with email notifications. For starters, we didn’t think through the fact that most customers are not registered on status.tardigrade.io for notifications (embarrassing, obvious oversight). For the people that were registered, the status page service automatically marked the planned maintenance window resolved (when it was not!) and we didn’t properly re-trigger the maintenance on the page. So, we’re going to be working on both of those things, most likely starting with registering everyone for planned maintenance notifications.


@anon68609175, is your SNO dashboard uptime decreasing due to this planned outage?

Explain to me again how a distributed network can have downtime…

To me those sound like two basically impossible things to reconcile; granted, there will always be some amount of downtime for most, if not all, systems…

However, the fact that anyone even has the power to shut it down is something I would regard as a flaw in the design…
The network shouldn't care that you take a satellite down for data migration. Each satellite should be a cluster of satellites; granted, they wouldn't need to run the same software or databases, but they would of course need to be able to compare their databases. Their underlying software wouldn't have to be the same, so this would allow one to upgrade a satellite and put it back in the cluster,

where it will sync up with the online satellites, and then one can upgrade the next one…

Of course, the cluster of satellites could consist of something like 3–6 servers (satellites) in, let's call it, a constellation: the first 3 satellites form the 1st constellation and the next 3 the 2nd… Each constellation would then be a cluster (running one program version), and between the two constellations the database entries are synchronized.

Such a setup should allow for basically any upgrade or failure without affecting the overall system…

And then, to minimize the need for hardware, one could let the individual satellites share servers… Of course, the satellites within one cluster, or within interlinked constellations, would each have to run on individual servers…
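Just to illustrate the idea (this is purely a sketch of my suggestion, not how Storj actually works; every name and endpoint below is made up): a client could treat a constellation as one logical satellite and simply fail over to the next replica when one is down for maintenance, something like:

```go
// A rough sketch of the "constellation" idea: a client that treats a group
// of satellite replicas as one logical satellite and fails over when one
// replica is down for maintenance. Everything here is hypothetical.
package main

import (
	"fmt"
	"net/http"
	"time"
)

// Constellation is a group of interchangeable satellite endpoints whose
// databases are kept in sync with each other.
type Constellation struct {
	Satellites []string      // replica base URLs, e.g. three per constellation
	Timeout    time.Duration // per-replica request timeout
}

// Get tries each satellite in turn and returns the first healthy response,
// so taking one replica down for an upgrade never shows up as downtime.
func (c Constellation) Get(path string) (*http.Response, error) {
	client := &http.Client{Timeout: c.Timeout}
	var lastErr error
	for _, sat := range c.Satellites {
		resp, err := client.Get(sat + path)
		if err != nil {
			lastErr = err // replica unreachable; try the next one
			continue
		}
		if resp.StatusCode < 500 {
			return resp, nil // healthy replica found
		}
		resp.Body.Close()
		lastErr = fmt.Errorf("%s returned %d", sat, resp.StatusCode)
	}
	return nil, fmt.Errorf("all satellites in constellation failed: %w", lastErr)
}

func main() {
	c := Constellation{
		Satellites: []string{
			"https://sat-a.example.com", // hypothetical replica addresses
			"https://sat-b.example.com",
			"https://sat-c.example.com",
		},
		Timeout: 2 * time.Second,
	}
	if _, err := c.Get("/health"); err != nil {
		fmt.Println("constellation unavailable:", err)
	}
}
```

The point being: as long as the replicas keep their databases in sync, an upgrade of any single one is invisible from the outside.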

Downtime is basically unacceptable in almost all cases… however, it is to be expected, and I suppose better now than later…

But yeah, I don't see why a software upgrade should require downtime at all.


Storj isn't a distributed network in the way that you expect, since the satellite is a centralized service and without it nothing works.

They are, but if you are switching the DB backend, then you will have difficulties doing that without any downtime.

Not if you build for failure… Assume everything should be replaceable while the system is running…

I know the internet and internet services are still fairly new… but it's like saying your electricity just stops from time to time… That would be unacceptable, at least once the technology is fully mature, or once people have engineered its structure to account for individual system failures and downtime.

My lights may flicker if somebody draws a lot of power, because the grid has to adjust… Remember, our electrical network needs to adjust its production to match demand…

And still to this day, that is mainly done by changing production… When you turn on a light bulb, a dam somewhere opens its gates just a tiny, tiny bit more… or something like that. Of course, it might take 1,000 light bulbs to make a measurable difference, but you get the idea…

And yet that symphony of national/continental power distribution sometimes runs for years and years, most likely even decades these days in some regions, without even a tenth of a second of downtime…

Not because it is easy… but because it has been developed over decades of fixing faults in its construction… However, many of those faults can be predicted by logical deduction, and thus fixed before they become an issue…

Internet/cloud services are just the latest form of "electricity"… just like cell phones were in the past… And when one looks back, there is a lot to learn from their development, how and why they failed, and how to avoid problems with similar infrastructure…

But yeah, technological history is a bit of a hobby of mine… so :smiley: maybe I just see things in a different light.

I'm sure they will get the hang of it eventually… and if not, then somebody else will… There isn't room for many electricity companies these days…

To use your metaphor, what happened here was we replaced multiple lightbulbs (Satellites) at the exact same time. So the network/electricity grid continued to run, but some lights were off (we did maintenance on 3 satellites out of 6). One thing we hope to do later this year is get more people running their own Satellites. We run 6 right now (to make sure we don’t start depending on any specific one) but hope to expand to more Satellite operators soon!

Somewhere else (was it this thread?) @BrightSilence recently referenced this great blog post that talks more about our strategy: The Electric Car Example Applied to Decentralized Cloud Storage

It's worth pointing out that going forward, we don't intend to take Satellites down like this any longer. As I mentioned earlier, this was one last pre-production holdout that we finally took care of. We absolutely could have done this final database migration without downtime per Satellite, of course; companies like Google do that kind of thing all the time. But it would have taken significantly more engineering work for this one-time operation. We commit to doing that engineering work going forward.
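For anyone curious what "without downtime" typically involves: one common approach is the expand/contract (dual-write) pattern, where you write to both the old and the new backend while backfilling historical data, flip reads over once the two sides match, and only then retire the old backend. The sketch below is a generic illustration of that pattern, not our actual migration code; the `Store` interface and `ReadFromNew` flag are hypothetical.

```go
// Package migration sketches the expand/contract (dual-write) pattern for
// swapping a database backend without downtime. This is a generic
// illustration, not Storj's code; Store and ReadFromNew are hypothetical.
package migration

import "context"

// Store is whatever record store the service reads and writes.
type Store interface {
	Put(ctx context.Context, key string, value []byte) error
	Get(ctx context.Context, key string) ([]byte, error)
}

// MigratingStore writes every record to both backends and reads from
// whichever side the rollout flag points at, so the final cutover is a
// configuration flip rather than a maintenance window.
type MigratingStore struct {
	Old, New    Store
	ReadFromNew bool // flipped once the backfill of historical rows is verified
}

// Put dual-writes: the new backend must succeed, and the old one is kept in
// sync so the service can roll back instantly if the new backend misbehaves.
func (m *MigratingStore) Put(ctx context.Context, key string, value []byte) error {
	if err := m.New.Put(ctx, key, value); err != nil {
		return err
	}
	return m.Old.Put(ctx, key, value)
}

// Get reads from whichever backend the rollout flag selects.
func (m *MigratingStore) Get(ctx context.Context, key string) ([]byte, error) {
	if m.ReadFromNew {
		return m.New.Get(ctx, key)
	}
	return m.Old.Get(ctx, key)
}
```

The trade-off is running and reconciling two backends for a while, which is exactly the extra engineering work referenced above.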
