The SLA is not met :)

anon68609175 · June 21, 2020, 10:22am

Fine, total time offline is more than 12h for 2020.
Current uptime over the last 30 days is less than 99.95%. How can we trust Tardigrade?

anon68609175 · June 21, 2020, 10:33am

Currently offline too

BrightSilence · June 21, 2020, 10:34am

Those uptime numbers cited usually exclude scheduled maintenance like today.

Though the scheduled maintenance is supposedly done according to the status page…

But satellites still seem to be offline.

anon68609175 · June 21, 2020, 10:35am

Yes, but total maintenance is already 12h. I gave links to the 4h+4h+4h windows.

anon68609175 · June 21, 2020, 10:43am

Uptime lower than 99%…
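
For reference, the figures being argued about can be checked with a few lines of arithmetic. This is only a minimal sketch: the function names are made up for illustration, and it counts every hour of downtime, scheduled or not.

```python
def availability_pct(period_hours: float, downtime_hours: float) -> float:
    """Percentage of the period during which the service was up."""
    return 100.0 * (period_hours - downtime_hours) / period_hours


def allowed_downtime_minutes(period_hours: float, target_pct: float) -> float:
    """Downtime budget, in minutes, that still meets the target percentage."""
    return period_hours * 60.0 * (100.0 - target_pct) / 100.0


THIRTY_DAYS = 30 * 24  # hours

# A 99.95% target over 30 days leaves a budget of roughly 21.6 minutes.
print(round(allowed_downtime_minutes(THIRTY_DAYS, 99.95), 1))

# A single 4-hour window inside a 30-day period already drops below 99.95%.
print(round(availability_pct(THIRTY_DAYS, 4), 2))

# 12 hours inside a 30-day period would be below 99%.
print(round(availability_pct(THIRTY_DAYS, 12), 2))
```

Whether a given outage counts against the published number depends on what the SLA excludes, so these results are an upper bound on the "damage", not the official figure.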

BrightSilence · June 21, 2020, 10:43am

I started monitoring them on January 22nd. I’ve not seen significant down time since production launch.
europe-west-1 has the “worst” track record.

The earlier downtime of more than an hour was prior to production launch. And this one was scheduled, so probably isn’t going to count against the 99.95% number.

I don’t know whether the January down time was scheduled as well.

Other customer facing satellites have had no down time between January 22nd and now.

anon68609175 · June 21, 2020, 10:47am

The strangest thing is: what idiot thought that complete unavailability of all services is normal?

Odmin · June 21, 2020, 10:52am

Clients do not care whether downtime is planned or unplanned; they lose service either way.
So, unfortunately, the SLA should include any downtime that impacts availability for clients.

Unfortunately, this monitoring is not correct: responding to “ping” and “port is open” doesn’t mean that the service is working.
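
To illustrate the difference, a check that exercises the service end to end could look roughly like the sketch below. The `ObjectStore` interface and the `InMemoryStore` stand-in are hypothetical and only there to make the example runnable; they are not the real Tardigrade or uplink API.

```python
import os
from typing import Protocol


class ObjectStore(Protocol):
    """Hypothetical client interface; not the actual Tardigrade/uplink API."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...
    def delete(self, key: str) -> None: ...


def round_trip_ok(store: ObjectStore, key: str = "healthcheck/probe") -> bool:
    """Return True only if a small object can be written, read back, and removed."""
    payload = os.urandom(32)  # fresh content so a stale copy cannot pass the check
    try:
        store.put(key, payload)
        ok = store.get(key) == payload
        store.delete(key)
        return ok
    except Exception:
        # Any failure (timeout, auth error, server error) counts as "down".
        return False


class InMemoryStore:
    """Stand-in used only to exercise the check locally."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def delete(self, key: str) -> None:
        del self._objects[key]


print(round_trip_ok(InMemoryStore()))
```

A probe like this fails whenever uploads or downloads fail, even if the host still answers pings and accepts TCP connections.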

Odmin · June 21, 2020, 10:58am

I also see something strange in the notification system:

got mail with:

but status page:

BrightSilence · June 21, 2020, 11:04am

I wasn’t making a judgement, just telling how it probably is. In my experience scheduled down time is excluded from these uptime numbers.

Agreed, best I have for now, but I’ve not seen any complaints about the service not working either.

It looks like they started a new maintenance window. The old one probably expired automatically at 10:00 UTC.

Odmin · June 21, 2020, 11:11am

I also agree with you, any situation is possible; nobody has seen the SLA from Storj Labs…
(remember, the devil is in the details!)

In my experience, if the agreement includes scheduled downtime and that time is excluded from the uptime figure, it should also be limited and should have planned “maintenance windows”.

BrightSilence · June 21, 2020, 11:20am

9. Service Level Agreement (“SLA”)
a. Company will use commercially reasonable efforts to meet the following service level commitment: except for scheduled maintenance, the Storage Services will be available 99.95% of the time. We calculate availability based upon the service records we maintain. We will use reasonable efforts to notify you in advance of any scheduled maintenance.
i. Our SLA obligations do not extend to any unavailability of the Storage Services that is caused by: (i) any hardware or software that you use in connection with the Storage Services; (ii) misuse of our Storage Services, including use in breach of the Agreement or use other than in accordance with any content or Documentation or other instructions provided by Company; (iii) circumstances or events beyond the reasonable control of Company; (iv) maintenance or scheduled downtime; or (v) our suspension or termination of your access to the Storage Services pursuant to the rights we have reserved under the Storj agreements.

Scheduled Downtime. Scheduled Downtime will generally occur during the Maintenance Windows. Company will endeavor to provide notice at least eight hours in advance of any scheduled downtime occurring outside of the Maintenance Windows.

Maintenance Windows. Company has an optional weekly maintenance window on Sundays from 2:00 a.m. EST/EDT to 6:00 a.m. EST/EDT during which scheduled maintenance, upgrades and repairs can occur.

Usage of the maintenance window is scheduled according to the Company release calendar.

Company may also perform emergency maintenance in a non-standard maintenance window.

1. Company will use commercially reasonable efforts to perform emergency maintenance at the time of lowest use levels, as determined by web use logs from the previous month.
2. Emergency maintenance windows will last no longer than four (4) hours.
3. Company reserves the right to use two (2) emergency (non-scheduled) maintenance windows per year. Emergency maintenance beyond these two (2) additional windows will be considered downtime.
4. Company will inform User about all relevant changes planned for the upcoming maintenance window no less than one (1) week prior to the maintenance window.

This downtime was announced several weeks in advance.

I’m not saying this is the perfect way to do it, but they’re following their own SLA and not breaking it like the topic title suggests.
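
As a rough illustration of the weekly window quoted above (Sundays, 2:00 a.m. to 6:00 a.m. Eastern), here is a small sketch that checks whether a given UTC time falls inside it. The names are made up for the example, and it assumes a fixed UTC-4 offset (EDT), which is only valid for part of the year.

```python
from datetime import datetime, time, timedelta, timezone

# Assumption: a fixed UTC-4 offset (EDT), which applies in June. A year-round
# check would need real time zone rules to handle the EST/EDT switch.
EASTERN_DAYLIGHT = timezone(timedelta(hours=-4))

WINDOW_START = time(2, 0)
WINDOW_END = time(6, 0)
SUNDAY = 6  # datetime.weekday(): Monday is 0, Sunday is 6


def in_weekly_window(moment_utc: datetime) -> bool:
    """True if the moment falls in the Sunday 2:00-6:00 a.m. Eastern window."""
    local = moment_utc.astimezone(EASTERN_DAYLIGHT)
    return local.weekday() == SUNDAY and WINDOW_START <= local.time() < WINDOW_END


# June 21, 2020 was a Sunday; under EDT the window maps to 06:00-10:00 UTC.
print(in_weekly_window(datetime(2020, 6, 21, 9, 59, tzinfo=timezone.utc)))   # inside
print(in_weekly_window(datetime(2020, 6, 21, 10, 22, tzinfo=timezone.utc)))  # after the window
```

Anything that runs past the end of that window would have to be covered by one of the emergency windows, or else count as downtime.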

anon68609175 · June 21, 2020, 11:31am

In the ToS, scheduled downtime shall not be more than 12 hours per year.

jtolio · June 21, 2020, 4:48pm

Hey friends!

I just wanted to jump in now that I’m awake again to add a bit of context about this specific scheduled downtime and what we expect going forward!

First off, I need to reread the SLA, but the intention was that we have two different types of downtime.

One is 4 hours on Sunday morning, 2-6am eastern US time, which requires a week or more of advance notice. I understand and agree with the point that the customer doesn’t care if it’s scheduled or not, so we will be working to eliminate these, but as it stands, 2-6am eastern US time downtime, if notified in advance, does not count against the current SLA.

The second type, which we also used today, is emergency downtime, of which we have a small maximum per year. I need to confirm, but I think we ate half of our entire yearly budget, due to the migration running long.

This was an exceptional migration that we don’t expect to happen again going forward. Our plan prior to production was to use Cockroach DB for object metadata, but due to time constraints we did not make that migration happen prior to production. All object metadata is now on Cockroach DB, which is what we intended from long before production launch, and we do not expect to change this again going forward.

Thank you all for your patience with this migration! We’re really excited by the new performance and scalability characteristics now available to Satellites with the new Cockroach DB backend. It will be a much more scalable service going forward!

anon68609175 · June 21, 2020, 4:56pm

For example, Google Cloud does maintenance in such a way that end users don’t even notice interruptions.
Why can’t you do maintenance without user impact? As I understand it, bad architecture is the cause. Everything here is bad, though: even on SSDs the DB locks up because of a bad structure.

Bla-bla-bla… In normal companies, there are generally closed test servers for “try this, try that…”. This is PRODUCTION, you CAN’T be so irresponsible. It seems to me that you have no clients because of your bad attitude, and new ones don’t want to come.

BrightSilence · June 21, 2020, 5:15pm

Thanks for the additional information. I noticed it running long, that’s unfortunate. One tip: in such cases, make sure emails saying the service is back online don’t go out at the end of the planned maintenance window. This caused a bit of confusion. It would be nice to get an email about the maintenance running long at that moment instead.

nerdatwork · June 21, 2020, 5:28pm

Wouldn’t getting an email about maintenance being complete mean service is back up?

BrightSilence · June 21, 2020, 5:29pm

Yes, but it wasn’t. That was the problem. The mail went out when the planned window ended, but before the work was done.

peem · June 21, 2020, 7:06pm

But taking several satellites out of service at the same time? This is very much like centralization…

kevink · June 21, 2020, 9:20pm

Where’s the difference? Your data is only accessible through one satellite anyway.