Production readiness?

Krey · January 23, 2020, 1:18pm

Dear StorjLabs and Community,
I write this message from my heart, trying to express the opinion of many. To everyone who agrees with my message: please mark it with a heart, or write your own opinion if you do not agree.

It looks like StorjLabs is hurrying too much to reach production state in January and keep to the schedule, forgetting their own words that there is only one chance to do it the right way. There is no second chance to build the network.

The closer we get to the product release, the stronger the feeling that the project is raw and not ready to launch yet. Very serious changes have been made in the last few months, and many of them are not sufficiently tested.
Perhaps the most important thing is that most SNOs are unable to test the network by uploading and downloading their own files. As for those who have received invites, their reviews are often not very happy.

There is no way to look at an SNO’s personal escrow and other statistics gathered by the satellites and compare them with our own calculations (on the SNO side, via scripts, etc.). One of the indicators of node health (uptime) has just been removed without providing an alternative.

We read about qualification gates, but these numbers are not available to us, and they cannot be verified, since you turned off all third-party monitoring, such as the site storjnet.info, without providing anything in its place.

Things like this make us worry about the success of the first launch and of the project in general at this stage. It seems to us that it is worth spending a little more time working through the issues that matter for the overall operability of the network, for the development backlog, for testing, and for us SNOs, so that must-have functionality gets implemented. Let us test the project by uploading and downloading our own files to the network. Collect our feedback. We are afraid that v3 will turn into a v2.2 in which the network consists exclusively of test traffic and data, because customers will be unable to download their data due to insufficient testing and a lack of feedback. Or for a dozen new, as yet unknown reasons.

Personally, as a lover of computer history, I have seen too many failed projects, both because of rushing and because of delay and over-engineering. It is important to maintain balance. Perhaps a few additional months to polish, fix, and test the current solution and to create the missing important tools will benefit the project.

jocelyn · January 23, 2020, 1:31pm

Hi Krey, first, thank you for the heartfelt post. I have a few thoughts. Please bear with me, as I’m a slow typist.

jocelyn · January 23, 2020, 2:16pm

Ok, here are my thoughts as I’m reading. There will be some gaps in my reply, because there are some answers that I’d prefer to check with my colleagues. So I can expand on this later as conversations develop.

I agree with you that there is only one chance to make a first impression. We’ve tried to prioritize things in a way that honors that. We’ve had to work on functionality first. However, I (and others here) agree with you that polishing is also a very important step. As I’m sure you know, a little bit of the polishing can sometimes happen as you’re doing functionality. But usually it’s functionality first. That doesn’t make polish less important, it just has to do with the process. (I’m sure you already know this, because I can tell from your writing that you’re at a proficient level, so forgive me if I state the obvious.)

Regarding qualifications: I’m currently in the midst of all the planning and logistics for the Jan 31st town hall. And I know that there is a slide in the deck talking about qualification gates. Unfortunately, I can’t comment on the specific values under each column ahead of time, though.

UX/Product Design have been doing testing with SNOs and bringing those insights back to engineering. Video diaries, surveys, and stats are looked at, and we actually have weekly meetings on the topic. I think we can all agree that results are what counts. But to get those results there is a lot of work that happens internally that isn’t always visible to the community.
We’re pushing towards more transparency in that regard, which is the reason we’re talking about (for example) moving some of our Slack channels into the forum, so that people understand our process, how we make decisions, and what tradeoffs we make along the way.

This has got pretty long! Sorry! I also know that there are some points you brought up that I didn’t even get to yet. That’s not because I don’t want to answer, it’s just because I don’t want to inadvertently give an improper answer. And our office isn’t open yet, so I can’t run it by my colleagues til the building’s unlocked lol

At the risk of sounding too sales-y, I really hope you can tune in for town hall, because I think it will address some of the excellent points you bring up.

I can work on distilling some of your questions here as I understand them into the Q&A section of the town hall as well, and circulate your topic internally.

Again, I super appreciate your care and thought on this topic. And that you’re opening it up to discussion in such a productive manner. Thank you.

will.topping · January 23, 2020, 2:19pm

Look forward to this being addressed in the town hall @jocelyn

jocelyn · January 23, 2020, 2:20pm

@Krey sorry, I told you I am a slow typist!
I also want to talk with some of my colleagues. (It’s still dark out here, so very few folks are around!) And I’d rather give you a better answer that takes a bit more time than risk muddying the conversation with a hasty response. I hope my first draft, humble as it may be, is at least somewhat helpful.

littleskunk · January 23, 2020, 2:34pm

I agree that we have only limited test capacity and we have to focus on the critical parts. In the last few weeks I started to publish some of the “low priority” tests. The test results are great and very useful for us. Can you tell me which parts you are worried about? I should be able to write down a high level test plan and together we can execute the tests.

You can always run tests with storj-sim even if you don’t have an account on a tardigrade satellite. I would love to see the same review with the new version that should go out today. It will not be the last improvement before we go into production. If it helps, I am also happy to post a summary of which issues we currently have, some background information to explain why they are happening, and finally how we fixed them. It might be time to be a bit more transparent on that. Yesterday I only dropped a short “it will get fixed with v0.30.5” but didn’t explain any background information. Just ask me if you are interested and I will write down all the details you want to know.

As a storage node operator I agree, and every day I try to convince the developer team that this would help keep the storage nodes on board, along with a few other side effects. On the other hand, the question is: should we postpone onboarding customers because of that? Does a customer need this feature? I can understand that argument as well.

At the moment uptime is not important at all and in fact it was and still is broken. I don’t think there is any use in showing incorrect numbers. The real question here should be when do we fix downtime tracking in order to show something useful.
As an alternative I would recommend a simple uptime robot that pings the storage node port from time to time. Everyone should have that even if we would show uptime numbers on the dashboard.
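A minimal sketch of such a check, assuming the node is reachable over plain TCP. The function name `port_is_reachable` is made up for illustration, and the host and port you pass in are whatever your own node uses; nothing here comes from the Storj codebase. It only tells you that something accepts connections on the port, not that the node is healthy.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper for illustration: it checks reachability only and
    knows nothing about the storage node protocol itself.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

Run it from a machine outside your own network (for example from cron every few minutes) and send yourself an alert when it returns `False`; a check from inside the same LAN would not notice a broken port forward.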

This is the first time I have to disagree. storjnet.info was showing private information like my wallet address. No third party tool should be allowed to collect any private information about my storage node. I have to opt out and I am happy that storj has fixed it.

Overall, thank you for this feedback. Let’s keep this discussion rolling.

Krey · January 23, 2020, 2:55pm

It showed that information because it could, which means anyone could obtain it.
I think it would have been worthwhile to close off only information of this type, not everything.

littleskunk · January 23, 2020, 3:07pm

You can read about that part here: Design Draft: Removing Kademlia

kevink · January 23, 2020, 3:12pm

Another thing I am worried about after reading most posts on this forum is 100% AWS S3 compatibility, which currently doesn’t seem to be the case. This (imho) has to be properly implemented and tested before onboarding any customers, as it could result in a bad customer experience and in permanently losing that customer.
Also, some people have reported problems with upload, download, and delete speeds.

From what I read, it doesn’t look to me like STORJ will be production ready within the next month; it will need that time to properly test everything.
Of course there will be bugfixes and improvements along the way after production launch that are not that critical to customers, like a better SNOboard etc.

Krey · January 23, 2020, 3:13pm

Yes, I know why you are removing Kademlia and I do not argue with this, but that happened a little later than the closure of unauthorized connections.

I just suggest thinking about opening some kind of interface for third-party audits on a common port, and regularly posting network statistics gathered from the satellite(s) somewhere.

Krey · January 23, 2020, 3:23pm

My poor English does not allow me to participate in online conferences. I even read this topic much more slowly than you write.
But I’m sure there will be a lot of colleagues with better knowledge of English who are also concerned about the topic.

littleskunk · January 23, 2020, 3:34pm

Where did you read that? My information is that we support the basic S3 functions, but not all of them. If you can point me to some official documents, then I will start to file a lot of bugs.
We have tests in place to verify the basic s3 functions. We can dive into it if you like.

Again please retry as soon as v0.30.5 is released today. It should improve listing and delete performance. As a side effect I would also expect fast uploads and downloads.

I am not allowed to talk about any dates, but why do you think we need time to test everything? What exactly should we test? I have been testing the whole system for a full year now, and I am not planning to stop just because we call it a production release at some point.

littleskunk · January 23, 2020, 3:39pm

You can. The storage node dashboard API contains a lot of information. You could send this information to a third-party app if you like. I prefer to opt out and don’t want to send any information to third-party apps.

How about the payouts on the Ethereum blockchain? You can collect this information and get a good picture of the network that way. You can even see which satellite has the highest payout.

I understand that you want to double-check the numbers Storj will post in the town hall. I am doing that myself. However, it is not a requirement for the production release.

kevink · January 23, 2020, 3:46pm

Some problems like this: Feedback using Tardigrade as normal User - #2 by jocelyn

Will do, but this is exactly why I think STORJ needs more time to properly test everything. Some bugs don’t reveal themselves easily, and some bugfixes introduce new bugs (like one I reported on GitHub where nodes keep getting data although they are full).
Give developers and users more time to experiment and to use STORJ in third-party tools like Duplicati etc., to make sure there are no bugs in the common use cases that customers will likely choose.

littleskunk · January 23, 2020, 3:54pm

That is not an official document and I can’t use it to create any issues. I will follow up in that thread and try to get more information on what exactly didn’t work.

I don’t understand that part. The fact that we deploy a fix today that was created a week ago should be the proof that we are testing. I don’t understand which additional tests you would like to see here.

kevink · January 23, 2020, 4:01pm

No, sorry, I can’t point you to any official document. I can only say that one would expect this to work if STORJ is compatible with S3.

I am not saying that you do not test; you do test. However, some bugs don’t reveal themselves that quickly, or new ones get introduced by bugfixes. That is just the normal way software works, and it is why you have a beta phase.

Krey · January 23, 2020, 4:06pm

We have not seen survival tests in which one node, or even better all the nodes on one wallet, are loaded for 15-30 minutes at the full bandwidth of their connections. How many nodes throughout the network would fall off under such a load? The current load of 20-40 megabits does not say anything about the reliability of nodes and their network infrastructure.

I am concerned about the reliability of the storagenode process. There are situations when the storagenode process is active but does not respond to the network. That is not so important while disqualification by uptime is disabled. But a reasonable workaround (systemd notify calls on Linux, or an analogous watchdog on other OSes) would take little time to develop and would increase the reliability of the entire network.
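To make the systemd idea concrete, here is a rough sketch of the notify protocol. It is an illustration only, not storagenode code: `sd_notify` and `heartbeat` are names I made up, and `probe` stands for any check you trust (for example a TCP connect to the node's own port). With `WatchdogSec=` set in the unit, systemd expects a `WATCHDOG=1` message at regular intervals and treats the service as failed if the messages stop.

```python
import os
import socket
from typing import Callable


def sd_notify(message: str) -> bool:
    """Send one state message to systemd through $NOTIFY_SOCKET.

    Returns False when the variable is unset, i.e. when the process
    was not started by systemd with notification enabled.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), path)
    return True


def heartbeat(probe: Callable[[], bool]) -> bool:
    """Ping the watchdog only if `probe()` says the service answers on the network."""
    if probe():
        return sd_notify("WATCHDOG=1")
    # No ping is sent, so systemd's watchdog timer keeps running down.
    return False
```

The point of tying the ping to a network probe is that a process which is alive but deaf stops sending heartbeats, and systemd can then restart it (for example with `Restart=on-watchdog`). In a real integration these calls would live inside the storagenode process itself, which is my assumption about how it would be wired in.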

Vadim · January 23, 2020, 4:39pm

@Krey
My nodes have been producing 150 Mbit of egress for the last 23 days and are working fine. When I get my payout for this month, I will invest in better cooling.

jocelyn · January 24, 2020, 3:21am

@Krey after Town Hall, we will upload the recording to YouTube. YouTube will auto-generate a transcript. If I provide a transcript, will that be useful? Maybe we can put the transcript in Google Translate. Just an idea.

Pavmer · January 24, 2020, 7:48am

It will be very useful!
Thank you in advance…