Suggestion for a testing-in-production method

Hi Storj, I have an idea for identifying potential problems more quickly in the future. I posted this in Current situation with garbage collection - #246 by BrightSilence, but I'll repost it here for better visibility.

I used to be a SWE, and the hardest thing in software testing (I think) is limiting the unknown variables that could interfere with the result. If something keeps changing between test runs, you can never get a reliable result and might never be able to find the root cause.

As for my credibility: I used to report bugs to the CockroachDB project (Issues · cockroachdb/cockroach · GitHub). Some nasty bugs couldn't be fixed until I sent them my dataset from a particular point in time, because one day a SQL query would throw a panic and the next day the same query would run just fine. To fix that panic, they needed a backup that could reproduce the bug so they could investigate further.

I've also worked as a DevOps engineer, so I know the pain of supporting SWEs trying to reproduce a bug: I can't let them run arbitrary queries on production, but I also don't want to block them from reproducing the bug.

Upon seeing the problem with GC, I'd love to hear your thoughts on a new approach to testing in production I've come up with:

  • Storj should run a few nodes on a filesystem that supports snapshots, so you could travel back to the particular point in time when the issue was visible and lock it in place.
  • A CockroachDB backup can be restored to a particular point in time, and it can also be locked in place.

With that amount of control over the environment, I think it becomes substantially easier to reproduce bugs that happen in the future.

To make it even easier for developers, the DevOps team could build a button: when clicked, a process would automatically restore both the filesystem and the database and hand them to the developer, ready to use.
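The steps above could be sketched as a small script. This is only a minimal illustration of the idea, in demo mode: the dataset, snapshot label, database name, and backup location are all hypothetical, and the CockroachDB restore assumes the backup was taken with revision history.

```shell
#!/bin/sh
# Hypothetical "one button" restore: roll a node's filesystem and its
# CockroachDB back to the same moment. All names below are made up.
set -eu

DRY_RUN=1                               # demo mode: print commands instead of executing
SNAP="incident-2024-06-01"              # hypothetical snapshot label
DATASET="tank/storagenode"              # hypothetical ZFS dataset holding the node's data

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# 1. Roll the node's filesystem back to the moment the issue was visible.
run zfs rollback -r "$DATASET@$SNAP"

# 2. Restore the database to the same moment (point-in-time restore needs a
#    backup taken WITH revision_history).
run cockroach sql -e "RESTORE DATABASE storj FROM LATEST IN 'nodelocal://1/backups' AS OF SYSTEM TIME '2024-06-01 00:00:00';"
```

In dry-run mode this only prints what it would do; flipping DRY_RUN to 0 would of course require a real ZFS pool and a real CockroachDB backup.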

Thanks for reading through this!


Thank you for your idea!
We use the storj-up tool in our tests, and yes, with CockroachDB in particular.
So a snapshot is likely not needed: you can always do

docker compose down


docker compose up -d

down the road. However, this is usually a manual test. We have automatic tests of course, but they may not capture some bugs.

You may also want to learn what this tool is capable of (it's incredible, in my opinion as a DevOps engineer): it can help you build a Storj network in a few commands directly on your host, using either pre-built images or ones you build yourself!

Hi there, let's clear a few things up:

  1. storj/up was not getting the love it needed; I basically only see dlamarmorgan using it frequently (could you confirm whether I'm wrong? I hope I am).
  2. If this is a manual test, it's not good enough. If you need to manually feed data into the network, it's a very expensive way to test. Every time you need to test something, you have to spin up the network, come up with a hypothesis about where things could go wrong, write the scenario to test, run it, and then tear the cluster down. With that amount of work, I think the (possible) misconception here is that using storj/up is cheaper than running a real node. I'd disagree: running a real node is much, much cheaper for this kind of testing. Do you agree or disagree?
  3. Some nasty bugs only start to emerge over a long run, and this is the key point: you may have exhausted the list of bugs detectable by running for a few hours, but do you agree that, given the dynamics of new satellite/node versions and power on/off cycles, this kind of test is only useful if you already know what to test, and that it will hardly discover edge cases?

I mean, think about the idea of freezing the real state of a Storj node and the other components in real time, on demand, with one button. Isn't that much cheaper to test with, and wouldn't it of course reveal much more? Let me hear your thoughts if you're hesitant.

This should be an automatic test as far as I know, but it is also widely used by our developers and QA engineers to test things.
I wouldn't say that all tests are manual; however, some manual ones probably exist (I'm not involved enough to confirm).

It's inexpensive because you run the whole Storj network locally on your machine (or in the runner, if it's an automatic test); it doesn't use the public or Storj Select networks. However, we have some tests on a public network too (see the Saltlake satellite).

This is true. However, I do not think we perform such tests, except for manual tests by QA engineers.

This is very simple with storj-up, again. You can simply not execute docker compose down between updates. You may also save a snapshot with the docker commit command at any time. So we have this ability too, without creating VM or filesystem snapshots (that would require a specific FS, but our team is varied enough, with different OSes and platforms like ARM, x86, and also Apple's "special" ARM-like chips, to find such problems).
However, I like that you are interested in a deep dive into the CI/CD processes in our company!
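As a sketch of that `docker commit` snapshot workflow (the container and image names here are made up for illustration, and the script runs in demo mode):

```shell
#!/bin/sh
# Freeze a running storj-up container into an image, then start a
# throwaway copy later to reproduce a bug. Names are hypothetical.
DRY_RUN=1                                   # demo mode: print instead of execute

snap() {
  if [ "$DRY_RUN" = "1" ]; then echo "would run: $*"; else "$@"; fi
}

# Freeze the current state of a node container into a tagged image.
snap docker commit storagenode1 debug/storagenode1:gc-repro

# Later: spin up a disposable container from the frozen state.
snap docker run --rm -it debug/storagenode1:gc-repro
```

One caveat worth noting: `docker commit` captures the container's filesystem layers but not data in mounted volumes, so for a node whose blobs live on a volume this is only a partial snapshot.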

I'm glad we're in disagreement; without that we can't come to a conclusion. How about this: show this topic to some of the core engineers; it won't take them much time to read. If the team concludes it's not a fit, then I won't push it further. My view is that manually spinning up a cluster is more expensive and also doesn't cover enough real-world cases.

Again, this is also a misconception, number 2: your docker snapshots require engineering effort, whether running a command or clicking a button. It takes a lot of effort, and it also takes effort to remember what to do with them later on. I think they will spend the rest of their lives in some S3 bucket, never to be seen again, waiting for deletion (I've been an SWE, you know).

Of course I did that :slight_smile:
It's just that they won't respond outside business hours. I simply expressed my opinions as a DevOps engineer.
This is an oxymoron, because DevOps is not a title but a mindset and a methodology… but well, we are stuck with the HR vision of it…

It's not a cluster; it's a simple docker compose up -d command (and, as a DevOps engineer, I would never suggest using this as a production solution, including Docker Swarm, sorry!).

I have to agree. But I believe we do not have a long-lived small cluster for test purposes except Saltlake (which is public, by the way), so this is why I passed your suggestion to the team.

This is easily solved with

uplink rm --recursive --encrypted --parallelism 30 sj://my-bucket

Then you may remove the bucket without a timeout:

uplink rb --force sj://my-bucket

Hi @Alexey, I bet the team is super busy with the current GC issue, but this is an investment, and it could take as little as two months to start reaping the benefits (I hope). If you keep chasing bug after bug and don't invest, then the next time you need the freeze-the-world ability, it won't be there… Has the team thought about this and come to a conclusion yet, so I can mark a solution and move on?

I'm sure they prioritize which features should be implemented first; thank you for your suggestions!
I also know that we have a testing pipeline as well, but it's always good to have ideas for improvement, thanks!

I mean, it's kind of clear where the priorities are: new features, or dare I say, Storj's priorities. If it weren't for SNO pressure, I doubt the GC issue would have been pushed into being fixed.

It's a shame that the correctness of the software is not a priority; I hope it won't become a Cerberus in the future because of many layers of bugs.

And I sympathize with you and the developers. As a bridge to the community, you get pushed from both sides; I hope you can take the SNOs' side a bit more, though…

Developers are already implementing the GC fix; this has been discussed extensively in other threads, for example here.

So I don't know why you insinuate they are not giving this issue the priority it deserves. You can follow the progress on GitHub; for example, check the related commits included in the last few releases, which are also posted here on the forum.


The issue has been smoking since December 2023, not since the link from a month ago that you gave (that is just when the developers became fully aware of it).

The early suggestion from a developer was this: Disk usage discrepancy? - #187 by elek. But let me ask you this: is there a major effort to ensure byte-by-byte storage size correctness for SNOs? I mean this as a genuine question. Compare that effort to the effort spent ensuring customer data is free of corruption…

I'm going to try to bring this conversation together a little bit. I think the testing methodology Storj currently uses works for detecting whether functionality behaves as intended on a small-scale network, but that leaves a gap. Some issues only arise under large-capacity or heavy-load scenarios. The GC issues are an example of this. You're never going to find this problem by running storj-up to verify functionality unless you also push production-scale amounts of pieces to that test setup as well. And that doesn't seem sustainable.

Now I do want to add one caveat to the argument that the GC issue was playing out since December 2023. That’s not entirely true. There are many reasons there can be a discrepancy between data reported by the node and data reported by the satellite and initially it certainly seemed to be mostly caused by issues with the local setup of the node. This also made it harder to realize when there actually did seem to be an issue with the actual GC process as it was hiding among other problems causing the same symptoms. Storj Labs responding to this based on community feedback is a blessing and a curse. On the one hand, they may not catch certain high capacity/high load related issues until node operators notice them in the wild. On the other hand, there is an awesome community here more than willing to report and help out with these issues and frequent attention from Storj devs on anything we have to report.

In the end, I've always felt heard when I had a serious concern, and those concerns have always been addressed. That doesn't mean testing can't be improved; from what I can tell, it can be. And getting better testing processes in place to test all components under production-like scale might be a gap that needs to be filled. But in the meantime, I do appreciate the open communication and the way Storj works with the community to resolve issues.


To be honest, there are several issues with GC and the filewalkers and how they were designed.
And we saw long before December 2023 that something did not fit:


Today we know why… These nodes are only now, these very days, getting cleared of their waste, after over a year of not getting paid accordingly and not getting ingress while only half full in reality…
But the lack of easy tools to investigate and monitor made it impossible to sort out.
That's the problem when you make it hard for SNOs and force them to dig into logs, APIs, and crazy debug information: then it is not easy to investigate problems either.
If the dashboard showed what the satellite thinks the node should have and what the node thinks it has, we would have been able to spot the problems much earlier, well before December 2023.

This is called software development. There is no such thing as an easy shortcut. You can't press one button and magically have all the bugs reveal themselves. If it were that easy, they wouldn't pay me to do this job…

Take the recent 1024 vs 1000 bug, for example. Is that really a problem of tooling? It could have been spotted with the help of a calculator, and still it sat in the dashboard for a year or so. Sometimes the simple bugs are the hardest to spot, because nobody expects them in the first place.
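The calculator check is indeed trivial: a decimal terabyte and a binary tebibyte differ by almost 10%, which is roughly the size of the discrepancy a dashboard mixing the two units would show.

```shell
#!/bin/sh
# The arithmetic behind the 1000 vs 1024 confusion.
tb=$((1000 * 1000 * 1000 * 1000))       # 1 TB  (decimal, SI)
tib=$((1024 * 1024 * 1024 * 1024))      # 1 TiB (binary, IEC)
echo "1 TB  = $tb bytes"
echo "1 TiB = $tib bytes"
# Whole-percent difference between the two units:
echo "difference: $(( (tib - tb) * 100 / tb ))%"   # prints: difference: 9%
```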


That's true to some extent, but imagine this: if your livelihood came not from a salary but solely from running SNO nodes, would you see this issue differently? You see where I'm getting at?

"Let's see if anyone complains" versus "I'll make sure no one can complain, because this is also my business": those are two different mindsets, two different priorities. I'd really, really appreciate putting oneself in the shoes of SNOs.

Have a nice day, gentlemen.

And let's put this into another perspective: this issue already looks like civil unrest. It would hurt and set back Storj if too many issues pop up in the future. So this is not only for the benefit of SNOs, but for Storj's own benefit as well.

That 1000 vs 1024 bug has a screenshot from my storage node. So much for that topic…


I can't help picturing something like this…

(The O in SNO stands for operator)

There will always be things that get missed. But I generally agree with this approach. However, it is a balancing game: spend too much time testing and development slows down significantly too. This is why I also have a test node running; at least that way I know my systems can run the new version before it rolls out. But again, that test is small scale, and I won't catch everything before my nodes migrate either.

And I can't help but think the current approach also means there is a lot of back and forth between SNOs and Storj, which I personally enjoy a lot. Then again, it's a hobby with a modestly decent return for me. I don't depend on node income, but I also haven't seen many, or even any, SNOs on this forum who do.
