Hi Storj, I have an idea for identifying potential problems more quickly in the future. I posted this in Current situation with garbage collection - #246 by BrightSilence, but I'm reposting it here for better visibility.
I used to be a SWE. The hardest thing in software testing (I think) is limiting the unknown variables that could interfere with the result. If something keeps changing between test runs, you can never get a reliable result and might never be able to find the root cause.
For some background, I used to report bugs to the CockroachDB database (Issues · cockroachdb/cockroach · GitHub). Some nasty bugs could not be fixed unless I sent them my dataset from a particular point in time, because one day the SQL query would panic and another day the same query would run just fine. To fix that panic, I needed a backup that could reproduce the query bug so they could investigate further.
I've also worked as a DevOps engineer, so I know the pain of supporting SWEs who need to reproduce a bug: I can't let them run arbitrary queries on production, but I also don't want to block them from trying to reproduce the bug.
Having seen the problems with GC, I'd love to hear your thoughts on a new approach to testing in production that I've come up with:
- Storj should run a few nodes on a filesystem that supports snapshots, so you guys could travel back to the particular time when the issue was visible and lock it in place.
- A CockroachDB backup could be restored to a particular point in time, and it could also be locked in place.
With that amount of control over the variables in the environment, I think it would be substantially easier to reproduce bugs that could happen in the future.
To make it much easier for developers, the DevOps team could build a button that, when clicked, runs a process to automatically restore both the filesystem and the database and hands the ready environment to the developer.
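To make the idea a bit more concrete, here is a rough sketch of what the process behind that button might do. It is only an illustration under my own assumptions: I'm assuming ZFS as the snapshot-capable filesystem and a CockroachDB backup taken with revision history, and the dataset, clone, database, and backup names are all made up. It only builds the commands and does not run them; the exact `RESTORE` form may differ between CockroachDB versions.

```python
from datetime import datetime, timezone


def pick_snapshot(snapshots, incident_time):
    """Return the name of the newest snapshot taken at or before incident_time.

    `snapshots` maps a snapshot name to the (timezone-aware) time it was taken.
    """
    candidates = [
        (taken, name) for name, taken in snapshots.items() if taken <= incident_time
    ]
    if not candidates:
        raise ValueError("no snapshot at or before the incident time")
    # Tuples sort by time first, so max() gives the most recent candidate.
    return max(candidates)[1]


def build_restore_plan(snapshots, incident_time, dataset, clone_name, database, backup_uri):
    """Build (but do not execute) the filesystem and database restore steps."""
    snap = pick_snapshot(snapshots, incident_time)

    # Assumption: ZFS. Cloning leaves the original snapshot untouched,
    # so the state stays "locked in place" while a developer works on the clone.
    fs_cmd = ["zfs", "clone", f"{dataset}@{snap}", clone_name]

    # Assumption: the CockroachDB backup was taken with revision history,
    # which is what makes a point-in-time restore possible.
    ts = incident_time.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    sql = (
        f"RESTORE DATABASE {database} FROM LATEST IN '{backup_uri}' "
        f"AS OF SYSTEM TIME '{ts}';"
    )
    return fs_cmd, sql


if __name__ == "__main__":
    # Hypothetical daily snapshots and a hypothetical incident time.
    snaps = {
        "2024-05-01": datetime(2024, 5, 1, tzinfo=timezone.utc),
        "2024-05-02": datetime(2024, 5, 2, tzinfo=timezone.utc),
        "2024-05-03": datetime(2024, 5, 3, tzinfo=timezone.utc),
    }
    incident = datetime(2024, 5, 2, 15, 30, tzinfo=timezone.utc)
    fs_cmd, sql = build_restore_plan(
        snaps, incident, "tank/storagenode", "tank/repro", "satellite", "external://backups"
    )
    print(" ".join(fs_cmd))
    print(sql)
```

The filesystem side can only go back to the nearest snapshot before the incident, while the database side can target the exact timestamp, so how often snapshots are taken decides how closely the two line up.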
Thanks for reading through this!