Release preparation v1.40

We are targeting an open development process. One aspect of that is a more transparent release process.

We start with the list of all commits. Next we go through that list, pick out the important commits, and decide which release tests we want to execute.

python branch.py release-v1.39 main
[- e16c2c94] web/satellite: Route all child routes to object browser (#4206)
[- 4c5a18d4] web/: disable storj linter
[- b2d35aa2] Point release v1.39.1-rc (#4193)
[+ b2d72496] cmd/storagenode-updater: avoid depending on the storagenode code
[+ 6e660cec] Jenkinsfile: test cross-compile and bump deps
[+ 09e1ff7f] web/satellite: fix promise usages
[+ 252b7858] satellite/console: add status check to user authorization to ensure deleted accounts cannot perform actions
[+ 9153b221] testsuite/ui/satellite: updated pnboarding CLI tests and improved selectors
[+ 5e4b196b] satellite/metainfo: finish move object
[+ 32cee1e5] satellite/metabase/segmentloop: ensure we shutdown on ctx err
[+ 38366914] satellite/metainfo: add position to begin move results
[+ df09e7d1] satellite/metainfo: ensure storagenodes finish work for test
[+ 36911b44] satellite/accounting/tally: make tests faster
[+ 71eb184e] storagenode/piecestore: simplify TestTooManyRequests
[+ cae08d81] satellite/metabase: FinishMoveObject segment query improved
[+ 99914dfc] web/satellite: fix for button label uppercase issue
[+ 0e1c3cb8] docs: add contributing guide for SNO development
[+ 030ab669] web/satellite: Route all child routes to object browser
[+ 469ae72c] satellite/repair: update audit records during repair
[+ 7d90770f] .github: add issue templates
[+ 9da3de1a] web/satellite, testsuite/satellite/ui: removed onb CLI flow’s Generate AG step and updated tests
[+ 4fefa36a] satellite/metabase: NewBucket added to metabase/metainfo FinishMoveObject methods
[+ 0330871a] satellite/metabase: add missing FinishMoveObject parameter
[+ 7fd34669] ci: remove deleting workspace steps
[+ 80366444] ci: try fix builds
[+ 118e64fa] jenkinsfile: bump build timeout
[+ 1ed5db14] satellite/metainfo: simplifying limits code
[+ d397f6e2] docs: move contribution guide to repository root
[+ 40084d51] satellite/metainfo: reduce number of testplanet runs
[+ 512d0d61] satellite/metrics: speedup test
[+ 9c232ceb] ci: make deletion a separate step
[+ cc9b845a] ci: switch back to pulling gateway@main
[+ 4db80773] satellite/satellitedb: add burst_limit for project
[+ 6d3fd33c] satellite/metabase/segmentloop: start immediately on manual trigger
[+ ab7e81c2] satellite/accounting: update GetProjectBandwidthTotals function to use the less expensive project_bandwidth_rollups table
[+ c911360e] satellite/metainfo: separate burst limit from rate limit config
[+ a16aecfa] satellite/payments: specialized type for monetary amounts
[+ c053bdbd] satellite/satellitedb: prepare to remove big.Float from db
[+ 0d58172c] storagenode: add doc.go files for sno packages
[+ f52f5931] ci: don’t fail when nothing to delete
[+ 1def7b0e] web/satellite, testsuite/ui/satellite: added tests for invalid sign up credentials and satellites dropdown
[+ 8e3d7d30] web/satellite, testsuite/ui/satellite: added tests for onb CLI os tabs switching
[+ 46d1b4df] testsuite/ui/satellite: added test for restart onb tour functionality
[+ f829d64a] web/satellite, testsuite/ui/satellite: updated onb Welcome screen’s encryption info and added tests
[+ d5043a0f] web/storagenode: dcs word removed from gui
[+ ead310d3] satellite/satellitedb: avoid running migrate tests concurrently
[+ 5b661363] cmd/uplinkng: fix mb command
[+ 3dbd4434] web/satellite: don’t ignore Vue errors and warnings
[+ 0209bc6e] cmd/uplink: add mv command
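
For anyone curious how the list above is produced: I haven't checked what branch.py does internally, but the +/- markers look a lot like the output of git cherry, so a rough plain-git equivalent would be something like this, where + marks commits that only exist on main and - marks commits whose change is already on the release branch:

git cherry -v release-v1.39 main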


Nice! More transparency = better :+1:


Based on the list of commits we decided to test the following things.

Server Side Move
As far as we know, the new uplink binary should now have a move command. That should allow us to rename a file, rename a folder, or move a file between folders. Moving a file into a different bucket shouldn't work, but let's test that as well just to make sure the user gets a meaningful error message back.
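
A quick smoke test could look something like this (bucket and object names are just placeholders, and the exact syntax of the released binary may differ slightly):

uplink mv sj://demo-bucket/reports/2021.pdf sj://demo-bucket/archive/2021.pdf
uplink mv sj://demo-bucket/archive/2021.pdf sj://other-bucket/2021.pdf

The first command should succeed as a move inside the bucket, the second one should fail with a clear error message.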

Audit Penalties for Repair Failures
The audit job checks a random 1 KB stripe out of a piece, while the repair job downloads the entire piece. For that reason, it is possible that the repair job detects a corrupted piece while the audit job believes it is all fine. We decided to let the repair job also update the audit score. The corresponding commit contains a lot of automated tests, but we don't have an integration test yet. The risk here is that the repair job might update but not persist the audit score, or some other fancy combination. → We will run a quick test in storj-sim.
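
The quick test boils down to spinning up a local network, making one node fail a repair download, and checking that the audit score actually changes and is persisted on the satellite. Something along these lines, assuming a working storj-sim setup:

storj-sim network setup
storj-sim network run

Then upload a file, corrupt or delete one of its pieces on a local storage node, trigger a repair, and verify that the node's audit score moved on the satellite.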

Storage Node Updater
It is just a small code change and should have no effect at all. However, a mistake here can have a huge impact and risk the stability of the entire network. → Same deal. A quick test to make sure the updater has the same behavior as before.

Satellite UI File Browser
The user growth team is currently working on a new onboarding workflow, which is hidden behind a feature flag that we haven't enabled on the production satellites yet. The team is also working on updating the file browser component. The QA team has run a few tests with storj-sim over the week with mixed results. The file browser in particular was causing some trouble. The strange thing is that there is no commit in the list that would explain it. We also know that this can be caused by a very tiny code change that sneaked in with a different commit. → Test the file browser on the QA satellite as the final quality gate. Last chance to raise questions.

Anything missing? What would you test based on that list of commits and why?


There is actually a blind spot in the test automation. All repair tests only verify an audit score decrease; not a single one checks for an audit score increase. This leaves a small gap that could get all storage nodes disqualified. Take a big node that fails 1% of its repair downloads. Expected behavior: the repair job should decrease the audit score for the 1% of failed piece downloads but increase it for the other 99% successful downloads. Actual behavior might be: if the repair job doesn't increase the audit score, it is just a question of time until the 1% repair failures stack up and get the storage node disqualified.
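
Here is a rough sketch of that math in Go. The alpha/beta update rule and every number in it (lambda, weight, starting reputation, disqualification threshold) are illustrative assumptions to show the shape of the problem, not the production configuration:

package main

import "fmt"

// update applies one reputation update.
// score = alpha / (alpha + beta); a success raises alpha, a failure raises beta.
func update(alpha, beta float64, success bool) (float64, float64) {
	const lambda, weight = 0.95, 1.0 // illustrative values, not production config
	if success {
		return lambda*alpha + weight, lambda * beta
	}
	return lambda * alpha, lambda*beta + weight
}

func main() {
	const dqThreshold = 0.6 // illustrative disqualification threshold

	// Start from roughly the steady state of a node with a long success history.
	startAlpha := 1.0 / (1.0 - 0.95)

	// Case 1 (the feared bug): only the 1% failed repairs touch the score.
	alpha, beta := startAlpha, 0.0
	for i := 1; ; i++ {
		if i%100 == 0 { // every 100th repair fails
			alpha, beta = update(alpha, beta, false)
		}
		if alpha/(alpha+beta) < dqThreshold {
			fmt.Printf("decrease-only: disqualified after %d repairs\n", i)
			break
		}
	}

	// Case 2 (intended behavior): the 99% successful repairs also raise the score.
	alpha, beta = startAlpha, 0.0
	for i := 1; i <= 100000; i++ {
		alpha, beta = update(alpha, beta, i%100 != 0)
	}
	fmt.Printf("with increases: score after 100000 repairs = %.3f\n", alpha/(alpha+beta))
}

In the decrease-only case the score can only ratchet down, so the sketched node gets disqualified after roughly a thousand repairs, while the variant that also credits the 99% successful downloads stays well above the threshold.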

To mitigate this risk I have to:

  1. Ask the durability team to please add the missing test steps. If I can see the tests passing before the deployment on Monday, that should work.
  2. In parallel, extend our storj-sim test a bit. It only takes a few more minutes to test this.
  3. Worst case, revert the commit and deploy the old repair job.

Normally we would deploy the satellites right now, but @Andrii found a bug this morning on the QA satellite. At the moment customers can't pay with STORJ tokens. A bugfix is incoming. We will cherry-pick it and deploy it on the QA satellite today or tomorrow. The next window for production deployment is in 24 hours.


Will the repair process only update the audit score on conditions that would fail an audit (corrupt data or a file-not-found error), or will it also update the audit score in case my node "lost the race" / the download was canceled?

I think that was already answered. It should decrease on failure and increase on success.

The sentence that follows is just a possible fault scenario they might want to create tests for, so that faulty code doesn't go to production.

Edit: I just noticed I answered a question you didn’t ask… my bad.

The repair job has no long-tail cancellation. If you hit the timeout, it should be treated similarly to an audit timeout.

Is that a relatively recent change? Because I have some log entries like this:

2021-08-28T04:37:06.300Z       INFO    piecestore      download canceled       {"Piece ID": "ADVC6AQEW6KKAGDSQ25VVVU6DF6A7E5FLUI7NSYRY6KIDKGLWAIQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_REPAIR"}
2021-08-28T04:37:26.284Z       INFO    piecestore      download canceled       {"Piece ID": "KSQMJXIOBGGIYTDAL5XSP65SHGCXGGLD4BLRDKZJ647P27XBXAQQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_REPAIR"}

Does this mean:

  1. My node timed out, it should not have done that, and the audit score should go down? In that case the message is misleading (because it looks like a normal cancellation).
  2. This was changed and now GET_REPAIR downloads are no longer canceled?

Without further information, I would still say yes, that looks like a timeout and it would reduce the audit score. As long as this only happens every now and then, we would also expect a similar amount of repair successes that will compensate for it.

For more details you could search for the corresponding download started message. If it is 5 minutes apart, that is a timeout for sure. If it is not 5 minutes, then the next question would be whether your ISP had a reconnect or your storage node restarted. That might be hard to find out. The last option would be that the repair job crashed. That would cancel the connection but it wouldn't apply the penalty.
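
For example, something like this pulls up both messages for a given piece so you can compare the timestamps (the log path is just a placeholder):

grep "ADVC6AQEW6KKAGDSQ25VVVU6DF6A7E5FLUI7NSYRY6KIDKGLWAIQ" /path/to/storagenode.log | grep -E "download (started|canceled)"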

2021-08-28T04:37:06.243Z        INFO    piecestore      download started        {"Piece ID": "ADVC6AQEW6KKAGDSQ25VVVU6DF6A7E5FLUI7NSYRY6KIDKGLWAIQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_REPAIR"}
2021-08-28T04:37:06.300Z        INFO    piecestore      download canceled       {"Piece ID": "ADVC6AQEW6KKAGDSQ25VVVU6DF6A7E5FLUI7NSYRY6KIDKGLWAIQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_REPAIR"}
2021-08-28T04:37:26.239Z        INFO    piecestore      download started        {"Piece ID": "KSQMJXIOBGGIYTDAL5XSP65SHGCXGGLD4BLRDKZJ647P27XBXAQQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_REPAIR"}
2021-08-28T04:37:26.284Z        INFO    piecestore      download canceled       {"Piece ID": "KSQMJXIOBGGIYTDAL5XSP65SHGCXGGLD4BLRDKZJ647P27XBXAQQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_REPAIR"}
2021-08-28T19:35:49.741Z        INFO    piecestore      download started        {"Piece ID": "ADVC6AQEW6KKAGDSQ25VVVU6DF6A7E5FLUI7NSYRY6KIDKGLWAIQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_REPAIR"}
2021-08-28T19:35:51.373Z        INFO    piecestore      downloaded      {"Piece ID": "ADVC6AQEW6KKAGDSQ25VVVU6DF6A7E5FLUI7NSYRY6KIDKGLWAIQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_REPAIR"}
2021-08-28T19:35:53.787Z        INFO    piecestore      download started        {"Piece ID": "KSQMJXIOBGGIYTDAL5XSP65SHGCXGGLD4BLRDKZJ647P27XBXAQQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_REPAIR"}
2021-08-28T19:35:54.731Z        INFO    piecestore      downloaded      {"Piece ID": "KSQMJXIOBGGIYTDAL5XSP65SHGCXGGLD4BLRDKZJ647P27XBXAQQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_REPAIR"}

There were more than those two canceled repairs at about the same time.

Thank you for your feedback. I will ask the developer team how that can happen and what consequences that might have.

One other potential issue would be the number of concurrent repair requests. The audit job itself has a built-in limit, so a single mistake wouldn't get you disqualified right away because you have to fail more than one audit request. With repair, however, enough requests can be in flight at the same time that a single mistake might get you disqualified: all the concurrent repair requests would fail at once, and that has a much bigger impact.

The satellites were deployed yesterday and the storage node rollout will start soon: Changelog v1.40.4
