Are automatic node downgrades expected?

When a new version comes out… it looks like a couple nodes initially get it (as expected)… most stay on the previous version (as expected)… and some go backwards a version (unexpected)?

For example: when 1.101.3 came out… I saw my nodes slowly upgrade over time from 1.99.3 (I think there were a couple 1.100’s in there: but that version got pulled?). As of yesterday all my nodes were 1.101.3 (I still have browser tabs open showing them). Yet this morning I see a few of my nodes (5%?) on 1.102.3 (which is fine), but many of them went backwards to 1.99.3 (about 15%?).

So I went from 100% 1.101.3… to 15% 1.99.3 / 80% 1.101.3 / 5% 1.102.3 (rough numbers)

Should the auto-upgrade feature ever take a node backward a version?

I think I saw this moving from 1.95.x as well, but didn’t record anything, and assumed I was crazy. Note: some nodes get restarted if their Internet connection flakes: so they may have restarted-to-recheck-versions… but I haven’t checked if all the ones that downgraded did get restarted.

Update: I think the downgraded nodes do correspond to the ones that were restarted. And the script I have fixing things does a docker stop/remove/compose, not a stop/compose. So it’s entirely possible the system is doing this to itself: by wiping out the container it loses any intermediate version, and upon restart the node only thinks it should be at the minimum version.

@elek :point_up: Please have a look at this.

1 Like

I don’t really know what I’m talking about: but it feels like the upgrade logic says “If you’re not on the list for the highest version… then use the lowest version” (and ignores any newer-than-minimum version in-between, so it rolls it back)?

I don’t know exactly how the update works, but I only saw downgrades when I removed and recreated the container because I changed some parameter. So if you just restart the machine, I would expect no downgrade, only a restart with the same version or newer. This is my logic, but maybe I’m wrong.

Yes, I just had to edit my initial post: the script I have repairing the nodes is removing the containers (which is wrong). So I can understand why fresh nodes may be at a lower version.

Marking this as resolved, since apparently I was shooting myself in the foot. Leaving it up for others to find in the future :slight_smile:

1 Like

This was seen after 1.97.2 started crashing nodes after updating. Storjlings downgraded the minimum allowed version to stop the rollout. Any other time, if you stop/remove/start the container (emphasis on “remove”), it should NOT downgrade your node’s version. I think the logic to revert a node back to the allowed minimum version is still in place, which is why your node(s) were downgraded. I would wait for Elek to give his expert comments/solution.

2 Likes

The reason some nodes were downgraded is that the cursor was missing, not that the minimum version was lowered.

Is there a cursor present?

  • Yes
    • Does it match my node’s group?
      1. Yes > Upgrade to the rollout version
      2. No > Do nothing (“a newer version is being rolled out but hasn’t reached your node yet”, or something like that, gets logged)
  • No > Download the minimum version

The cursor was blank when 1.97 was stopped, and recently it was blank for a few hours before 1.102 started being rolled out.
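
To make that tree concrete, here is a rough Go sketch of the decision. The exact selection scheme (an HMAC of the node ID with the rollout seed, compared against the cursor) and all names here are assumptions for illustration, not the actual storagenode-updater code:

```go
// Rough sketch of the decision tree above. Names and the hashing
// scheme are illustrative assumptions, not the real updater code.
package main

import (
	"bytes"
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
)

// rollout mirrors the rollout block served by version.storj.io:
// a target version, a seed, and a cursor marking how much of the
// hash space is included so far (assumed layout).
type rollout struct {
	version string
	seed    []byte
	cursor  []byte // empty means "no cursor present"
}

// pickVersion follows the tree: no cursor -> minimum version;
// cursor present and node selected -> rollout version; otherwise
// keep the current version ("do nothing").
func pickVersion(nodeID []byte, minimum, current string, r rollout) string {
	if len(r.cursor) == 0 {
		// No cursor: fall back to the minimum allowed version.
		// This is the branch that downgrades a freshly recreated container.
		return minimum
	}
	// Hash the node ID with the rollout seed and compare against the cursor.
	mac := hmac.New(sha256.New, r.seed)
	mac.Write(nodeID)
	if bytes.Compare(mac.Sum(nil), r.cursor) <= 0 {
		return r.version // node is inside the rollout group
	}
	// "A newer version is being rolled out but hasn't reached your node yet."
	return current
}

func main() {
	r := rollout{
		version: "1.102.3",
		seed:    []byte("example-seed"),
		cursor:  append([]byte{0x3f}, bytes.Repeat([]byte{0xff}, 31)...),
	}
	fmt.Println(pickVersion([]byte("example-node-id"), "1.99.3", "1.101.3", r))
}
```

With no cursor published, the only branch left is “download the minimum version”, which matches the downgrades people are seeing on container recreation.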

1 Like

I saw the exact same behavior today. All my nodes were 1.101.3 and after a server restart I had some with 1.99.3, some with 1.101.3 and some with 1.102.3…
(All docker nodes)

1 Like

Interesting! I thought it was because I was restarting them the wrong way that some downgraded. Thanks for letting us know.

Same thing happened to me. According to my notes, my docker node was running 1.101.3 on April 17. At some point in the last couple of days it downgraded itself to v1.99.3. Like other people in this thread, I have also recently restarted my node after changing a setting.

Does the built-in node updater or Watchtower generate any logs that would help explain what is going on?

1 Like

`docker logs watchtower` can show you the watchtower log.

1 Like

The watchtower would only update the base image, which is rare, not the node inside. The container will download a new version according to version.storj.io and the cursor in it, so I guess @Mitsos is right.
Pinged the team.
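
For anyone who wants to see exactly what the containers are being told, a minimal sketch that just fetches the version server and prints the raw response; the storagenode entry in it contains the minimum/suggested versions and the rollout cursor discussed above:

```go
// Fetch https://version.storj.io and print the raw JSON so the
// storagenode minimum/suggested versions and the rollout cursor
// can be inspected directly.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	resp, err := http.Get("https://version.storj.io")
	if err != nil {
		fmt.Fprintln(os.Stderr, "request failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read failed:", err)
		os.Exit(1)
	}
	fmt.Println(string(body))
}
```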

1 Like

No, they are not. Only in case of emergency, which is rare.

This is an excellent comment, and the full post is an excellent problem description.

Before ~~1.102~~ 1.101, the ~~1.101~~ 1.100 release was accidentally rolled out to 6% of the nodes. It didn’t have the GC fix, so the rollout was stopped. But instead of directly starting the rollout of ~~1.102~~ 1.101, for a shorter period of time, it was reverted to the original version.

We will do our best to avoid similar downgrades in the future.

5 Likes

This happened to me today.
I had both nodes (running on the same HW under docker) on v1.102.2.

I decided to move the databases from HDD to SSD (using instructions found on the forum).
One node downgraded to 1.99.3, the other upgraded to 1.102.3.

1 Like

I have serious doubts about that number, since all 14 of my nodes were updated to v1.101.3. That is technically possible if only 6% were updated, but at a chance of 0.00000000000000078% I would consider it a statistical impossibility.
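
For reference, the arithmetic behind that percentage, as a back-of-the-envelope estimate assuming each node lands in the rollout independently with 6% probability:

```go
// Odds of 14 independent nodes all landing inside a rollout that
// covers only 6% of the network.
package main

import (
	"fmt"
	"math"
)

func main() {
	p := math.Pow(0.06, 14)
	fmt.Printf("%.2g, i.e. %.2g%%\n", p, p*100) // ~7.8e-18, i.e. ~7.8e-16%
}
```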

1 Like

I have 3 nodes all at version 1.101.3

Sorry, I was confused by too many versions. I have fixed my original post:

It happened between v1.99.3 and v1.100.3 (stopped after 6%, and the cursor set back to 0). v1.101.3 was fully rolled out…

Do you know why the minimum version on https://version.storj.io/ wasn’t bumped to v1.101.3 before the rollout of v1.102.3 started? That is what’s causing the rollbacks now on many nodes that restart. Seems like a mistake.

1 Like

No, I don’t know. But I think it won’t be a problem as long as the cursor isn’t reset back to zero…

I checked the history, and it’s not always bumped together with the start of the rollout.

TBH, I don’t know the best strategy, as there should be some grace period before making the update mandatory…

It’s also possible to save the last used version and disable downgrade, but I think downgrade can be helpful in case of a very nasty bug…

It was reset though. The cursor is now 3fffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff. So anyone who isn’t selected for v1.102.3 yet will be downgraded to v1.99.3 on a restart of the docker container.
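
As a side note, that cursor value can be read as a fraction of the hash space: assuming selection works roughly as “node hash ≤ cursor” (an assumption, not confirmed), a cursor of 3fff…f covers about a quarter of it. A quick check:

```go
// Read the quoted rollout cursor as a fraction of the full hash space
// of the same width. Interpreting "hash <= cursor" as "selected" is an
// assumption for illustration.
package main

import (
	"fmt"
	"math/big"
)

func main() {
	const cursorHex = "3fffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff"
	cursor, ok := new(big.Int).SetString(cursorHex, 16)
	if !ok {
		panic("bad hex")
	}
	// Full space for a hex string of this width: 16^len(cursorHex).
	space := new(big.Int).Exp(big.NewInt(16), big.NewInt(int64(len(cursorHex))), nil)
	frac, _ := new(big.Float).Quo(new(big.Float).SetInt(cursor), new(big.Float).SetInt(space)).Float64()
	fmt.Printf("cursor covers about %.0f%% of the hash space\n", frac*100)
}
```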

It is my understanding that this version isn’t used to show what’s mandatory, but I’m not entirely certain. Last time I heard about it, the version for which ingress stopped wasn’t visible here.

I agree, this should be managed by you guys. There are definitely scenarios where a downgrade (when intentional) should be able to propagate to all nodes. But this one is extra painful, as nodes which migrated to date-based trash folders will be downgraded to a node version unaware of those. This means that 1.99 won’t clean up trash in date-based folders, and the migration likely won’t run again when upgrading to 1.102, because the file signaling that date-based trash is in use already exists. That leaves trash generated by 1.99 in the meantime to possibly stick around forever.