Dear all,
The current upgrade procedure from jewel to luminous, as stated in the RC's
release notes, can be boiled down to
- upgrade all monitors first
- upgrade OSDs only after we have a **full** quorum of luminous monitors,
comprising all the monitors in the monmap (i.e., once we have the
'luminous' feature enabled in the monmap; a rough way to check for this
from a script is sketched right after this list).
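For what it's worth, here is a minimal sketch of how one might gate OSD
restarts on that condition from a script. It assumes `ceph mon feature ls`
can emit JSON and that the monmap's persistent features land under a
'monmap'/'persistent' key; I haven't verified the exact schema, so treat
the key names as guesses rather than gospel:

#!/usr/bin/env python
# Sketch only: refuse to restart OSDs until the monmap carries the
# 'luminous' persistent feature. The JSON layout assumed below
# ({"monmap": {"persistent": [...]}}) is a guess and may need adjusting
# to whatever `ceph mon feature ls -f json` actually emits.
import json
import subprocess
import sys


def monmap_is_luminous():
    out = subprocess.check_output(
        ['ceph', 'mon', 'feature', 'ls', '-f', 'json'])
    features = json.loads(out.decode('utf-8'))
    persistent = features.get('monmap', {}).get('persistent', [])
    return 'luminous' in persistent


if __name__ == '__main__':
    if monmap_is_luminous():
        print('monmap is luminous; OSD restarts on this node should be safe')
        sys.exit(0)
    print('monmap is NOT luminous yet; hold off on OSD restarts')
    sys.exit(1)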
While the procedure itself is a reasonable idea in principle, as it
reduces a lot of the possible upgrade combinations we need to test and is
simple enough from Ceph's point of view, it doesn't seem to be a
widespread upgrade procedure.
As far as I can tell, it's not uncommon for users to take this
maintenance window to perform system-wide upgrades, including kernel and
glibc for instance, and to finish the upgrade with a reboot.
The problem with our current upgrade procedure is that once the first
server reboots, the OSDs on that server will be unable to boot, as the
monitor quorum is not yet 'luminous'.
The only way to minimize potential downtime is to upgrade and restart
all the nodes at the same time, which can be daunting and basically
defeats the purpose of a rolling upgrade. And in that scenario there is
an expectation of downtime, something Ceph is built to prevent.
Additionally, requiring the `luminous` feature to be enabled in the
quorum becomes even less realistic in the face of possible failures. God
forbid that, in the middle of the upgrade, the last remaining monitor
server dies a horrible death (power, network, whatever). We'd still be
left with a 'not-luminous' quorum and a bunch of OSDs waiting for this
flag to be flipped, and now it's a race to either get that monitor back
up or remove it from the monmap.
Even if one were to decide to upgrade only the system packages, reboot,
and only then upgrade the Ceph packages, there is the unfortunate
possibility that library interdependencies would force Ceph's binaries
to be updated along with the system, so this may be a show-stopper as well.
Alternatively, if one simply upgrades the system without rebooting, and
then proceeds with the Ceph upgrade procedure, one is still in a fragile
position: if, for some reason, one of the nodes reboots, we're back in
the same precarious situation as before.
Personally, I can see three ways out of this, at different points on
the reasonability spectrum:
1. add temporary monitor nodes to the cluster, be they on VMs or bare
hardware, already running Luminous, and then remove the same number of
monitors from the cluster. This leaves us with a single monitor node to
upgrade. The drawback is that folks may not have spare nodes to run the
monitors on, or would be running monitors on VMs, which may hurt their
performance during the upgrade window and adds complexity in terms of
firewall and routing rules. (The monmap churn this involves is sketched
right after this list.)
2. migrate/upgrade all the nodes on which monitors are located first,
and only restart them after we've gotten all the nodes upgraded. If
anything goes wrong, one can hurry through this step or fall back to 3.
3. reduce the monitor quorum to 1. This pains me to even think about,
and it bothers me to bits that I find myself considering it a reasonable
possibility. It shouldn't seem reasonable, because it isn't. But it's a
lot more realistic than expecting users to put up with OSD downtime
during an upgrade procedure.
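To make option 1 (and, by extension, option 3) a bit more concrete, the
monmap churn would be something along these lines. This is only a sketch:
the host names and addresses are made up, and the real work of
bootstrapping each temporary monitor (ceph-mon --mkfs, keyrings, starting
the daemon) is left out entirely; all that's shown is the
`ceph mon add`/`ceph mon remove` bookkeeping.

#!/usr/bin/env python
# Sketch of option 1: register temporary luminous monitors, then drop
# the same number of old ones from the monmap. Names and addresses are
# hypothetical; bootstrapping the new daemons is not shown.
import subprocess


def run(*cmd):
    print('+ ' + ' '.join(cmd))
    subprocess.check_call(list(cmd))


# hypothetical temporary mons (already running luminous) and the old
# mons they are meant to replace
new_mons = [('mon-tmp-a', '192.168.0.101'), ('mon-tmp-b', '192.168.0.102')]
old_mons = ['mon-old-a', 'mon-old-b']

for name, addr in new_mons:
    run('ceph', 'mon', 'add', name, addr)

for name in old_mons:
    run('ceph', 'mon', 'remove', name)

Option 3 is essentially the same `ceph mon remove` loop, just applied
until a single monitor is left standing.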
On top of all this, I found during my tests that any OSD already
running luminous before the quorum itself is luminous will need to be
restarted before it can properly boot into the cluster. I'm guessing
this is a bug rather than a feature, though.
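For anyone hitting the same thing, a crude workaround sketch would be to
simply kick the local OSDs once the quorum reports the luminous feature.
It assumes systemd with the stock ceph-osd@<id> units and the default
/var/lib/ceph/osd/ceph-<id> data directories, both of which are
assumptions about the deployment rather than anything the issue itself
depends on:

#!/usr/bin/env python
# Crude workaround sketch: restart every OSD found on this node once the
# quorum is luminous. Assumes systemd with the stock ceph-osd@<id> units
# and the default /var/lib/ceph/osd/ceph-<id> data directories; adjust
# for other cluster names or layouts.
import glob
import os
import subprocess

for path in glob.glob('/var/lib/ceph/osd/ceph-*'):
    osd_id = os.path.basename(path).split('-', 1)[1]
    subprocess.check_call(['systemctl', 'restart', 'ceph-osd@' + osd_id])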
If you have any thoughts on how to mitigate this, or if I've gotten
this all wrong and am missing a crucial detail that blows this wall of
text away, please let me know.
-Joao