Octopus: conversion from ceph-ansible to Cephadm causes unexpected 15.2.15 → 15.2.13 downgrade for MDSs and RGWs

Hello everyone,

My colleagues and I just ran into an interesting situation while updating our Ceph training course. That course's labs cover deploying a Nautilus cluster with ceph-ansible, upgrading it to Octopus (also with ceph-ansible), and then converting it to Cephadm before proceeding with the upgrade to Pacific.

When freshly upgraded to Octopus with ceph-ansible, the entire cluster is at version 15.2.15, and everything that is then adopted into Cephadm management (with "cephadm adopt --style legacy") gets containers running that release. So far, so good.
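For reference, the adoption itself is just the per-daemon invocation from the adoption docs, run on each node in turn (the daemon names below are of course placeholders for our lab hosts):

# cephadm adopt --style legacy --name mon.$(hostname)
# cephadm adopt --style legacy --name mgr.$(hostname)
# cephadm adopt --style legacy --name osd.0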

Once we've completed the adoption process for MGRs, MONs, and OSDs, we proceed to redeploy our MDSs and RGWs using "ceph orch apply mds" and "ceph orch apply rgw". What we end up with, however, is a bunch of MDSs and RGWs running 15.2.13. Since the cluster previously ran Ansible-deployed 15.2.15 MDSs and RGWs, this amounts to a partial (and very unexpected) downgrade.
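Concretely, the redeploy commands we run look roughly like this (the filesystem name, realm/zone, and placement count reflect our lab setup, so adjust as needed):

# ceph orch apply mds cephfs --placement=3
# ceph orch apply rgw default default --placement=3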

The docs at https://docs.ceph.com/en/octopus/cephadm/adoption/ do state that we can use "cephadm --image <image>" to set the image. But we don't actually need that when we invoke cephadm directly ("cephadm adopt" does pull the correct image). Rather, we'd need to set the correct image for the daemons deployed by "ceph orch apply", and there doesn't seem to be a straightforward way to do that.
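For completeness, the documented invocation for that case would be something like the following (the image tag is merely an example):

# cephadm --image quay.io/ceph/ceph:v15.2.15 adopt --style legacy --name mon.$(hostname)

But again, the adopted daemons already come up on 15.2.15 without this; what we're missing is an equivalent override for the daemons that "ceph orch apply" creates.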

I suppose that this can be worked around in a couple of ways:

* by following the documentation and then running "ceph orch upgrade start --ceph-version 15.2.15" immediately after;

* by running "ceph orch daemon redeploy", which does support an --image parameter (but is per-daemon, thus less convenient than running through a rolling update).
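In command form, those two workarounds would look roughly like this (the daemon name is a placeholder; please double-check the redeploy syntax with "ceph orch daemon redeploy -h" on your version):

# ceph orch upgrade start --ceph-version 15.2.15
# ceph orch daemon redeploy <daemon.name> --image quay.io/ceph/ceph:v15.2.15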

But I'd argue that none of those additional steps should actually be necessary — rather, "ceph orch apply" should just deploy the correct (latest) version without additional user involvement.

The documentation seems to suggest another approach, namely to use an updated service spec, but unfortunately that won't work as we can't set "image" that way. Example for the rgw service:

---
# rgw.yaml
service_type: rgw
service_id: default.default
placement:
  count: 3
image: "quay.io/ceph/ceph:v15"
ports:
  - 7480

# ceph orch apply -i rgw.yaml
Error EINVAL: ServiceSpec: __init__() got an unexpected keyword argument 'image'

So we're curious: what is the correct way to ensure that "ceph orch apply" deploys the latest Octopus release for MDSs and RGWs being redeployed as part of a Cephadm cluster conversion? Or is this simply a bug somewhere in the orchestrator that needs fixing?

Cheers,
Florian

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



