Greetings mailing list!

I spent the last 2 months researching and testing the best way to convert over to cephadm from both ceph-ansible and ceph-deploy, and this past Sunday I tried to upgrade and convert a cluster. The upgrade from Nautilus to Octopus 15.2.16 went fine after I removed ceph-dashboard, and all nodes came up as expected. I then converted Octopus over to cephadm management (staying on Octopus during the conversion). This particular cluster was an Ansible OS prep with a ceph-deploy install, running CentOS 7 (updated to the latest build), which is the reason I converted Octopus to cephadm-managed Octopus first.

After successfully adopting the mons, mgrs and osds, it came time to push out the rgws. Having played with Quincy, I decided to create an rgw service and manually list the nodes. After waiting about 30 minutes, I noticed the gateways were not loaded. I had successfully added the rgw and iscsi hosts just like the mons and osds, but for some reason it wasn't pushing the image to the rgws. When I checked podman on the rgw nodes, there were no containers running, and the log didn't show any reason for them not being deployed, so I thought it was the service. I deleted the newly created rgw service and decided it was best to upgrade to Quincy before retrying the rgw deployment, since I have a working cluster that used the cephadm deployment for rgws. I also noticed several features missing from the Octopus dashboard, which backed up my decision to upgrade.

I started the upgrade and then, when I checked the status of the upgrade, noticed the cluster was rebalancing. It wasn't doing that before I started the upgrade, but since that process had started, I decided to cancel the upgrade and let the rebalance finish. This is where the trouble started. Before stopping the upgrade, I checked the upgrade status and saw that it was still blank. I then stopped the upgrade. After running into an issue with the dashboard no longer loading, I discovered through the versions command that two of the mgrs had upgraded to Quincy and the third had not. The monitors were not upgraded on any nodes (basically just the two mgrs had upgraded). The upgrade did stop, so I waited for the rebalance to finish and tried to start the upgrade again.

Issue at hand: The upgrade will not start again. The rgw service, while showing as deleting in the "ceph orch ls" output, was not actually deleting. Since two of the mgrs upgraded, I was able to load the new dashboard by manually failing over to one of the upgraded mgr nodes, but the dashboard will not load the services pages; it just returns a 500 error.
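For reference, the conversion and upgrade steps looked roughly like this. This is a sketch from memory rather than a copy of my shell history; the service name, placement hosts and mgr name are placeholders, and on Octopus I believe the rgw apply command also wanted realm/zone arguments:

    # adopt the legacy daemons on each node (repeated per host / per daemon)
    cephadm adopt --style legacy --name mon.<hostname>
    cephadm adopt --style legacy --name mgr.<hostname>
    cephadm adopt --style legacy --name osd.<osd-id>

    # add the rgw/iscsi hosts, then create the rgw service with an explicit host list
    ceph orch host add <rgw-host>
    ceph orch apply rgw myrgw --placement="rgw1 rgw2 rgw3 rgw4"

    # the gateways never appeared, so remove the service again
    ceph orch rm rgw.myrgw

    # start the Quincy upgrade, check on it, then stop it once the rebalance showed up
    ceph orch upgrade start --ceph-version 17.2.0
    ceph orch upgrade status
    ceph orch upgrade stop

    # afterwards, compare daemon versions and fail over between the old and new mgrs
    ceph versions
    ceph mgr fail <active-mgr-name>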
After failing over, I cannot run the "ceph orch ls" command; the output is this (the same whether or not I fail over to the 3rd mgr):

    Error EINVAL: Traceback (most recent call last):
      File "/usr/share/ceph/mgr/mgr_module.py", line 1701, in _handle_command
        return self.handle_command(inbuf, cmd)
      File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 171, in handle_command
        return dispatch[cmd['prefix']].call(self, cmd, inbuf)
      File "/usr/share/ceph/mgr/mgr_module.py", line 433, in call
        return self.func(mgr, **kwargs)
      File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 107, in <lambda>
        wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)  # noqa: E731
      File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 96, in wrapper
        return func(*args, **kwargs)
      File "/usr/share/ceph/mgr/orchestrator/module.py", line 575, in _list_services
        services = raise_if_exception(completion)
      File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 228, in raise_if_exception
        raise e
    KeyError: 'cephadm'

If I fail back over to the Octopus-version mgr, the health output changes from HEALTH_OK to stray daemons and a failed cephadm module notice. The cluster is functional and servicing clients ok, I just can't seem to get it to do any orchestration. I can certainly blow away the third mgr if need be. I also have two more servers ready to go to make it 5 monitors, but deploying the 4th monitor with cephadm doesn't work right now through the dashboard. There are 3 mons which are also mgrs, 4 gateways and 12 osd nodes in the cluster.

I have two more clusters to upgrade like this, so I am thinking it would be best to jump right to Quincy next time instead of messing with the Octopus dashboard; I'm just leery of the CentOS 7 OS possibly causing an issue. I wouldn't think so since this is containers; I have experience with Mesosphere and Docker clusters.

Thoughts on my trainwreck? Many thanks for reading!

Regards,
-Brent

Existing Clusters:
Test: Quincy 17.2.0 (all virtual on nvme)
US Production (HDD): Octopus 15.2.16 with 11 osd servers, 3 mons, 4 gateways, 2 iscsi gateways
UK Production (HDD): Nautilus 14.2.22 with 18 osd servers, 3 mons, 4 gateways, 2 iscsi gateways
US Production (SSD): Quincy 17.2.0 cephadm with 6 osd servers, 5 mons, 4 gateways, 2 iscsi gateways
UK Production (SSD): Octopus 15.2.14 with 5 osd servers, 3 mons, 4 gateways

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx