A few of our customers were affected by that, but as far as I remember
(I can look it up tomorrow), the actual issue only popped up if they
had more than two MGRs. I believe it was resolved in a newer Pacific
version (I don't have the exact version in mind). Which version did
you try to upgrade to? There shouldn't be any reason to remove other
daemons.
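In the meantime, you can double-check how many mgr daemons the
orchestrator is running and which versions are active with the usual
commands, for example:

ceph orch ps --daemon-type mgr
ceph versions
ceph orch upgrade status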
Quoting "Alex Hussein-Kershaw (HE/HIM)" <alexhus@xxxxxxxxxxxxx>:
I spotted this SUSE knowledge base article: "Performing a `ceph orch
restart mgr` results in endless restart loop"
(https://www.suse.com/support/kb/doc/?id=000020530), which sounded
quite similar, so I gave it a go and did:
ceph orch daemon rm mgr.raynor-sc-1
< wait a bit for it to be created >
< repeat for each host >
That seemed to solve my problem. I upgraded and it just worked.
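For anyone else trying this, it's probably worth confirming between
removals that the mgr really has been recreated before moving on to
the next host, something like:

$ ceph orch ps --daemon-type mgr
$ ceph -s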
It did get me wondering whether I should be doing the same for my
monitors (and even OSDs) post-adoption. They do seem to have a
different naming scheme.
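For example, comparing the names the orchestrator reports against
the running containers should show whether they line up:

$ ceph orch ps --daemon-type mon
$ ceph orch ps --daemon-type osd
$ docker ps --format '{{.Names}}'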
Best Wishes,
Alex
________________________________
From: Alex Hussein-Kershaw (HE/HIM)
Sent: Wednesday, August 14, 2024 3:06 PM
To: ceph-users <ceph-users@xxxxxxx>
Subject: Cephadm Upgrade Issue
Hi Folks,
I'm prototyping the upgrade process for our Ceph clusters. I've
adopted the cluster following the docs, and that works nicely 🙂 I
then load my Docker image into a locally running container registry,
as I'm in a disconnected environment. I have a test cluster with 3
VMs and no data, adopted at Octopus and upgrading to Pacific. I'm
running a MON, MGR, MDS, and OSD on each VM.
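For context, getting the image into that local registry is roughly
the following (the tarball name and source tag are just examples of
how I ship it into the disconnected environment):

$ docker load -i ceph-v16.2.15.tar
$ docker tag quay.io/ceph/ceph:v16.2.15 localhost:5000/ceph/pacific:v16.2.15
$ docker push localhost:5000/ceph/pacific:v16.2.15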
I then attempt to upgrade:
ceph orch upgrade start --image localhost:5000/ceph/pacific:v16.2.15
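Progress can then be followed with:

ceph orch upgrade status
ceph -W cephadm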
Lots of logs below, but the summary appears to be that we initially
fail to upgrade the managers and get into a bad state. It looks like
there is some confusion in manager naming, and we end up with two
managers on each machine instead of one. Eventually Ceph reports a
health error:
$ ceph -s
  cluster:
    id:     e773d9c2-6d8d-4413-8e8f-e38f248f5959
    health: HEALTH_ERR
            1 failed cephadm daemon(s)
            Module 'cephadm' has failed: 'cephadm'
That does seem to eventually clean itself up, and the upgrade appears
to have completed ("ceph versions" shows everything on Pacific), but
it feels a bit bumpy. Hoping someone has some guidance here. The
containers on one host during the upgrade are shown below. Notice I
somehow have two managers, whose names differ by a single character
(a "-" replaced with a "."):
$ docker ps | grep mgr
2143b6f0e0e6 localhost:5000/ceph/pacific:v16.2.15
"/usr/bin/ceph-mgr -…" About a minute ago Up About a minute
ceph-e773d9c2-6d8d-4413-8e8f-e38f248f5959-mgr.raynor-sc-2
59c8cfddac64 ceph-daemon:v5.0.12-stable-5.0-octopus-centos-8
"/usr/bin/ceph-mgr -…" 14 minutes ago Up 14 minutes
ceph-e773d9c2-6d8d-4413-8e8f-e38f248f5959-mgr-raynor-sc-2
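If it's useful, running "cephadm ls" on the host lists every daemon
it knows about together with its style (legacy vs. cephadm-managed),
which should show where the second mgr is coming from, e.g.:

$ cephadm ls | grep -e '"name"' -e '"style"'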
In the output of "ceph -w" I see this sort of stuff:
2024-08-14T13:45:13.003405+0000 mon.raynor-sc-1 [INF] Manager daemon
raynor-sc-3 is now available
2024-08-14T13:45:23.179699+0000 mon.raynor-sc-1 [ERR] Health check
failed: Module 'cephadm' has failed: 'cephadm' (MGR_MODULE_ERROR)
2024-08-14T13:45:22.372376+0000 mgr.raynor-sc-3 [ERR] Unhandled
exception from module 'cephadm' while running on mgr.raynor-sc-3:
'cephadm'
2024-08-14T13:45:24.761961+0000 mon.raynor-sc-1 [INF] Active manager
daemon raynor-sc-3 restarted
2024-08-14T13:45:24.766395+0000 mon.raynor-sc-1 [INF] Activating
manager daemon raynor-sc-3
2024-08-14T13:45:31.800989+0000 mon.raynor-sc-1 [INF] Manager daemon
raynor-sc-3 is now available
2024-08-14T13:45:32.874227+0000 mon.raynor-sc-1 [INF] Health check
cleared: MGR_MODULE_ERROR (was: Module 'cephadm' has failed:
'cephadm')
2024-08-14T13:45:32.874269+0000 mon.raynor-sc-1 [INF] Cluster is now healthy
2024-08-14T13:45:33.664602+0000 mon.raynor-sc-1 [INF] Active manager
daemon raynor-sc-3 restarted
2024-08-14T13:45:33.671809+0000 mon.raynor-sc-1 [INF] Activating
manager daemon raynor-sc-3
2024-08-14T13:45:34.050292+0000 mon.raynor-sc-1 [INF] Manager daemon
raynor-sc-3 is now available
2024-08-14T13:45:38.260385+0000 mon.raynor-sc-1 [WRN] Health check
failed: 1 failed cephadm daemon(s) (CEPHADM_FAILED_DAEMON)
2024-08-14T13:45:43.462665+0000 mgr.raynor-sc-3 [ERR] Unhandled
exception from module 'cephadm' while running on mgr.raynor-sc-3:
'cephadm'
2024-08-14T13:45:44.770711+0000 mon.raynor-sc-1 [ERR] Health check
failed: Module 'cephadm' has failed: 'cephadm' (MGR_MODULE_ERROR)
2024-08-14T13:45:45.668379+0000 mon.raynor-sc-1 [INF] Active manager
daemon raynor-sc-3 restarted
2024-08-14T13:45:45.673206+0000 mon.raynor-sc-1 [INF] Activating
manager daemon raynor-sc-3
2024-08-14T13:45:45.673316+0000 mon.raynor-sc-1 [INF] Active manager
daemon raynor-sc-3 restarted
2024-08-14T13:45:45.689515+0000 mon.raynor-sc-1 [INF] Active manager
daemon raynor-sc-3 restarted
2024-08-14T13:45:45.694315+0000 mon.raynor-sc-1 [INF] Activating
manager daemon raynor-sc-3
2024-08-14T13:45:47.671192+0000 mon.raynor-sc-1 [INF] Active manager
daemon raynor-sc-3 restarted
2024-08-14T13:45:47.674805+0000 mon.raynor-sc-1 [INF] Activating
manager daemon raynor-sc-3
2024-08-14T13:45:47.675037+0000 mon.raynor-sc-1 [INF] Active manager
daemon raynor-sc-3 restarted
2024-08-14T13:45:47.697264+0000 mon.raynor-sc-1 [INF] Active manager
daemon raynor-sc-3 restarted
2024-08-14T13:45:47.700886+0000 mon.raynor-sc-1 [INF] Activating
manager daemon raynor-sc-3
And in the output of "ceph -W cephadm" I see:
2024-08-14T13:40:32.214742+0000 mgr.raynor-sc-1 [INF] Upgrade: First
pull of localhost:5000/ceph/pacific:v16.2.15
2024-08-14T13:40:34.108767+0000 mgr.raynor-sc-1 [INF] Upgrade:
Target is localhost:5000/ceph/pacific:v16.2.15 with id
3c4eff6082ae7530e7eda038765ce400beb1bc1b8df67dffb45910eb45b06b2c
2024-08-14T13:40:34.112388+0000 mgr.raynor-sc-1 [INF] Upgrade:
Checking mgr daemons...
2024-08-14T13:40:34.112722+0000 mgr.raynor-sc-1 [INF] Upgrade: Need
to upgrade myself (mgr.raynor-sc-1)
2024-08-14T13:40:35.456432+0000 mgr.raynor-sc-1 [INF] It is presumed
safe to stop ['mgr.raynor-sc-2']
2024-08-14T13:40:35.456620+0000 mgr.raynor-sc-1 [INF] Upgrade: It is
presumed safe to stop ['mgr.raynor-sc-2']
2024-08-14T13:40:35.456771+0000 mgr.raynor-sc-1 [INF] Upgrade:
Redeploying mgr.raynor-sc-2
2024-08-14T13:40:35.481790+0000 mgr.raynor-sc-1 [INF] Deploying
daemon mgr.raynor-sc-2 on raynor-sc-2
2024-08-14T13:42:37.608895+0000 mgr.raynor-sc-1 [INF] refreshing
raynor-sc-3 facts
2024-08-14T13:42:39.744098+0000 mgr.raynor-sc-1 [INF] refreshing
raynor-sc-1 facts
2024-08-14T13:42:40.081740+0000 mgr.raynor-sc-1 [INF] refreshing
raynor-sc-2 facts
2024-08-14T13:42:40.937375+0000 mgr.raynor-sc-1 [INF] Applying drive
group all-available-devices on host raynor-sc-1...
2024-08-14T13:42:40.937732+0000 mgr.raynor-sc-1 [INF] Applying drive
group all-available-devices on host raynor-sc-2...
2024-08-14T13:42:40.938079+0000 mgr.raynor-sc-1 [INF] Applying drive
group all-available-devices on host raynor-sc-3...
2024-08-14T13:42:46.226231+0000 mgr.raynor-sc-1 [INF] Upgrade:
Target is localhost:5000/ceph/pacific:v16.2.15 with id
3c4eff6082ae7530e7eda038765ce400beb1bc1b8df67dffb45910eb45b06b2c
2024-08-14T13:42:46.229604+0000 mgr.raynor-sc-1 [INF] Upgrade:
Checking mgr daemons...
2024-08-14T13:42:46.229727+0000 mgr.raynor-sc-1 [INF] Upgrade: Need
to upgrade myself (mgr.raynor-sc-1)
2024-08-14T13:42:47.696110+0000 mgr.raynor-sc-1 [INF] It is presumed
safe to stop ['mgr.raynor-sc-3']
My take on this is that we first attempt to upgrade the mgr on
raynor-sc-2, and don't seem to detect that it hasn't quite worked.
Interestingly, there is a two-minute gap between the deploying line
and the following line; I wonder if something is failing to come up
and we're proceeding after a timer expires?
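I guess I could dig further into the cephadm module log and the mgr
unit on raynor-sc-2, something along these lines (the unit name just
assumes the standard cephadm naming with my fsid):

$ ceph log last cephadm
$ journalctl -u ceph-e773d9c2-6d8d-4413-8e8f-e38f248f5959@mgr.raynor-sc-2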
Any pointers are much appreciated.
Many thanks,
Alex
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx