Re: Cephadm Upgrade Issue

If you're referring to https://tracker.ceph.com/issues/57675, that fix went
into 16.2.14. There was also another issue where running `ceph orch
restart mgr` or `ceph orch redeploy mgr` would put the mgr daemons into an
endless restart loop, which blocked all orchestrator operations; that might
be what was really going on here. That one didn't have a tracker as far as
I know, but I believe it was fixed by
https://github.com/ceph/ceph/pull/41002, which went into 16.2.4. If the
version being upgraded from was earlier than that, the restart loop would
have had to be resolved before the upgrade could actually proceed.
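
For reference, a couple of commands that show whether the running daemons
are already past that fix before kicking off an upgrade (a rough sketch;
the mgr name is just taken from the cluster in this thread):

# which Ceph versions each daemon type is currently running
ceph versions

# the mgr daemons cephadm knows about, with their image and version
ceph orch ps --daemon-type mgr

# if the cephadm module gets wedged, failing over the active mgr often
# gets the orchestrator responding again (name as shown by `ceph -s`)
ceph mgr fail raynor-sc-1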

On Wed, Aug 14, 2024 at 1:07 PM Eugen Block <eblock@xxxxxx> wrote:

> A few of our customers were affected by that, but as far as I remember
> (I can look it up tomorrow), the actual issue only popped up if they had
> more than two MGRs. I believe it was resolved in a newer Pacific version
> (I don't have the exact version in mind). Which version did you try to
> upgrade to? There shouldn't be any reason to remove other daemons.
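>
> A quick way to check how many mgrs a cluster is actually running (just a
> sketch, nothing specific to your cluster):
>
> # the mgr service spec and how many daemons are running vs. expected
> ceph orch ls mgr
> # the active/standby split
> ceph -s | grep mgr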
>
>
> Quoting "Alex Hussein-Kershaw (HE/HIM)" <alexhus@xxxxxxxxxxxxx>:
>
> > I spotted this SUSE KB article: "Performing a `ceph orch restart mgr`
> > results in endless restart loop"
> > (https://www.suse.com/support/kb/doc/?id=000020530), which sounded
> > quite similar, so I gave it a go and did:
> >
> > ceph orch daemon rm mgr.raynor-sc-1
> > < wait a bit for it to be created >
> > < repeat for each host >
> >
> > That seemed to solve my problem. I upgraded and it just worked.
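> >
> > For anyone wanting to script the above, a rough loop version (the
> > hostnames are specific to my cluster, and --force may or may not be
> > needed depending on which mgr is active at the time):
> >
> > for host in raynor-sc-1 raynor-sc-2 raynor-sc-3; do
> >     # remove the adopted mgr on this host; the mgr service spec makes
> >     # cephadm schedule a replacement automatically
> >     ceph orch daemon rm "mgr.${host}" --force
> >     # give cephadm a moment to notice the removal
> >     sleep 30
> >     # wait for the replacement to show up as running before moving on
> >     until ceph orch ps --daemon-type mgr | grep "${host}" | grep -q running; do
> >         sleep 10
> >     done
> > done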
> >
> > It did get me wondering whether I should be doing the same for my
> > monitors (and even OSDs) post-adoption? They do seem to have a
> > different naming scheme.
> >
> > Best Wishes,
> > Alex
> >
> >
> > ________________________________
> > From: Alex Hussein-Kershaw (HE/HIM)
> > Sent: Wednesday, August 14, 2024 3:06 PM
> > To: ceph-users <ceph-users@xxxxxxx>
> > Subject: Cephadm Upgrade Issue
> >
> > Hi Folks,
> >
> > I'm prototyping the upgrade process for our Ceph clusters. I've
> > adopted the cluster following the docs, and that works nicely 🙂 I
> > then load my docker image into a locally running container registry,
> > as I'm in a disconnected environment. I have a test cluster with 3 VMs
> > and no data, adopted at Octopus and upgrading to Pacific. I'm
> > running a MON, MGR, MDS and OSD on each VM.
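> >
> > For completeness, getting the image into the local registry looks
> > roughly like this (the tarball name and the assumption that it was
> > saved from quay.io/ceph/ceph:v16.2.15 are just examples; the registry
> > on localhost:5000 is already running):
> >
> > # import the saved image tarball on a node that can reach the registry
> > docker load -i ceph-v16.2.15.tar
> > # retag it for the local registry and push it
> > docker tag quay.io/ceph/ceph:v16.2.15 localhost:5000/ceph/pacific:v16.2.15
> > docker push localhost:5000/ceph/pacific:v16.2.15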
> >
> > I then attempt to upgrade:
> > ceph orch upgrade start --image localhost:5000/ceph/pacific:v16.2.15
> >
> > Lots of logs below, but the summary appears to be that we initially
> > fail to upgrade the managers and get into a bad state. It looks like
> > there is some confusion in manager naming, and we end up with two
> > managers on each machine instead of one. Eventually Ceph reports a
> > health error:
> >
> > $ ceph -s
> >   cluster:
> >     id:     e773d9c2-6d8d-4413-8e8f-e38f248f5959
> >     health: HEALTH_ERR
> >             1 failed cephadm daemon(s)
> >             Module 'cephadm' has failed: 'cephadm'
> >
> > That does seem to eventually clean itself up and the upgrade
> > appears to have completed ("ceph versions" shows everything on
> > Pacific), but it feels a bit bumpy. Hoping someone has some guidance
> > here. The containers on one host during the upgrade are shown below.
> > Notice I somehow have two managers, where the names differ by a
> > single character (a "-" replaced with a "."):
> >
> > $ docker ps | grep mgr
> > 2143b6f0e0e6   localhost:5000/ceph/pacific:v16.2.15
> > "/usr/bin/ceph-mgr -…"   About a minute ago   Up About a minute
> >        ceph-e773d9c2-6d8d-4413-8e8f-e38f248f5959-mgr.raynor-sc-2
> > 59c8cfddac64   ceph-daemon:v5.0.12-stable-5.0-octopus-centos-8
> > "/usr/bin/ceph-mgr -…"   14 minutes ago       Up 14 minutes
> >        ceph-e773d9c2-6d8d-4413-8e8f-e38f248f5959-mgr-raynor-sc-2
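> >
> > The dot-named container is the one the Pacific cephadm just deployed,
> > while the dash-named one (still on the Octopus ceph-daemon image)
> > looks like the old container that never got cleaned up. A rough way
> > to compare what the orchestrator thinks is running against what is
> > actually on the host:
> >
> > # the orchestrator's view of the mgr daemons
> > ceph orch ps --daemon-type mgr
> > # what cephadm finds on the host itself (run directly on raynor-sc-2)
> > cephadm ls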
> >
> > In the output of "ceph -w" I see this sort of stuff:
> >
> > 2024-08-14T13:45:13.003405+0000 mon.raynor-sc-1 [INF] Manager daemon
> > raynor-sc-3 is now available
> > 2024-08-14T13:45:23.179699+0000 mon.raynor-sc-1 [ERR] Health check
> > failed: Module 'cephadm' has failed: 'cephadm' (MGR_MODULE_ERROR)
> > 2024-08-14T13:45:22.372376+0000 mgr.raynor-sc-3 [ERR] Unhandled
> > exception from module 'cephadm' while running on mgr.raynor-sc-3:
> > 'cephadm'
> > 2024-08-14T13:45:24.761961+0000 mon.raynor-sc-1 [INF] Active manager
> > daemon raynor-sc-3 restarted
> > 2024-08-14T13:45:24.766395+0000 mon.raynor-sc-1 [INF] Activating
> > manager daemon raynor-sc-3
> > 2024-08-14T13:45:31.800989+0000 mon.raynor-sc-1 [INF] Manager daemon
> > raynor-sc-3 is now available
> > 2024-08-14T13:45:32.874227+0000 mon.raynor-sc-1 [INF] Health check
> > cleared: MGR_MODULE_ERROR (was: Module 'cephadm' has failed:
> > 'cephadm')
> > 2024-08-14T13:45:32.874269+0000 mon.raynor-sc-1 [INF] Cluster is now
> healthy
> > 2024-08-14T13:45:33.664602+0000 mon.raynor-sc-1 [INF] Active manager
> > daemon raynor-sc-3 restarted
> > 2024-08-14T13:45:33.671809+0000 mon.raynor-sc-1 [INF] Activating
> > manager daemon raynor-sc-3
> > 2024-08-14T13:45:34.050292+0000 mon.raynor-sc-1 [INF] Manager daemon
> > raynor-sc-3 is now available
> > 2024-08-14T13:45:38.260385+0000 mon.raynor-sc-1 [WRN] Health check
> > failed: 1 failed cephadm daemon(s) (CEPHADM_FAILED_DAEMON)
> > 2024-08-14T13:45:43.462665+0000 mgr.raynor-sc-3 [ERR] Unhandled
> > exception from module 'cephadm' while running on mgr.raynor-sc-3:
> > 'cephadm'
> > 2024-08-14T13:45:44.770711+0000 mon.raynor-sc-1 [ERR] Health check
> > failed: Module 'cephadm' has failed: 'cephadm' (MGR_MODULE_ERROR)
> > 2024-08-14T13:45:45.668379+0000 mon.raynor-sc-1 [INF] Active manager
> > daemon raynor-sc-3 restarted
> > 2024-08-14T13:45:45.673206+0000 mon.raynor-sc-1 [INF] Activating
> > manager daemon raynor-sc-3
> > 2024-08-14T13:45:45.673316+0000 mon.raynor-sc-1 [INF] Active manager
> > daemon raynor-sc-3 restarted
> > 2024-08-14T13:45:45.689515+0000 mon.raynor-sc-1 [INF] Active manager
> > daemon raynor-sc-3 restarted
> > 2024-08-14T13:45:45.694315+0000 mon.raynor-sc-1 [INF] Activating
> > manager daemon raynor-sc-3
> > 2024-08-14T13:45:47.671192+0000 mon.raynor-sc-1 [INF] Active manager
> > daemon raynor-sc-3 restarted
> > 2024-08-14T13:45:47.674805+0000 mon.raynor-sc-1 [INF] Activating
> > manager daemon raynor-sc-3
> > 2024-08-14T13:45:47.675037+0000 mon.raynor-sc-1 [INF] Active manager
> > daemon raynor-sc-3 restarted
> > 2024-08-14T13:45:47.697264+0000 mon.raynor-sc-1 [INF] Active manager
> > daemon raynor-sc-3 restarted
> > 2024-08-14T13:45:47.700886+0000 mon.raynor-sc-1 [INF] Activating
> > manager daemon raynor-sc-3
> >
> > And in the output of "ceph -W cephadm" I see:
> >
> > 2024-08-14T13:40:32.214742+0000 mgr.raynor-sc-1 [INF] Upgrade: First
> > pull of localhost:5000/ceph/pacific:v16.2.15
> > 2024-08-14T13:40:34.108767+0000 mgr.raynor-sc-1 [INF] Upgrade:
> > Target is localhost:5000/ceph/pacific:v16.2.15 with id
> > 3c4eff6082ae7530e7eda038765ce400beb1bc1b8df67dffb45910eb45b06b2c
> > 2024-08-14T13:40:34.112388+0000 mgr.raynor-sc-1 [INF] Upgrade:
> > Checking mgr daemons...
> > 2024-08-14T13:40:34.112722+0000 mgr.raynor-sc-1 [INF] Upgrade: Need
> > to upgrade myself (mgr.raynor-sc-1)
> > 2024-08-14T13:40:35.456432+0000 mgr.raynor-sc-1 [INF] It is presumed
> > safe to stop ['mgr.raynor-sc-2']
> > 2024-08-14T13:40:35.456620+0000 mgr.raynor-sc-1 [INF] Upgrade: It is
> > presumed safe to stop ['mgr.raynor-sc-2']
> > 2024-08-14T13:40:35.456771+0000 mgr.raynor-sc-1 [INF] Upgrade:
> > Redeploying mgr.raynor-sc-2
> > 2024-08-14T13:40:35.481790+0000 mgr.raynor-sc-1 [INF] Deploying
> > daemon mgr.raynor-sc-2 on raynor-sc-2
> > 2024-08-14T13:42:37.608895+0000 mgr.raynor-sc-1 [INF] refreshing
> > raynor-sc-3 facts
> > 2024-08-14T13:42:39.744098+0000 mgr.raynor-sc-1 [INF] refreshing
> > raynor-sc-1 facts
> > 2024-08-14T13:42:40.081740+0000 mgr.raynor-sc-1 [INF] refreshing
> > raynor-sc-2 facts
> > 2024-08-14T13:42:40.937375+0000 mgr.raynor-sc-1 [INF] Applying drive
> > group all-available-devices on host raynor-sc-1...
> > 2024-08-14T13:42:40.937732+0000 mgr.raynor-sc-1 [INF] Applying drive
> > group all-available-devices on host raynor-sc-2...
> > 2024-08-14T13:42:40.938079+0000 mgr.raynor-sc-1 [INF] Applying drive
> > group all-available-devices on host raynor-sc-3...
> > 2024-08-14T13:42:46.226231+0000 mgr.raynor-sc-1 [INF] Upgrade:
> > Target is localhost:5000/ceph/pacific:v16.2.15 with id
> > 3c4eff6082ae7530e7eda038765ce400beb1bc1b8df67dffb45910eb45b06b2c
> > 2024-08-14T13:42:46.229604+0000 mgr.raynor-sc-1 [INF] Upgrade:
> > Checking mgr daemons...
> > 2024-08-14T13:42:46.229727+0000 mgr.raynor-sc-1 [INF] Upgrade: Need
> > to upgrade myself (mgr.raynor-sc-1)
> > 2024-08-14T13:42:47.696110+0000 mgr.raynor-sc-1 [INF] It is presumed
> > safe to stop ['mgr.raynor-sc-3']
> >
> > My take on this is that we first attempt to upgrade the mgr on
> > raynor-sc-2 and don't seem to detect that it hasn't quite worked.
> > Interestingly, there is a two-minute gap between the "Deploying
> > daemon" line and the following one; I wonder if something is failing
> > to come up and we're only proceeding after a timer expires?
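> >
> > In case it's useful to anyone digging into this, these are the places
> > I'd look for what happens in that gap (a rough list, nothing
> > authoritative):
> >
> > # current upgrade state and any error cephadm has recorded
> > ceph orch upgrade status
> > # recent cephadm log lines, including the traceback behind the
> > # "Module 'cephadm' has failed" health error
> > ceph log last cephadm
> > # full health detail while the upgrade is in flight
> > ceph health detail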
> >
> > Any pointers are much appreciated.
> >
> > Many thanks,
> > Alex
> >
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



