I don't think Pacific has the upgrade error handling work, so it's a bit tougher to debug here. I think it should have printed a traceback into the logs, though. Maybe if you check `ceph log last 200 cephadm` right after it crashes there will be something there. If not, you might need to do a `ceph mgr fail` to try and get it to hit the error again.
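Something along these lines is what I mean (rough sketch, syntax from memory; the 200 is just "show the last 200 cluster log entries" and can be anything):

  # look for a traceback from the cephadm module in the cluster log
  ceph log last 200 cephadm

  # if nothing useful is there, bounce the active mgr so the module hits
  # the error again, then watch the cephadm channel for the traceback
  ceph mgr fail
  ceph -W cephadm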
On Wed, Aug 14, 2024 at 10:07 AM Alex Hussein-Kershaw (HE/HIM) <alexhus@xxxxxxxxxxxxx> wrote:

> Hi Folks,
>
> I'm prototyping the upgrade process for our Ceph clusters. I've adopted the cluster following the docs, and that works nicely 🙂 I then load my docker image into a locally running container registry, as I'm in a disconnected environment. I have a test cluster with 3 VMs and no data, adopted at Octopus and upgrading to Pacific. I'm running a MON, MGR, MDS and OSD on each VM.
>
> I then attempt to upgrade:
> ceph orch upgrade start --image localhost:5000/ceph/pacific:v16.2.15
>
> Lots of logs below, but the summary appears to be that we initially fail to upgrade the managers and get into a bad state. It looks like there is some confusion in manager naming, and we end up with two managers on each machine instead of one. Eventually Ceph reports a health warning:
>
> $ ceph -s
>   cluster:
>     id:     e773d9c2-6d8d-4413-8e8f-e38f248f5959
>     health: HEALTH_ERR
>             1 failed cephadm daemon(s)
>             Module 'cephadm' has failed: 'cephadm'
>
> That does seem to eventually clean itself up and the upgrade appears to have completed ("ceph versions" shows everything on Pacific), but it feels a bit bumpy. Hoping someone has some guidance here. The containers on one host during the upgrade are shown below. Notice I somehow have two managers, where the names differ by a single character (a "-" replaced with a "."):
>
> $ docker ps | grep mgr
> 2143b6f0e0e6  localhost:5000/ceph/pacific:v16.2.15             "/usr/bin/ceph-mgr -…"  About a minute ago  Up About a minute  ceph-e773d9c2-6d8d-4413-8e8f-e38f248f5959-mgr.raynor-sc-2
> 59c8cfddac64  ceph-daemon:v5.0.12-stable-5.0-octopus-centos-8  "/usr/bin/ceph-mgr -…"  14 minutes ago      Up 14 minutes      ceph-e773d9c2-6d8d-4413-8e8f-e38f248f5959-mgr-raynor-sc-2
>
> In the output of "ceph -w" I see this sort of stuff:
>
> 2024-08-14T13:45:13.003405+0000 mon.raynor-sc-1 [INF] Manager daemon raynor-sc-3 is now available
> 2024-08-14T13:45:23.179699+0000 mon.raynor-sc-1 [ERR] Health check failed: Module 'cephadm' has failed: 'cephadm' (MGR_MODULE_ERROR)
> 2024-08-14T13:45:22.372376+0000 mgr.raynor-sc-3 [ERR] Unhandled exception from module 'cephadm' while running on mgr.raynor-sc-3: 'cephadm'
> 2024-08-14T13:45:24.761961+0000 mon.raynor-sc-1 [INF] Active manager daemon raynor-sc-3 restarted
> 2024-08-14T13:45:24.766395+0000 mon.raynor-sc-1 [INF] Activating manager daemon raynor-sc-3
> 2024-08-14T13:45:31.800989+0000 mon.raynor-sc-1 [INF] Manager daemon raynor-sc-3 is now available
> 2024-08-14T13:45:32.874227+0000 mon.raynor-sc-1 [INF] Health check cleared: MGR_MODULE_ERROR (was: Module 'cephadm' has failed: 'cephadm')
> 2024-08-14T13:45:32.874269+0000 mon.raynor-sc-1 [INF] Cluster is now healthy
> 2024-08-14T13:45:33.664602+0000 mon.raynor-sc-1 [INF] Active manager daemon raynor-sc-3 restarted
> 2024-08-14T13:45:33.671809+0000 mon.raynor-sc-1 [INF] Activating manager daemon raynor-sc-3
> 2024-08-14T13:45:34.050292+0000 mon.raynor-sc-1 [INF] Manager daemon raynor-sc-3 is now available
> 2024-08-14T13:45:38.260385+0000 mon.raynor-sc-1 [WRN] Health check failed: 1 failed cephadm daemon(s) (CEPHADM_FAILED_DAEMON)
> 2024-08-14T13:45:43.462665+0000 mgr.raynor-sc-3 [ERR] Unhandled exception from module 'cephadm' while running on mgr.raynor-sc-3: 'cephadm'
> 2024-08-14T13:45:44.770711+0000 mon.raynor-sc-1 [ERR] Health check failed: Module 'cephadm' has failed: 'cephadm' (MGR_MODULE_ERROR)
> 2024-08-14T13:45:45.668379+0000 mon.raynor-sc-1 [INF] Active manager daemon raynor-sc-3 restarted
> 2024-08-14T13:45:45.673206+0000 mon.raynor-sc-1 [INF] Activating manager daemon raynor-sc-3
> 2024-08-14T13:45:45.673316+0000 mon.raynor-sc-1 [INF] Active manager daemon raynor-sc-3 restarted
> 2024-08-14T13:45:45.689515+0000 mon.raynor-sc-1 [INF] Active manager daemon raynor-sc-3 restarted
> 2024-08-14T13:45:45.694315+0000 mon.raynor-sc-1 [INF] Activating manager daemon raynor-sc-3
> 2024-08-14T13:45:47.671192+0000 mon.raynor-sc-1 [INF] Active manager daemon raynor-sc-3 restarted
> 2024-08-14T13:45:47.674805+0000 mon.raynor-sc-1 [INF] Activating manager daemon raynor-sc-3
> 2024-08-14T13:45:47.675037+0000 mon.raynor-sc-1 [INF] Active manager daemon raynor-sc-3 restarted
> 2024-08-14T13:45:47.697264+0000 mon.raynor-sc-1 [INF] Active manager daemon raynor-sc-3 restarted
> 2024-08-14T13:45:47.700886+0000 mon.raynor-sc-1 [INF] Activating manager daemon raynor-sc-3
>
> And in the output of "ceph -W cephadm" I see:
>
> 2024-08-14T13:40:32.214742+0000 mgr.raynor-sc-1 [INF] Upgrade: First pull of localhost:5000/ceph/pacific:v16.2.15
> 2024-08-14T13:40:34.108767+0000 mgr.raynor-sc-1 [INF] Upgrade: Target is localhost:5000/ceph/pacific:v16.2.15 with id 3c4eff6082ae7530e7eda038765ce400beb1bc1b8df67dffb45910eb45b06b2c
> 2024-08-14T13:40:34.112388+0000 mgr.raynor-sc-1 [INF] Upgrade: Checking mgr daemons...
> 2024-08-14T13:40:34.112722+0000 mgr.raynor-sc-1 [INF] Upgrade: Need to upgrade myself (mgr.raynor-sc-1)
> 2024-08-14T13:40:35.456432+0000 mgr.raynor-sc-1 [INF] It is presumed safe to stop ['mgr.raynor-sc-2']
> 2024-08-14T13:40:35.456620+0000 mgr.raynor-sc-1 [INF] Upgrade: It is presumed safe to stop ['mgr.raynor-sc-2']
> 2024-08-14T13:40:35.456771+0000 mgr.raynor-sc-1 [INF] Upgrade: Redeploying mgr.raynor-sc-2
> 2024-08-14T13:40:35.481790+0000 mgr.raynor-sc-1 [INF] Deploying daemon mgr.raynor-sc-2 on raynor-sc-2
> 2024-08-14T13:42:37.608895+0000 mgr.raynor-sc-1 [INF] refreshing raynor-sc-3 facts
> 2024-08-14T13:42:39.744098+0000 mgr.raynor-sc-1 [INF] refreshing raynor-sc-1 facts
> 2024-08-14T13:42:40.081740+0000 mgr.raynor-sc-1 [INF] refreshing raynor-sc-2 facts
> 2024-08-14T13:42:40.937375+0000 mgr.raynor-sc-1 [INF] Applying drive group all-available-devices on host raynor-sc-1...
> 2024-08-14T13:42:40.937732+0000 mgr.raynor-sc-1 [INF] Applying drive group all-available-devices on host raynor-sc-2...
> 2024-08-14T13:42:40.938079+0000 mgr.raynor-sc-1 [INF] Applying drive group all-available-devices on host raynor-sc-3...
> 2024-08-14T13:42:46.226231+0000 mgr.raynor-sc-1 [INF] Upgrade: Target is localhost:5000/ceph/pacific:v16.2.15 with id 3c4eff6082ae7530e7eda038765ce400beb1bc1b8df67dffb45910eb45b06b2c
> 2024-08-14T13:42:46.229604+0000 mgr.raynor-sc-1 [INF] Upgrade: Checking mgr daemons...
> 2024-08-14T13:42:46.229727+0000 mgr.raynor-sc-1 [INF] Upgrade: Need to upgrade myself (mgr.raynor-sc-1)
> 2024-08-14T13:42:47.696110+0000 mgr.raynor-sc-1 [INF] It is presumed safe to stop ['mgr.raynor-sc-3']
>
> My take on this is that we first start to attempt upgrade of the mgr on raynor-sc-2, and don't seem to detect that it's not quite worked.
> Interestingly, it seems that there is a two-minute gap between the "Deploying daemon mgr.raynor-sc-2" line and the following one; I wonder if something is failing to come up and we're proceeding after a timer expires?
>
> Any pointers are much appreciated.
>
> Many thanks,
> Alex
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx