Re: Cephadm Upgrade Issue

A few of our customers were affected by that, but as far as I remember (I can look it up tomorrow), the actual issue only popped up if they had more than two MGRs. I believe it was resolved in a newer Pacific release, though I don't have the exact version in mind. Which version did you try to upgrade to? There shouldn't be any reason to remove other daemons.
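If it helps to compare notes, the MGR count and the per-daemon versions are quick to check (nothing cluster-specific here, just the standard orchestrator commands):

ceph orch ls mgr                  # how many MGR daemons the service spec wants vs. how many are running
ceph orch ps --daemon-type mgr    # one line per MGR daemon, including the container image it runs
ceph versions                     # version breakdown per daemon type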


Quoting "Alex Hussein-Kershaw (HE/HIM)" <alexhus@xxxxxxxxxxxxx>:

I spotted this SUSE KB article: "Performing a `ceph orch restart mgr` results in endless restart loop" <https://www.suse.com/support/kb/doc/?id=000020530>, which sounded quite similar, so I gave it a go and did:

ceph orch daemon rm mgr.raynor-sc-1
< wait a bit for it to be created >
< repeat for each host >
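A sanity check between removals could be as simple as the following (just the standard status commands, nothing clever):

ceph orch ps --daemon-type mgr --refresh   # wait until the removed mgr reappears as running
ceph -s                                    # confirm an active mgr plus standbys before moving on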

That seemed to solve my problem. I upgraded and it just worked.

It did get me wondering whether I should be doing the same for my monitors (and even OSDs) post-adoption? They do seem to have a different naming scheme.
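(For reference, the orchestrator can also redeploy a daemon in place rather than removing it, which might be the gentler option if the mons ever need the same treatment, e.g.:

ceph orch daemon redeploy mon.raynor-sc-1

The mon name above is just the hostname-style id from this cluster.)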

Best Wishes,
Alex


________________________________
From: Alex Hussein-Kershaw (HE/HIM)
Sent: Wednesday, August 14, 2024 3:06 PM
To: ceph-users <ceph-users@xxxxxxx>
Subject: Cephadm Upgrade Issue

Hi Folks,

I'm prototyping the upgrade process for our Ceph clusters. I've adopted the cluster following the docs, and that works nicely 🙂 I then load my Docker image into a locally running container registry, as I'm in a disconnected environment. I have a test cluster with 3 VMs and no data, adopted at Octopus and upgrading to Pacific. I'm running a MON, MGR, MDS and OSD on each VM.
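For completeness, getting the image into the local registry is just the usual load/tag/push sequence, roughly as follows (the tarball name and source tag shown here are only illustrative):

docker load -i ceph-v16.2.15.tar                                            # hypothetical tarball name
docker tag quay.io/ceph/ceph:v16.2.15 localhost:5000/ceph/pacific:v16.2.15
docker push localhost:5000/ceph/pacific:v16.2.15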

I then attempt to upgrade:
ceph orch upgrade start --image localhost:5000/ceph/pacific:v16.2.15
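Progress can be followed while that runs, and the upgrade paused or stopped if it misbehaves:

ceph orch upgrade status   # target image, progress, and any error message
ceph -W cephadm            # follow the cephadm module log live
ceph orch upgrade pause    # or: ceph orch upgrade stop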

Lots of logs below, but the summary appears to be that we initially fail to upgrade the managers and get into a bad state. It looks like there is some confusion in manager naming, and we end up with two managers on each machine instead of one. Eventually Ceph reports a health error:

$ ceph -s
  cluster:
    id:     e773d9c2-6d8d-4413-8e8f-e38f248f5959
    health: HEALTH_ERR
            1 failed cephadm daemon(s)
            Module 'cephadm' has failed: 'cephadm'

That does seem to eventually clean itself up, and the upgrade appears to have completed ("ceph versions" shows everything on Pacific), but it feels a bit bumpy. Hoping someone has some guidance here. The containers on one host during upgrade are shown below. Notice I somehow have two managers, where the names differ by a single character (a "-" replaced with a "."):

$ docker ps | grep mgr
2143b6f0e0e6   localhost:5000/ceph/pacific:v16.2.15              "/usr/bin/ceph-mgr -…"   About a minute ago   Up About a minute   ceph-e773d9c2-6d8d-4413-8e8f-e38f248f5959-mgr.raynor-sc-2
59c8cfddac64   ceph-daemon:v5.0.12-stable-5.0-octopus-centos-8   "/usr/bin/ceph-mgr -…"   14 minutes ago       Up 14 minutes       ceph-e773d9c2-6d8d-4413-8e8f-e38f248f5959-mgr-raynor-sc-2
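The dotted entry is the cephadm-style daemon name (mgr.raynor-sc-2) on the new Pacific image; the dashed one looks like the old ceph-daemon container from before adoption, judging by its image. A quick way to see which daemons cephadm itself is tracking on a host, and how each was adopted, is roughly:

sudo cephadm ls | grep -E '"name"|"style"'   # per-daemon name plus whether it is "legacy" or "cephadm:v1"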

In the output of "ceph -w" I see this sort of stuff:

2024-08-14T13:45:13.003405+0000 mon.raynor-sc-1 [INF] Manager daemon raynor-sc-3 is now available
2024-08-14T13:45:23.179699+0000 mon.raynor-sc-1 [ERR] Health check failed: Module 'cephadm' has failed: 'cephadm' (MGR_MODULE_ERROR)
2024-08-14T13:45:22.372376+0000 mgr.raynor-sc-3 [ERR] Unhandled exception from module 'cephadm' while running on mgr.raynor-sc-3: 'cephadm'
2024-08-14T13:45:24.761961+0000 mon.raynor-sc-1 [INF] Active manager daemon raynor-sc-3 restarted
2024-08-14T13:45:24.766395+0000 mon.raynor-sc-1 [INF] Activating manager daemon raynor-sc-3
2024-08-14T13:45:31.800989+0000 mon.raynor-sc-1 [INF] Manager daemon raynor-sc-3 is now available
2024-08-14T13:45:32.874227+0000 mon.raynor-sc-1 [INF] Health check cleared: MGR_MODULE_ERROR (was: Module 'cephadm' has failed: 'cephadm')
2024-08-14T13:45:32.874269+0000 mon.raynor-sc-1 [INF] Cluster is now healthy
2024-08-14T13:45:33.664602+0000 mon.raynor-sc-1 [INF] Active manager daemon raynor-sc-3 restarted
2024-08-14T13:45:33.671809+0000 mon.raynor-sc-1 [INF] Activating manager daemon raynor-sc-3
2024-08-14T13:45:34.050292+0000 mon.raynor-sc-1 [INF] Manager daemon raynor-sc-3 is now available
2024-08-14T13:45:38.260385+0000 mon.raynor-sc-1 [WRN] Health check failed: 1 failed cephadm daemon(s) (CEPHADM_FAILED_DAEMON)
2024-08-14T13:45:43.462665+0000 mgr.raynor-sc-3 [ERR] Unhandled exception from module 'cephadm' while running on mgr.raynor-sc-3: 'cephadm'
2024-08-14T13:45:44.770711+0000 mon.raynor-sc-1 [ERR] Health check failed: Module 'cephadm' has failed: 'cephadm' (MGR_MODULE_ERROR)
2024-08-14T13:45:45.668379+0000 mon.raynor-sc-1 [INF] Active manager daemon raynor-sc-3 restarted
2024-08-14T13:45:45.673206+0000 mon.raynor-sc-1 [INF] Activating manager daemon raynor-sc-3
2024-08-14T13:45:45.673316+0000 mon.raynor-sc-1 [INF] Active manager daemon raynor-sc-3 restarted
2024-08-14T13:45:45.689515+0000 mon.raynor-sc-1 [INF] Active manager daemon raynor-sc-3 restarted
2024-08-14T13:45:45.694315+0000 mon.raynor-sc-1 [INF] Activating manager daemon raynor-sc-3
2024-08-14T13:45:47.671192+0000 mon.raynor-sc-1 [INF] Active manager daemon raynor-sc-3 restarted
2024-08-14T13:45:47.674805+0000 mon.raynor-sc-1 [INF] Activating manager daemon raynor-sc-3
2024-08-14T13:45:47.675037+0000 mon.raynor-sc-1 [INF] Active manager daemon raynor-sc-3 restarted
2024-08-14T13:45:47.697264+0000 mon.raynor-sc-1 [INF] Active manager daemon raynor-sc-3 restarted
2024-08-14T13:45:47.700886+0000 mon.raynor-sc-1 [INF] Activating manager daemon raynor-sc-3

And in the output of "ceph -W cephadm" I see:

2024-08-14T13:40:32.214742+0000 mgr.raynor-sc-1 [INF] Upgrade: First pull of localhost:5000/ceph/pacific:v16.2.15
2024-08-14T13:40:34.108767+0000 mgr.raynor-sc-1 [INF] Upgrade: Target is localhost:5000/ceph/pacific:v16.2.15 with id 3c4eff6082ae7530e7eda038765ce400beb1bc1b8df67dffb45910eb45b06b2c
2024-08-14T13:40:34.112388+0000 mgr.raynor-sc-1 [INF] Upgrade: Checking mgr daemons...
2024-08-14T13:40:34.112722+0000 mgr.raynor-sc-1 [INF] Upgrade: Need to upgrade myself (mgr.raynor-sc-1)
2024-08-14T13:40:35.456432+0000 mgr.raynor-sc-1 [INF] It is presumed safe to stop ['mgr.raynor-sc-2']
2024-08-14T13:40:35.456620+0000 mgr.raynor-sc-1 [INF] Upgrade: It is presumed safe to stop ['mgr.raynor-sc-2']
2024-08-14T13:40:35.456771+0000 mgr.raynor-sc-1 [INF] Upgrade: Redeploying mgr.raynor-sc-2
2024-08-14T13:40:35.481790+0000 mgr.raynor-sc-1 [INF] Deploying daemon mgr.raynor-sc-2 on raynor-sc-2
2024-08-14T13:42:37.608895+0000 mgr.raynor-sc-1 [INF] refreshing raynor-sc-3 facts
2024-08-14T13:42:39.744098+0000 mgr.raynor-sc-1 [INF] refreshing raynor-sc-1 facts
2024-08-14T13:42:40.081740+0000 mgr.raynor-sc-1 [INF] refreshing raynor-sc-2 facts
2024-08-14T13:42:40.937375+0000 mgr.raynor-sc-1 [INF] Applying drive group all-available-devices on host raynor-sc-1...
2024-08-14T13:42:40.937732+0000 mgr.raynor-sc-1 [INF] Applying drive group all-available-devices on host raynor-sc-2...
2024-08-14T13:42:40.938079+0000 mgr.raynor-sc-1 [INF] Applying drive group all-available-devices on host raynor-sc-3...
2024-08-14T13:42:46.226231+0000 mgr.raynor-sc-1 [INF] Upgrade: Target is localhost:5000/ceph/pacific:v16.2.15 with id 3c4eff6082ae7530e7eda038765ce400beb1bc1b8df67dffb45910eb45b06b2c
2024-08-14T13:42:46.229604+0000 mgr.raynor-sc-1 [INF] Upgrade: Checking mgr daemons...
2024-08-14T13:42:46.229727+0000 mgr.raynor-sc-1 [INF] Upgrade: Need to upgrade myself (mgr.raynor-sc-1)
2024-08-14T13:42:47.696110+0000 mgr.raynor-sc-1 [INF] It is presumed safe to stop ['mgr.raynor-sc-3']

My take on this is that we first attempt to upgrade the mgr on raynor-sc-2 and don't seem to detect that it hasn't quite worked. Interestingly, there is a two-minute gap between the deploying line and the following line; I wonder if something is failing to come up and we're proceeding after a timer expires?
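For anyone hitting the same thing, the usual commands to surface the underlying traceback from the cephadm module are:

ceph health detail        # full text behind MGR_MODULE_ERROR / CEPHADM_FAILED_DAEMON
ceph log last cephadm     # recent messages from the cephadm log channel
ceph mgr fail             # fail over to a standby mgr, which reloads the cephadm module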

Any pointers are much appreciated.

Many thanks,
Alex


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

