A few of our customers were affected by that, but as far as I remember
(I can look it up tomorrow), the actual issue only popped up if they
had more than two MGRs. I believe it was resolved in a newer Pacific
version (I don't have the exact version in mind). Which version did
you try to upgrade to? There shouldn't be any reason to remove other
daemons.
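In the meantime, you can double-check how many mgr daemons the
orchestrator is running and which versions are active with the usual
commands, for example:

ceph orch ps --daemon-type mgr
ceph versions
ceph orch upgrade status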
Quoting "Alex Hussein-Kershaw (HE/HIM)" <alexhus@xxxxxxxxxxxxx>:
I spotted this SUSE knowledge base article: "Performing a `ceph orch
restart mgr` results in endless restart loop"
(https://www.suse.com/support/kb/doc/?id=000020530), which sounded
quite similar, so I gave it a go and did:
ceph orch daemon rm mgr.raynor-sc-1
< wait a bit for it to be created >
< repeat for each host >
That seemed to solve my problem. I upgraded and it just worked.
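For anyone else trying this, it's probably worth confirming between
removals that the mgr really has been recreated before moving on to
the next host, something like:

$ ceph orch ps --daemon-type mgr
$ ceph -s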
It did get me wondering whether I should be doing the same for my
monitors (and even OSDs) post-adoption. They do seem to have a
different naming scheme.
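For example, comparing the names the orchestrator reports against
the running containers should show whether they line up:

$ ceph orch ps --daemon-type mon
$ ceph orch ps --daemon-type osd
$ docker ps --format '{{.Names}}'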
Best Wishes,
Alex
________________________________
From: Alex Hussein-Kershaw (HE/HIM)
Sent: Wednesday, August 14, 2024 3:06 PM
To: ceph-users <ceph-users@xxxxxxx>
Subject: Cephadm Upgrade Issue
Hi Folks,
I'm prototyping the upgrade process for our Ceph clusters. I've
adopted the cluster following the docs, and that works nicely 🙂 I
then load my Docker image into a locally running container registry,
as I'm in a disconnected environment. I have a test cluster with 3
VMs and no data, adopted at Octopus and upgrading to Pacific. I'm
running a MON, MGR, MDS, and OSD on each VM.
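For context, getting the image into that local registry is roughly
the following (the tarball name and source tag are just examples of
how I ship it into the disconnected environment):

$ docker load -i ceph-v16.2.15.tar
$ docker tag quay.io/ceph/ceph:v16.2.15 localhost:5000/ceph/pacific:v16.2.15
$ docker push localhost:5000/ceph/pacific:v16.2.15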
I then attempt to upgrade:
ceph orch upgrade start --image localhost:5000/ceph/pacific:v16.2.15
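Progress can then be followed with:

ceph orch upgrade status
ceph -W cephadm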
Lots of logs below, but the summary appears to be that we initially
fail to upgrade the managers and get into a bad state. It looks like
there is some confusion in manager naming, and we end up with two
managers on each machine instead of one. Eventually Ceph reports a
health error:
$ ceph -s
  cluster:
    id:     e773d9c2-6d8d-4413-8e8f-e38f248f5959
    health: HEALTH_ERR
            1 failed cephadm daemon(s)
            Module 'cephadm' has failed: 'cephadm'
That does seem to eventually clean itself up, and the upgrade appears
to have completed ("ceph versions" shows everything on Pacific), but
it feels a bit bumpy. Hoping someone has some guidance here. The
containers on one host during the upgrade are shown below. Notice I
somehow have two managers, whose names differ by a single character
(a "-" replaced with a "."):
$ docker ps | grep mgr
2143b6f0e0e6 localhost:5000/ceph/pacific:v16.2.15
"/usr/bin/ceph-mgr -…" About a minute ago Up About a minute
ceph-e773d9c2-6d8d-4413-8e8f-e38f248f5959-mgr.raynor-sc-2
59c8cfddac64 ceph-daemon:v5.0.12-stable-5.0-octopus-centos-8
"/usr/bin/ceph-mgr -…" 14 minutes ago Up 14 minutes
ceph-e773d9c2-6d8d-4413-8e8f-e38f248f5959-mgr-raynor-sc-2
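If it's useful, running "cephadm ls" on the host lists every daemon
it knows about together with its style (legacy vs. cephadm-managed),
which should show where the second mgr is coming from, e.g.:

$ cephadm ls | grep -e '"name"' -e '"style"'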
In the output of "ceph -w" I see this sort of stuff:
2024-08-14T13:45:13.003405+0000 mon.raynor-sc-1 [INF] Manager daemon
raynor-sc-3 is now available
2024-08-14T13:45:23.179699+0000 mon.raynor-sc-1 [ERR] Health check
failed: Module 'cephadm' has failed: 'cephadm' (MGR_MODULE_ERROR)
2024-08-14T13:45:22.372376+0000 mgr.raynor-sc-3 [ERR] Unhandled
exception from module 'cephadm' while running on mgr.raynor-sc-3:
'cephadm'
2024-08-14T13:45:24.761961+0000 mon.raynor-sc-1 [INF] Active manager
daemon raynor-sc-3 restarted
2024-08-14T13:45:24.766395+0000 mon.raynor-sc-1 [INF] Activating
manager daemon raynor-sc-3
2024-08-14T13:45:31.800989+0000 mon.raynor-sc-1 [INF] Manager daemon
raynor-sc-3 is now available
2024-08-14T13:45:32.874227+0000 mon.raynor-sc-1 [INF] Health check
cleared: MGR_MODULE_ERROR (was: Module 'cephadm' has failed:
'cephadm')
2024-08-14T13:45:32.874269+0000 mon.raynor-sc-1 [INF] Cluster is now healthy
2024-08-14T13:45:33.664602+0000 mon.raynor-sc-1 [INF] Active manager
daemon raynor-sc-3 restarted
2024-08-14T13:45:33.671809+0000 mon.raynor-sc-1 [INF] Activating
manager daemon raynor-sc-3
2024-08-14T13:45:34.050292+0000 mon.raynor-sc-1 [INF] Manager daemon
raynor-sc-3 is now available
2024-08-14T13:45:38.260385+0000 mon.raynor-sc-1 [WRN] Health check
failed: 1 failed cephadm daemon(s) (CEPHADM_FAILED_DAEMON)
2024-08-14T13:45:43.462665+0000 mgr.raynor-sc-3 [ERR] Unhandled
exception from module 'cephadm' while running on mgr.raynor-sc-3:
'cephadm'
2024-08-14T13:45:44.770711+0000 mon.raynor-sc-1 [ERR] Health check
failed: Module 'cephadm' has failed: 'cephadm' (MGR_MODULE_ERROR)
2024-08-14T13:45:45.668379+0000 mon.raynor-sc-1 [INF] Active manager
daemon raynor-sc-3 restarted
2024-08-14T13:45:45.673206+0000 mon.raynor-sc-1 [INF] Activating
manager daemon raynor-sc-3
2024-08-14T13:45:45.673316+0000 mon.raynor-sc-1 [INF] Active manager
daemon raynor-sc-3 restarted
2024-08-14T13:45:45.689515+0000 mon.raynor-sc-1 [INF] Active manager
daemon raynor-sc-3 restarted
2024-08-14T13:45:45.694315+0000 mon.raynor-sc-1 [INF] Activating
manager daemon raynor-sc-3
2024-08-14T13:45:47.671192+0000 mon.raynor-sc-1 [INF] Active manager
daemon raynor-sc-3 restarted
2024-08-14T13:45:47.674805+0000 mon.raynor-sc-1 [INF] Activating
manager daemon raynor-sc-3
2024-08-14T13:45:47.675037+0000 mon.raynor-sc-1 [INF] Active manager
daemon raynor-sc-3 restarted
2024-08-14T13:45:47.697264+0000 mon.raynor-sc-1 [INF] Active manager
daemon raynor-sc-3 restarted
2024-08-14T13:45:47.700886+0000 mon.raynor-sc-1 [INF] Activating
manager daemon raynor-sc-3
And in the output of "ceph -W cephadm" I see:
2024-08-14T13:40:32.214742+0000 mgr.raynor-sc-1 [INF] Upgrade: First
pull of localhost:5000/ceph/pacific:v16.2.15
2024-08-14T13:40:34.108767+0000 mgr.raynor-sc-1 [INF] Upgrade:
Target is localhost:5000/ceph/pacific:v16.2.15 with id
3c4eff6082ae7530e7eda038765ce400beb1bc1b8df67dffb45910eb45b06b2c
2024-08-14T13:40:34.112388+0000 mgr.raynor-sc-1 [INF] Upgrade:
Checking mgr daemons...
2024-08-14T13:40:34.112722+0000 mgr.raynor-sc-1 [INF] Upgrade: Need
to upgrade myself (mgr.raynor-sc-1)
2024-08-14T13:40:35.456432+0000 mgr.raynor-sc-1 [INF] It is presumed
safe to stop ['mgr.raynor-sc-2']
2024-08-14T13:40:35.456620+0000 mgr.raynor-sc-1 [INF] Upgrade: It is
presumed safe to stop ['mgr.raynor-sc-2']
2024-08-14T13:40:35.456771+0000 mgr.raynor-sc-1 [INF] Upgrade:
Redeploying mgr.raynor-sc-2
2024-08-14T13:40:35.481790+0000 mgr.raynor-sc-1 [INF] Deploying
daemon mgr.raynor-sc-2 on raynor-sc-2
2024-08-14T13:42:37.608895+0000 mgr.raynor-sc-1 [INF] refreshing
raynor-sc-3 facts
2024-08-14T13:42:39.744098+0000 mgr.raynor-sc-1 [INF] refreshing
raynor-sc-1 facts
2024-08-14T13:42:40.081740+0000 mgr.raynor-sc-1 [INF] refreshing
raynor-sc-2 facts
2024-08-14T13:42:40.937375+0000 mgr.raynor-sc-1 [INF] Applying drive
group all-available-devices on host raynor-sc-1...
2024-08-14T13:42:40.937732+0000 mgr.raynor-sc-1 [INF] Applying drive
group all-available-devices on host raynor-sc-2...
2024-08-14T13:42:40.938079+0000 mgr.raynor-sc-1 [INF] Applying drive
group all-available-devices on host raynor-sc-3...
2024-08-14T13:42:46.226231+0000 mgr.raynor-sc-1 [INF] Upgrade:
Target is localhost:5000/ceph/pacific:v16.2.15 with id
3c4eff6082ae7530e7eda038765ce400beb1bc1b8df67dffb45910eb45b06b2c
2024-08-14T13:42:46.229604+0000 mgr.raynor-sc-1 [INF] Upgrade:
Checking mgr daemons...
2024-08-14T13:42:46.229727+0000 mgr.raynor-sc-1 [INF] Upgrade: Need
to upgrade myself (mgr.raynor-sc-1)
2024-08-14T13:42:47.696110+0000 mgr.raynor-sc-1 [INF] It is presumed
safe to stop ['mgr.raynor-sc-3']
My take on this is that we first attempt to upgrade the mgr on
raynor-sc-2, and don't seem to detect that it hasn't quite worked.
Interestingly, there is a two-minute gap between the deploying line
and the following line; I wonder if something is failing to come up
and we're proceeding after a timer expires?
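I guess I could dig further into the cephadm module log and the mgr
unit on raynor-sc-2, something along these lines (the unit name just
assumes the standard cephadm naming with my fsid):

$ ceph log last cephadm
$ journalctl -u ceph-e773d9c2-6d8d-4413-8e8f-e38f248f5959@mgr.raynor-sc-2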
Any pointers are much appreciated.
Many thanks,
Alex
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx