In /var/lib/ceph/&lt;fsid&gt;/&lt;mgr-daemon-name&gt; on the host whose mgr is reporting the error, there should be a unit.run file that shows what is being done to start the mgr, as well as a few files that get mounted into the mgr on startup, notably the "config" and "keyring" files. That config file should include the mon host addresses. E.g.

[root@vm-01 ~]# cat /var/lib/ceph/5a72983c-ef57-11ed-a389-525400e42d74/mgr.vm-01.ilfvis/config
# minimal ceph.conf for 5a72983c-ef57-11ed-a389-525400e42d74
[global]
	fsid = 5a72983c-ef57-11ed-a389-525400e42d74
	mon_host = [v2:192.168.122.75:3300/0,v1:192.168.122.75:6789/0] [v2:192.168.122.246:3300/0,v1:192.168.122.246:6789/0] [v2:192.168.122.97:3300/0,v1:192.168.122.97:6789/0]

The first thing I'd do is make sure that array of addresses is correct. Then you could check the keyring file as well and see whether it matches what you get from running "ceph auth get <mgr-daemon-name>". E.g. here

[root@vm-01 ~]# cat /var/lib/ceph/5a72983c-ef57-11ed-a389-525400e42d74/mgr.vm-01.ilfvis/keyring
[mgr.vm-01.ilfvis]
	key = AQDf01tk7mn/IRAAvZ+ZhUgT77uZsFBSzLGPyQ==

the key matches with

[ceph: root@vm-00 /]# ceph auth get mgr.vm-01.ilfvis
[mgr.vm-01.ilfvis]
	key = AQDf01tk7mn/IRAAvZ+ZhUgT77uZsFBSzLGPyQ==
	caps mds = "allow *"
	caps mon = "profile mgr"
	caps osd = "allow *"

Normally I wouldn't post keys for obvious reasons (these are just from a test cluster I'll tear back down, so it's fine for me), but those are the first couple of things I'd check. You could also try making adjustments directly to the unit.run file if you have other things you'd like to try.

On Wed, May 10, 2023 at 11:09 AM Ben <ruidong.gao@xxxxxxxxx> wrote:

> Hi,
> This cluster is deployed by cephadm 17.2.5, containerized.
> It ends up in this (no active mgr):
>
> [root@8cd2c0657c77 /]# ceph -s
>   cluster:
>     id:     ad3a132e-e9ee-11ed-8a19-043f72fb8bf9
>     health: HEALTH_WARN
>             6 hosts fail cephadm check
>             no active mgr
>             1/3 mons down, quorum h18w,h19w
>             Degraded data redundancy: 781908/2345724 objects degraded
>             (33.333%), 101 pgs degraded, 209 pgs undersized
>
>   services:
>     mon: 3 daemons, quorum h18w,h19w (age 19m), out of quorum: h15w
>     mgr: no daemons active (since 5h)
>     mds: 1/1 daemons up, 1 standby
>     osd: 9 osds: 6 up (since 5h), 6 in (since 5h)
>     rgw: 2 daemons active (2 hosts, 1 zones)
>
>   data:
>     volumes: 1/1 healthy
>     pools:   8 pools, 209 pgs
>     objects: 781.91k objects, 152 GiB
>     usage:   312 GiB used, 54 TiB / 55 TiB avail
>     pgs:     781908/2345724 objects degraded (33.333%)
>              108 active+undersized
>              101 active+undersized+degraded
>
> I checked h20w; there is a manager container running with this log:
>
> debug 2023-05-10T12:43:23.315+0000 7f5e152ec000 0 monclient(hunting): authenticate timed out after 300
> debug 2023-05-10T12:48:23.318+0000 7f5e152ec000 0 monclient(hunting): authenticate timed out after 300
> debug 2023-05-10T12:53:23.318+0000 7f5e152ec000 0 monclient(hunting): authenticate timed out after 300
> debug 2023-05-10T12:58:23.319+0000 7f5e152ec000 0 monclient(hunting): authenticate timed out after 300
> debug 2023-05-10T13:03:23.319+0000 7f5e152ec000 0 monclient(hunting): authenticate timed out after 300
> debug 2023-05-10T13:08:23.319+0000 7f5e152ec000 0 monclient(hunting): authenticate timed out after 300
> debug 2023-05-10T13:13:23.319+0000 7f5e152ec000 0 monclient(hunting): authenticate timed out after 300
>
> Any idea how to get a mgr up and running again through cephadm?
>
> Thanks,
> Ben
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
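The two checks described in the reply above (verify the mon_host array in the mounted config, and compare the mounted keyring's key with what the mons report) can be sketched as a small shell snippet. The file contents below are mocked stand-ins copied from the examples in the reply so the snippet is self-contained; on a real host the inputs would be the files under /var/lib/ceph/<fsid>/<mgr-daemon-name>/ and the output of "ceph auth get <mgr-daemon-name>".

```shell
#!/bin/sh
# Sketch only: mocked inputs stand in for the mounted config/keyring
# and for "ceph auth get" output on a real cephadm host.
set -eu
dir=$(mktemp -d)

# Stand-in for /var/lib/ceph/<fsid>/<mgr-daemon-name>/config
cat > "$dir/config" <<'EOF'
[global]
	fsid = 5a72983c-ef57-11ed-a389-525400e42d74
	mon_host = [v2:192.168.122.75:3300/0,v1:192.168.122.75:6789/0]
EOF

# Stand-in for /var/lib/ceph/<fsid>/<mgr-daemon-name>/keyring
cat > "$dir/keyring" <<'EOF'
[mgr.vm-01.ilfvis]
	key = AQDf01tk7mn/IRAAvZ+ZhUgT77uZsFBSzLGPyQ==
EOF

# Stand-in for the output of "ceph auth get mgr.vm-01.ilfvis"
cat > "$dir/auth-get" <<'EOF'
[mgr.vm-01.ilfvis]
	key = AQDf01tk7mn/IRAAvZ+ZhUgT77uZsFBSzLGPyQ==
	caps mon = "profile mgr"
EOF

# 1) Check the mon address array the mgr will try to reach
grep mon_host "$dir/config"

# 2) Compare the mounted key with the one the mons hand out
mounted_key=$(awk '$1 == "key" {print $3}' "$dir/keyring")
cluster_key=$(awk '$1 == "key" {print $3}' "$dir/auth-get")
if [ "$mounted_key" = "$cluster_key" ]; then
    result="keys match"
else
    result="KEY MISMATCH: mounted=$mounted_key cluster=$cluster_key"
fi
echo "$result"

rm -rf "$dir"
```

A key mismatch here would explain "monclient(hunting): authenticate timed out" log lines like the ones quoted above, since the mgr would keep failing to authenticate against the mons.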