Following the path you mentioned, it was fixed by changing the owner of /var/lib/ceph from root to 167:167. The cluster was deployed as a non-root user, and the file permissions were in a bit of a mess. After the change, a systemctl daemon-reload and a restart brought the mgr back up.

For the other manager, on the bootstrap host, the journal complained as follows:

May 11 16:52:50 h15w bash[1434578]: debug 2023-05-11T08:52:50.858+0000 7f6b9bba5000 -1 monclient: keyring not found
May 11 16:52:50 h15w bash[1434578]: debug 2023-05-11T08:52:50.858+0000 7f6b9bba5000 -1 auth: failed to load /var/lib/ceph/mgr/ceph-h15w.vuhzxy/keyring: (5) Input/output error
May 11 16:52:50 h15w bash[1434578]: debug 2023-05-11T08:52:50.858+0000 7f6b9bba5000 -1 auth: error parsing file /var/lib/ceph/mgr/ceph-h15w.vuhzxy/keyring: error setting modifi>
May 11 16:52:50 h15w bash[1434578]: debug 2023-05-11T08:52:50.858+0000 7f6b9bba5000 -1 auth: failed to load /var/lib/ceph/mgr/ceph-h15w.vuhzxy/keyring: (5) Input/output error
May 11 16:52:50 h15w bash[1434578]: debug 2023-05-11T08:52:50.858+0000 7f6b9bba5000 -1 auth: error parsing file /var/lib/ceph/mgr/ceph-h15w.vuhzxy/keyring: error setting modifi>
May 11 16:52:50 h15w bash[1434578]: debug 2023-05-11T08:52:50.858+0000 7f6b9bba5000 -1 auth: failed to load /var/lib/ceph/mgr/ceph-h15w.vuhzxy/keyring: (5) Input/output error
May 11 16:52:50 h15w bash[1434578]: debug 2023-05-11T08:52:50.858+0000 7f6b9bba5000 -1 auth: error parsing file /var/lib/ceph/mgr/ceph-h15w.vuhzxy/keyring: error setting modifi>

The keyring file is supposed to hold a base64 key string; after restoring it to its original content, that mgr came up as well (a condensed command summary follows below the quoted advice). There seems to be something inconsistent in how the cluster gets bootstrapped.

Thank you all for the help. The cluster is back to normal now.

Adam King <adking@xxxxxxxxxx> wrote on Thu, May 11, 2023 at 01:33:

> In /var/lib/ceph/<fsid>/<mgr-daemon-name> on the host with that mgr
> reporting the error, there should be a unit.run file that shows what is
> being done to start the mgr, as well as a few files that get mounted into
> the mgr on startup, notably the "config" and "keyring" files. That config
> file should include the mon host addresses. E.g.
>
> [root@vm-01 ~]# cat /var/lib/ceph/5a72983c-ef57-11ed-a389-525400e42d74/mgr.vm-01.ilfvis/config
> # minimal ceph.conf for 5a72983c-ef57-11ed-a389-525400e42d74
> [global]
> fsid = 5a72983c-ef57-11ed-a389-525400e42d74
> mon_host = [v2:192.168.122.75:3300/0,v1:192.168.122.75:6789/0] [v2:192.168.122.246:3300/0,v1:192.168.122.246:6789/0] [v2:192.168.122.97:3300/0,v1:192.168.122.97:6789/0]
>
> The first thing I'd do is probably make sure that array of addresses is
> correct.
>
> Then you could probably check the keyring file as well and see if it
> matches up with what you get running "ceph auth get <mgr-daemon-name>".
> E.g. here
>
> [root@vm-01 ~]# cat /var/lib/ceph/5a72983c-ef57-11ed-a389-525400e42d74/mgr.vm-01.ilfvis/keyring
> [mgr.vm-01.ilfvis]
> key = AQDf01tk7mn/IRAAvZ+ZhUgT77uZsFBSzLGPyQ==
>
> the key matches with
>
> [ceph: root@vm-00 /]# ceph auth get mgr.vm-01.ilfvis
> [mgr.vm-01.ilfvis]
> key = AQDf01tk7mn/IRAAvZ+ZhUgT77uZsFBSzLGPyQ==
> caps mds = "allow *"
> caps mon = "profile mgr"
> caps osd = "allow *"
>
> I wouldn't post them for obvious reasons (these are just on a test cluster
> I'll tear back down, so it's fine for me), but those are the first couple of
> things I'd check. You could also try to make adjustments directly to the
> unit.run file if you have other things you'd like to try.
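For the record, here is roughly what the recovery boiled down to on my side. This is a condensed sketch rather than a tested copy-paste recipe: <fsid> and mgr.<host>.<id> are placeholders for the cluster fsid and the mgr daemon name, and 167:167 is the UID:GID of the ceph user inside the cephadm containers.

# hand the cephadm data directories back to the in-container ceph user
chown -R 167:167 /var/lib/ceph

# reload unit files and restart the cephadm-managed mgr unit
systemctl daemon-reload
systemctl restart ceph-<fsid>@mgr.<host>.<id>.service

# if the on-disk keyring is damaged, print the key the cluster knows, rewrite
# the keyring file by hand, then fix its ownership as well
cephadm shell -- ceph auth get mgr.<host>.<id>
chown 167:167 /var/lib/ceph/<fsid>/mgr.<host>.<id>/keyring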
>
> On Wed, May 10, 2023 at 11:09 AM Ben <ruidong.gao@xxxxxxxxx> wrote:
>
>> Hi,
>> This cluster is deployed by cephadm 17.2.5, containerized.
>> It ends up in this state (no active mgr):
>>
>> [root@8cd2c0657c77 /]# ceph -s
>>   cluster:
>>     id:     ad3a132e-e9ee-11ed-8a19-043f72fb8bf9
>>     health: HEALTH_WARN
>>             6 hosts fail cephadm check
>>             no active mgr
>>             1/3 mons down, quorum h18w,h19w
>>             Degraded data redundancy: 781908/2345724 objects degraded (33.333%), 101 pgs degraded, 209 pgs undersized
>>
>>   services:
>>     mon: 3 daemons, quorum h18w,h19w (age 19m), out of quorum: h15w
>>     mgr: no daemons active (since 5h)
>>     mds: 1/1 daemons up, 1 standby
>>     osd: 9 osds: 6 up (since 5h), 6 in (since 5h)
>>     rgw: 2 daemons active (2 hosts, 1 zones)
>>
>>   data:
>>     volumes: 1/1 healthy
>>     pools:   8 pools, 209 pgs
>>     objects: 781.91k objects, 152 GiB
>>     usage:   312 GiB used, 54 TiB / 55 TiB avail
>>     pgs:     781908/2345724 objects degraded (33.333%)
>>              108 active+undersized
>>              101 active+undersized+degraded
>>
>> I checked h20w; there is a manager container running there with this log:
>>
>> debug 2023-05-10T12:43:23.315+0000 7f5e152ec000 0 monclient(hunting): authenticate timed out after 300
>> debug 2023-05-10T12:48:23.318+0000 7f5e152ec000 0 monclient(hunting): authenticate timed out after 300
>> debug 2023-05-10T12:53:23.318+0000 7f5e152ec000 0 monclient(hunting): authenticate timed out after 300
>> debug 2023-05-10T12:58:23.319+0000 7f5e152ec000 0 monclient(hunting): authenticate timed out after 300
>> debug 2023-05-10T13:03:23.319+0000 7f5e152ec000 0 monclient(hunting): authenticate timed out after 300
>> debug 2023-05-10T13:08:23.319+0000 7f5e152ec000 0 monclient(hunting): authenticate timed out after 300
>> debug 2023-05-10T13:13:23.319+0000 7f5e152ec000 0 monclient(hunting): authenticate timed out after 300
>>
>> Any idea how to get a mgr up and running again through cephadm?
>>
>> Thanks,
>> Ben
>>
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
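For anyone who finds this thread with a similarly stuck mgr, a short, untested checklist distilled from the discussion above; <fsid> and mgr.<host>.<id> are placeholders for the cluster fsid and the daemon name that cephadm reports:

# list the cephadm-managed daemons on this host, with their names and current state
cephadm ls

# follow the mgr's journal to see why it fails to start or authenticate
journalctl -fu ceph-<fsid>@mgr.<host>.<id>.service

# verify that mon_host in the daemon's minimal config points at reachable monitor addresses
cat /var/lib/ceph/<fsid>/mgr.<host>.<id>/config

# compare the on-disk key with what the monitors have for this daemon
cat /var/lib/ceph/<fsid>/mgr.<host>.<id>/keyring
cephadm shell -- ceph auth get mgr.<host>.<id>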