In /var/lib/ceph/&lt;fsid&gt;/&lt;mgr-daemon-name&gt; on the host whose mgr is reporting the error, there should be a unit.run file that shows what is being done to start the mgr, as well as a few files that get mounted into the mgr on startup, notably the "config" and "keyring" files. That config file should include the mon host addresses. E.g.

[root@vm-01 ~]# cat /var/lib/ceph/5a72983c-ef57-11ed-a389-525400e42d74/mgr.vm-01.ilfvis/config
# minimal ceph.conf for 5a72983c-ef57-11ed-a389-525400e42d74
[global]
	fsid = 5a72983c-ef57-11ed-a389-525400e42d74
	mon_host = [v2:192.168.122.75:3300/0,v1:192.168.122.75:6789/0] [v2:192.168.122.246:3300/0,v1:192.168.122.246:6789/0] [v2:192.168.122.97:3300/0,v1:192.168.122.97:6789/0]

The first thing I'd do is make sure that array of addresses is correct. Then you could check the keyring file as well and see whether it matches what you get from running "ceph auth get <mgr-daemon-name>". E.g. here

[root@vm-01 ~]# cat /var/lib/ceph/5a72983c-ef57-11ed-a389-525400e42d74/mgr.vm-01.ilfvis/keyring
[mgr.vm-01.ilfvis]
	key = AQDf01tk7mn/IRAAvZ+ZhUgT77uZsFBSzLGPyQ==

the key matches with

[ceph: root@vm-00 /]# ceph auth get mgr.vm-01.ilfvis
[mgr.vm-01.ilfvis]
	key = AQDf01tk7mn/IRAAvZ+ZhUgT77uZsFBSzLGPyQ==
	caps mds = "allow *"
	caps mon = "profile mgr"
	caps osd = "allow *"

Normally I wouldn't post keys for obvious reasons (these are just from a test cluster I'll tear back down, so it's fine for me), but those are the first couple of things I'd check. You could also try making adjustments directly to the unit.run file if you have other things you'd like to try.

On Wed, May 10, 2023 at 11:09 AM Ben <ruidong.gao@xxxxxxxxx> wrote:

> Hi,
> This cluster is deployed by cephadm 17.2.5, containerized.
> It ends up in this (no active mgr):
>
> [root@8cd2c0657c77 /]# ceph -s
>   cluster:
>     id:     ad3a132e-e9ee-11ed-8a19-043f72fb8bf9
>     health: HEALTH_WARN
>             6 hosts fail cephadm check
>             no active mgr
>             1/3 mons down, quorum h18w,h19w
>             Degraded data redundancy: 781908/2345724 objects degraded
>             (33.333%), 101 pgs degraded, 209 pgs undersized
>
>   services:
>     mon: 3 daemons, quorum h18w,h19w (age 19m), out of quorum: h15w
>     mgr: no daemons active (since 5h)
>     mds: 1/1 daemons up, 1 standby
>     osd: 9 osds: 6 up (since 5h), 6 in (since 5h)
>     rgw: 2 daemons active (2 hosts, 1 zones)
>
>   data:
>     volumes: 1/1 healthy
>     pools:   8 pools, 209 pgs
>     objects: 781.91k objects, 152 GiB
>     usage:   312 GiB used, 54 TiB / 55 TiB avail
>     pgs:     781908/2345724 objects degraded (33.333%)
>              108 active+undersized
>              101 active+undersized+degraded
>
> I checked h20w; there is a manager container running with this log:
>
> debug 2023-05-10T12:43:23.315+0000 7f5e152ec000 0 monclient(hunting): authenticate timed out after 300
> debug 2023-05-10T12:48:23.318+0000 7f5e152ec000 0 monclient(hunting): authenticate timed out after 300
> debug 2023-05-10T12:53:23.318+0000 7f5e152ec000 0 monclient(hunting): authenticate timed out after 300
> debug 2023-05-10T12:58:23.319+0000 7f5e152ec000 0 monclient(hunting): authenticate timed out after 300
> debug 2023-05-10T13:03:23.319+0000 7f5e152ec000 0 monclient(hunting): authenticate timed out after 300
> debug 2023-05-10T13:08:23.319+0000 7f5e152ec000 0 monclient(hunting): authenticate timed out after 300
> debug 2023-05-10T13:13:23.319+0000 7f5e152ec000 0 monclient(hunting): authenticate timed out after 300
>
> Any idea how to get a mgr up and running again through cephadm?
>
> Thanks,
> Ben
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
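The two checks described in the reply above (verify the mon_host array in the mounted config, and compare the mounted keyring's key with what the mons report) can be sketched as a small shell snippet. The file contents below are mocked stand-ins copied from the examples in the reply so the snippet is self-contained; on a real host the inputs would be the files under /var/lib/ceph/<fsid>/<mgr-daemon-name>/ and the output of "ceph auth get <mgr-daemon-name>".

```shell
#!/bin/sh
# Sketch only: mocked inputs stand in for the mounted config/keyring
# and for "ceph auth get" output on a real cephadm host.
set -eu
dir=$(mktemp -d)

# Stand-in for /var/lib/ceph/<fsid>/<mgr-daemon-name>/config
cat > "$dir/config" <<'EOF'
[global]
	fsid = 5a72983c-ef57-11ed-a389-525400e42d74
	mon_host = [v2:192.168.122.75:3300/0,v1:192.168.122.75:6789/0]
EOF

# Stand-in for /var/lib/ceph/<fsid>/<mgr-daemon-name>/keyring
cat > "$dir/keyring" <<'EOF'
[mgr.vm-01.ilfvis]
	key = AQDf01tk7mn/IRAAvZ+ZhUgT77uZsFBSzLGPyQ==
EOF

# Stand-in for the output of "ceph auth get mgr.vm-01.ilfvis"
cat > "$dir/auth-get" <<'EOF'
[mgr.vm-01.ilfvis]
	key = AQDf01tk7mn/IRAAvZ+ZhUgT77uZsFBSzLGPyQ==
	caps mon = "profile mgr"
EOF

# 1) Check the mon address array the mgr will try to reach
grep mon_host "$dir/config"

# 2) Compare the mounted key with the one the mons hand out
mounted_key=$(awk '$1 == "key" {print $3}' "$dir/keyring")
cluster_key=$(awk '$1 == "key" {print $3}' "$dir/auth-get")
if [ "$mounted_key" = "$cluster_key" ]; then
    result="keys match"
else
    result="KEY MISMATCH: mounted=$mounted_key cluster=$cluster_key"
fi
echo "$result"

rm -rf "$dir"
```

A key mismatch here would explain "monclient(hunting): authenticate timed out" log lines like the ones quoted above, since the mgr would keep failing to authenticate against the mons.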