Following the path you mentioned, it was fixed by changing the owner of /var/lib/ceph from root to 167:167. The cluster was deployed as a non-root user, and the file permissions were in a bit of a mess. After the change, a systemctl daemon-reload and a restart brought the mgr back up.

For the other manager, on the bootstrap host, the journal complained as follows:

May 11 16:52:50 h15w bash[1434578]: debug 2023-05-11T08:52:50.858+0000 7f6b9bba5000 -1 monclient: keyring not found
May 11 16:52:50 h15w bash[1434578]: debug 2023-05-11T08:52:50.858+0000 7f6b9bba5000 -1 auth: failed to load /var/lib/ceph/mgr/ceph-h15w.vuhzxy/keyring: (5) Input/output error
May 11 16:52:50 h15w bash[1434578]: debug 2023-05-11T08:52:50.858+0000 7f6b9bba5000 -1 auth: error parsing file /var/lib/ceph/mgr/ceph-h15w.vuhzxy/keyring: error setting modifi>
May 11 16:52:50 h15w bash[1434578]: debug 2023-05-11T08:52:50.858+0000 7f6b9bba5000 -1 auth: failed to load /var/lib/ceph/mgr/ceph-h15w.vuhzxy/keyring: (5) Input/output error
May 11 16:52:50 h15w bash[1434578]: debug 2023-05-11T08:52:50.858+0000 7f6b9bba5000 -1 auth: error parsing file /var/lib/ceph/mgr/ceph-h15w.vuhzxy/keyring: error setting modifi>
May 11 16:52:50 h15w bash[1434578]: debug 2023-05-11T08:52:50.858+0000 7f6b9bba5000 -1 auth: failed to load /var/lib/ceph/mgr/ceph-h15w.vuhzxy/keyring: (5) Input/output error
May 11 16:52:50 h15w bash[1434578]: debug 2023-05-11T08:52:50.858+0000 7f6b9bba5000 -1 auth: error parsing file /var/lib/ceph/mgr/ceph-h15w.vuhzxy/keyring: error setting modifi>

The keyring file is supposed to hold a base64 key string; after restoring it to its original content, that mgr came up as well (a condensed command summary follows below the quoted advice). There seems to be something inconsistent in how the cluster gets bootstrapped.

Thank you all for the help. The cluster is back to normal now.

Adam King <adking@xxxxxxxxxx> wrote on Thu, May 11, 2023 at 01:33:

> In /var/lib/ceph/<fsid>/<mgr-daemon-name> on the host with that mgr
> reporting the error, there should be a unit.run file that shows what is
> being done to start the mgr, as well as a few files that get mounted into
> the mgr on startup, notably the "config" and "keyring" files. That config
> file should include the mon host addresses. E.g.
>
> [root@vm-01 ~]# cat /var/lib/ceph/5a72983c-ef57-11ed-a389-525400e42d74/mgr.vm-01.ilfvis/config
> # minimal ceph.conf for 5a72983c-ef57-11ed-a389-525400e42d74
> [global]
> fsid = 5a72983c-ef57-11ed-a389-525400e42d74
> mon_host = [v2:192.168.122.75:3300/0,v1:192.168.122.75:6789/0] [v2:192.168.122.246:3300/0,v1:192.168.122.246:6789/0] [v2:192.168.122.97:3300/0,v1:192.168.122.97:6789/0]
>
> The first thing I'd do is probably make sure that array of addresses is
> correct.
>
> Then you could probably check the keyring file as well and see if it
> matches up with what you get running "ceph auth get <mgr-daemon-name>".
> E.g. here
>
> [root@vm-01 ~]# cat /var/lib/ceph/5a72983c-ef57-11ed-a389-525400e42d74/mgr.vm-01.ilfvis/keyring
> [mgr.vm-01.ilfvis]
> key = AQDf01tk7mn/IRAAvZ+ZhUgT77uZsFBSzLGPyQ==
>
> the key matches with
>
> [ceph: root@vm-00 /]# ceph auth get mgr.vm-01.ilfvis
> [mgr.vm-01.ilfvis]
> key = AQDf01tk7mn/IRAAvZ+ZhUgT77uZsFBSzLGPyQ==
> caps mds = "allow *"
> caps mon = "profile mgr"
> caps osd = "allow *"
>
> I wouldn't post them for obvious reasons (these are just on a test cluster
> I'll tear back down, so it's fine for me), but those are the first couple of
> things I'd check. You could also try to make adjustments directly to the
> unit.run file if you have other things you'd like to try.
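For the record, here is roughly what the recovery boiled down to on my side. This is a condensed sketch rather than a tested copy-paste recipe: <fsid> and mgr.<host>.<id> are placeholders for the cluster fsid and the mgr daemon name, and 167:167 is the UID:GID of the ceph user inside the cephadm containers.

# hand the cephadm data directories back to the in-container ceph user
chown -R 167:167 /var/lib/ceph

# reload unit files and restart the cephadm-managed mgr unit
systemctl daemon-reload
systemctl restart ceph-<fsid>@mgr.<host>.<id>.service

# if the on-disk keyring is damaged, print the key the cluster knows, rewrite
# the keyring file by hand, then fix its ownership as well
cephadm shell -- ceph auth get mgr.<host>.<id>
chown 167:167 /var/lib/ceph/<fsid>/mgr.<host>.<id>/keyring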
>
> On Wed, May 10, 2023 at 11:09 AM Ben <ruidong.gao@xxxxxxxxx> wrote:
>
>> Hi,
>> This cluster is deployed by cephadm 17.2.5, containerized.
>> It ends up in this state (no active mgr):
>>
>> [root@8cd2c0657c77 /]# ceph -s
>>   cluster:
>>     id:     ad3a132e-e9ee-11ed-8a19-043f72fb8bf9
>>     health: HEALTH_WARN
>>             6 hosts fail cephadm check
>>             no active mgr
>>             1/3 mons down, quorum h18w,h19w
>>             Degraded data redundancy: 781908/2345724 objects degraded (33.333%), 101 pgs degraded, 209 pgs undersized
>>
>>   services:
>>     mon: 3 daemons, quorum h18w,h19w (age 19m), out of quorum: h15w
>>     mgr: no daemons active (since 5h)
>>     mds: 1/1 daemons up, 1 standby
>>     osd: 9 osds: 6 up (since 5h), 6 in (since 5h)
>>     rgw: 2 daemons active (2 hosts, 1 zones)
>>
>>   data:
>>     volumes: 1/1 healthy
>>     pools:   8 pools, 209 pgs
>>     objects: 781.91k objects, 152 GiB
>>     usage:   312 GiB used, 54 TiB / 55 TiB avail
>>     pgs:     781908/2345724 objects degraded (33.333%)
>>              108 active+undersized
>>              101 active+undersized+degraded
>>
>> I checked h20w; there is a manager container running there with this log:
>>
>> debug 2023-05-10T12:43:23.315+0000 7f5e152ec000 0 monclient(hunting): authenticate timed out after 300
>> debug 2023-05-10T12:48:23.318+0000 7f5e152ec000 0 monclient(hunting): authenticate timed out after 300
>> debug 2023-05-10T12:53:23.318+0000 7f5e152ec000 0 monclient(hunting): authenticate timed out after 300
>> debug 2023-05-10T12:58:23.319+0000 7f5e152ec000 0 monclient(hunting): authenticate timed out after 300
>> debug 2023-05-10T13:03:23.319+0000 7f5e152ec000 0 monclient(hunting): authenticate timed out after 300
>> debug 2023-05-10T13:08:23.319+0000 7f5e152ec000 0 monclient(hunting): authenticate timed out after 300
>> debug 2023-05-10T13:13:23.319+0000 7f5e152ec000 0 monclient(hunting): authenticate timed out after 300
>>
>> Any idea how to get a mgr up and running again through cephadm?
>>
>> Thanks,
>> Ben
>>
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
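For anyone who finds this thread with a similarly stuck mgr, a short, untested checklist distilled from the discussion above; <fsid> and mgr.<host>.<id> are placeholders for the cluster fsid and the daemon name that cephadm reports:

# list the cephadm-managed daemons on this host, with their names and current state
cephadm ls

# follow the mgr's journal to see why it fails to start or authenticate
journalctl -fu ceph-<fsid>@mgr.<host>.<id>.service

# verify that mon_host in the daemon's minimal config points at reachable monitor addresses
cat /var/lib/ceph/<fsid>/mgr.<host>.<id>/config

# compare the on-disk key with what the monitors have for this daemon
cat /var/lib/ceph/<fsid>/mgr.<host>.<id>/keyring
cephadm shell -- ceph auth get mgr.<host>.<id>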