Re: pacific: ceph-mon services stopped after OSDs are out/down

There's an existing tracker issue [1] that hasn't been updated in a year. The OP reported that restarting the other MONs resolved it; have you tried that?

[1] https://tracker.ceph.com/issues/52760
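
If not, that could be worth a try; on a cephadm-managed cluster a sketch of it would be (FSID and HOSTNAME are placeholders here):

ceph orch daemon restart mon.HOSTNAME

or directly on the respective host:

systemctl restart ceph-FSID@mon.HOSTNAME.service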

Quoting Mevludin Blazevic <mblazevic@xxxxxxxxxxxxxx>:

It's very strange. The keyring of the failed Ceph monitor is the same as on one of the working monitor hosts. The failed MON and the working MONs also have the same SELinux policies and firewalld settings. The network connection is also present, since all OSD daemons on the failed Ceph monitor node are up.

On 13.12.2022 at 11:43, Eugen Block wrote:
So you get "Permission denied" errors. I'm guessing that either the MON keyring is not present (or is wrong) or the MON directory doesn't belong to the ceph user. Can you check:

ls -l /var/lib/ceph/FSID/mon.sparci-store1/

Compare the keyring file with the ones on the working mon nodes.
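
If the ownership turns out to be wrong, a chown should fix it. Note that cephadm containers typically run the ceph user with UID/GID 167, so on the host it would be something like (FSID as a placeholder, path as above):

chown -R 167:167 /var/lib/ceph/FSID/mon.sparci-store1/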

Quoting Mevludin Blazevic <mblazevic@xxxxxxxxxxxxxx>:

Hi Eugen,

I assume the MON db is stored on the "OS disk". I could not find any error-related lines in cephadm.log; here is what journalctl -xe tells me:

Dec 13 11:24:21 sparci-store1 ceph-8c774934-1535-11ec-973e-525400130e4f-mon-sparci-store1[786211]: debug 2022-12-13T10:24:21.392+0000 7f318e1fa700  1 mon.sparci-store1@-1(???).paxosservice(auth 251..491) refresh upgraded, format 0 -> 3
Dec 13 11:24:21 sparci-store1 ceph-8c774934-1535-11ec-973e-525400130e4f-mon-sparci-store1[786211]: debug 2022-12-13T10:24:21.397+0000 7f3179248700  1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7f3179248700' had timed out after 0.000000000s
Dec 13 11:24:21 sparci-store1 ceph-8c774934-1535-11ec-973e-525400130e4f-mon-sparci-store1[786211]: debug 2022-12-13T10:24:21.397+0000 7f318e1fa700  0 mon.sparci-store1@-1(probing) e5  my rank is now 1 (was -1)
Dec 13 11:24:21 sparci-store1 ceph-8c774934-1535-11ec-973e-525400130e4f-mon-sparci-store1[786211]: debug 2022-12-13T10:24:21.398+0000 7f317ba4d700 -1 mon.sparci-store1@1(probing) e5 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
Dec 13 11:24:21 sparci-store1 systemd[1]: Started Ceph mon.sparci-store1 for 8c774934-1535-11ec-973e-525400130e4f.
-- Subject: Unit ceph-8c774934-1535-11ec-973e-525400130e4f@mon.sparci-store1.service has finished start-up
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
--
-- Unit ceph-8c774934-1535-11ec-973e-525400130e4f@mon.sparci-store1.service has finished starting up.
--
-- The start-up result is done.
Dec 13 11:24:21 sparci-store1 ceph-8c774934-1535-11ec-973e-525400130e4f-mon-sparci-store1[786211]: debug 2022-12-13T10:24:21.599+0000 7f317ba4d700 -1 mon.sparci-store1@1(probing) e5 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
Dec 13 11:24:21 sparci-store1 ceph-8c774934-1535-11ec-973e-525400130e4f-mon-sparci-store1[786211]: debug 2022-12-13T10:24:21.600+0000 7f3177a45700  0 mon.sparci-store1@1(probing) e18  removed from monmap, suicide.
Dec 13 11:24:21 sparci-store1 systemd[1]: var-lib-containers-storage-overlay-2e67bce8ea3795683c4326479c7169a713e9a7630b31f25d60cd45bbd9fa56bd-merged.mount: Succeeded.
-- Subject: Unit succeeded
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
--
-- The unit var-lib-containers-storage-overlay-2e67bce8ea3795683c4326479c7169a713e9a7630b31f25d60cd45bbd9fa56bd-merged.mount has successfully entered the 'dead' state.
Dec 13 11:24:21 sparci-store1 bash[786318]: Error: no container with name or ID "ceph-8c774934-1535-11ec-973e-525400130e4f-mon.sparci-store1" found: no such container
Dec 13 11:24:21 sparci-store1 bash[786346]: Error: no container with name or ID "ceph-8c774934-1535-11ec-973e-525400130e4f-mon-sparci-store1" found: no such container
Dec 13 11:24:21 sparci-store1 bash[786375]: Error: no container with name or ID "ceph-8c774934-1535-11ec-973e-525400130e4f-mon.sparci-store1" found: no such container
Dec 13 11:24:21 sparci-store1 systemd[1]: ceph-8c774934-1535-11ec-973e-525400130e4f@mon.sparci-store1.service: Succeeded.
-- Subject: Unit succeeded
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
--
-- The unit ceph-8c774934-1535-11ec-973e-525400130e4f@mon.sparci-store1.service has successfully entered the 'dead' state.
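
If I read those messages correctly, the "removed from monmap, suicide." line means this MON no longer belongs to the current monmap (e18). From one of the working nodes, the current members should be visible with:

ceph mon dump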

Regards,

Mevludin


On 08.12.2022 at 09:30, Eugen Block wrote:
Hi,

Do the MONs use the same SAS interface? They store the MON db on a local disk, so it might be related. But without any logs or more details, it's just guessing.
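
On a cephadm deployment that db usually lives under /var/lib/ceph/FSID/mon.HOSTNAME/store.db (placeholders), so something like this would show its size and which filesystem it sits on:

du -sh /var/lib/ceph/FSID/mon.HOSTNAME/store.db
df -h /var/lib/ceph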

Regards,
Eugen

Quoting Mevludin Blazevic <mblazevic@xxxxxxxxxxxxxx>:

Hi all,

I'm running Pacific with cephadm.

After installation, Ceph automatically provisioned 5 monitor nodes across the cluster. After a few OSDs crashed due to a hardware issue with the SAS interface, 3 of the monitor services stopped and won't restart. Is this related to the OSD crash problem?
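
For reference, the MON daemon states as cephadm sees them can be listed with something like:

ceph orch ps --daemon-type mon

and ceph -s shows how many MONs are currently in quorum.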

Thanks,
Mevludin

--
Mevludin Blazevic, M.Sc.

University of Koblenz-Landau
Computing Centre (GHRKO)
Universitaetsstrasse 1
D-56070 Koblenz, Germany
Room A023
Tel: +49 261/287-1326

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



