Hi Raul,
we had quite a similar issue last year. We removed the two failing
MONs from the monmap and injected the reduced monmap into the
surviving MON so it would have a quorum. After that the other daemons
would start, but we had to deal with large MON stores (around 250 GB,
I believe) during this phase. IIRC we had to prevent the store from
compacting too often during startup (mon_compact_on_start = false) and
also added SSD storage for the MON store to speed up the sync.
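For reference, the monmap surgery looked roughly like this sketch; the
mon IDs (a, b, c) and the /tmp/monmap path are placeholders, not our
actual names, and the mon daemon must be stopped first:

```shell
# Stop the surviving mon before touching its store.
systemctl stop ceph-mon@a

# Extract the current monmap from the surviving mon (id "a" is a placeholder).
ceph-mon -i a --extract-monmap /tmp/monmap

# Remove the two failing mons from the map (names are examples).
monmaptool /tmp/monmap --rm b
monmaptool /tmp/monmap --rm c

# Inject the reduced monmap back into the surviving mon and restart it.
ceph-mon -i a --inject-monmap /tmp/monmap
systemctl start ceph-mon@a
```

With only one mon left in the map, that mon forms a quorum by itself.
The compaction setting mentioned above goes into the [mon] section of
ceph.conf (mon_compact_on_start = false) before starting the daemon.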
Eventually, we brought the cluster up into a healthy state and then
added back the two crashed MONs. The root cause was that /var/ ran out
of disk space. So in our case it definitely was not a bug. ;-)
Hope this helps!
Eugen
Quoting Raul H C Lopes <raul.cardoso.lopes@xxxxxxx>:
Dear CEPH dev team,
I have a Ceph cluster with three MONs, two of which are down. When I
try to start them they crash, and journalctl shows that they crashed
and a core dump was created.
Would that be a bug? Or a corrupt DB?
I then have a third MON that starts fine, but when I get mon_status
through the admin socket I see
"quorum": []
"state": "probing"
Because of that (I believe) I cannot use 'ceph orch' to create new MONs.
So my questions:
- Is there a way that I can use 'ceph orch' to create new MONs?
- Can I just rsync the store.db from this running node to the
crashing MON nodes?
- Or do I have to rebuild the store.db using a script as in
https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/#mon-store-recovery-using-osds
Regards,
Raul
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx