I’m not using Rook, although I think it would probably help a lot with this
recovery, as Rook is container-based too! Thanks a lot!

On Tue, Oct 13, 2020 at 00:19, Brian Topping <brian.topping@xxxxxxxxx>
wrote:

> I see, maybe you want to look at these instructions. I don’t know if you
> are running Rook, but the point about keeping the container alive by using
> `sleep` is important. Then you can get into the container with `exec` and
> do what you need to.
>
> https://rook.io/docs/rook/v1.4/ceph-disaster-recovery.html#restoring-mon-quorum
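A minimal sketch of the keep-alive trick described above, adapted for a
non-Rook, docker-based ceph-ansible deployment. The image tag, the systemd
unit name, and the container name here are assumptions to adjust for your
setup; <mon-id> is usually the node’s short hostname:

    # Stop the managed mon so systemd stops restarting the container
    # (ceph-ansible typically names the unit ceph-mon@<short-hostname>;
    # this is an assumption, check your units):
    systemctl stop ceph-mon@<mon-id>

    # Start a throwaway container from the same image, overriding the
    # entrypoint so it only sleeps, with mon data and config bind-mounted:
    docker run -d --name mon-recovery --entrypoint sleep \
        -v /var/lib/ceph:/var/lib/ceph \
        -v /etc/ceph:/etc/ceph \
        ceph/daemon:latest-nautilus infinity

    # Get a shell inside it; with ceph-mon stopped the store is no longer
    # locked, so the monmap can now be extracted:
    docker exec -it mon-recovery bash
    ceph-mon -i <mon-id> --extract-monmap /tmp/monmap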
> On Oct 12, 2020, at 4:16 PM, Gaël THEROND <gael.therond@xxxxxxxxxxxx>
> wrote:
>
> Hi Brian!
>
> Thanks a lot for your quick answer, that was fast!
>
> Yes, I’ve read this doc, yet I can’t run the appropriate commands as my
> OSDs are up and running.
>
> As my mon is a container, ceph-mon --extract-monmap won’t work while the
> mon process is running, and if I stop the process the container gets
> restarted and I’m kicked out of it.
>
> I can’t retrieve anything from ceph mon getmap either, as the quorum isn’t
> forming.
>
> Yep, I know that I need three nodes, and a third node recently became
> available for this lab. Unfortunately it’s a lab cluster, so one of my
> colleagues just took that third node for testing purposes... I told you, a
> series of unfortunate events :-)
>
> I can’t just scrap the cluster, as I can’t lose the OSD data.
>
> G.
>
> On Tue, Oct 13, 2020 at 00:01, Brian Topping <brian.topping@xxxxxxxxx>
> wrote:
>
>> Hi there!
>>
>> This isn’t a difficult problem to fix. For purposes of clarity, the
>> monmap is just one part of the monitor database. You generally have all
>> the details correct, though.
>>
>> Have you looked at the process in
>> https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/#recovering-a-monitor-s-broken-monmap?
>>
>> Please do make sure you are working on the copy of the monitor database
>> with the newest epoch. After removing the other monitors and getting your
>> cluster back online, you can re-add monitors at will.
>>
>> Also note that a quorum is defined as “one-half the total number of nodes
>> plus one”. In your case, quorum requires both nodes! Taking either one
>> down would cause this problem. You need an odd number of monitor nodes to
>> be able to take one down, for instance during a rolling upgrade.
>>
>> Hope that helps!
>>
>> Brian
>>
>> On Oct 12, 2020, at 3:54 PM, Gaël THEROND <gael.therond@xxxxxxxxxxxx>
>> wrote:
>>
>> Hi everyone,
>>
>> Because of a series of unfortunate events, I have a container-based Ceph
>> cluster (Nautilus) in bad shape.
>>
>> It’s a lab cluster whose control plane is made of only 2 nodes (I know,
>> that’s bad :-)); each of these nodes runs a mon, a mgr, and a rados-gw as
>> containerized ceph_daemon services.
>>
>> They were installed using ceph-ansible, if that’s relevant to anyone.
>>
>> However, while I was performing an upgrade on the first node, the second
>> one went down too (electrical power outage).
>>
>> As soon as I saw that, I stopped every running upgrade process on the
>> first node.
>>
>> For now, if I try to restart my second node, the cluster isn’t available,
>> as the quorum is looking for two nodes.
>>
>> The container starts and the node elects itself as leader, but all ceph
>> commands hang forever, which is perfectly normal, as the quorum is still
>> waiting for the other member to complete the election process.
>>
>> So, my question is: as I can’t (to my knowledge) extract the monmap in
>> this intermediate state, and as my first node will still be considered a
>> known mon and will try to join back if started properly, can I just copy
>> /etc/ceph/ceph.conf and the mon keyring from
>> /var/lib/ceph/mon/<cluster>-<host>/keyring on the last living node (the
>> second one) into their proper places on the first node? My mon keys were
>> initially the same on both mons, and if I’m not mistaken, my first node,
>> being blank, will create a default store, join the existing cluster, and
>> retrieve the appropriate monmap from the remaining node, right?
>>
>> If not, is there a process to save/extract the monmap when running a
>> container-based Ceph? I can perfectly well exec into the container on the
>> remaining node if it makes any difference.
>>
>> Thanks a lot!
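For the monmap question above, a minimal sketch of the surgery described in
the troubleshooting document Brian linked, run inside the same kind of
sleep container on the surviving node while its mon is stopped (mon names
are placeholders, and per Brian’s advice, work on a backed-up copy of the
monitor store):

    # Inspect the extracted map, then remove the dead monitor from it:
    monmaptool --print /tmp/monmap
    monmaptool /tmp/monmap --rm <dead-mon>

    # Inject the reduced map back into the surviving mon’s store; as the
    # only monitor left in the map, it can then form quorum on its own:
    ceph-mon -i <surviving-mon> --inject-monmap /tmp/monmap

    # Leave and remove the recovery container, then restart the managed
    # mon (unit name is an assumption, as above):
    exit
    docker rm -f mon-recovery
    systemctl start ceph-mon@<surviving-mon>

Once the cluster is reachable again, the other monitor can be re-added with
a blank store and will sync the map from the survivor.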