Re: Monitors not reaching quorum

Joao Eduardo Luis <joao@xxxxxxx> · Mon, 25 Jul 2016 18:16:35 +0100

On 07/25/2016 05:55 PM, Sergio A. de Carvalho Jr. wrote:
I just forced an NTP update on all hosts to rule out clock skew. I also
checked that every host can reach all other hosts on port 6789.
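
A minimal sketch of the port check described above, assuming the monitor addresses from the monmap quoted further down in this thread. The helper names (can_connect, peers_of) are made up for illustration, and a successful TCP connect only shows the port is open, not that the monitors can actually exchange messages.

```python
import socket

# Monitor addresses as listed in the monmap quoted later in this thread.
MON_ADDRS = {
    "60z0m02": "10.98.2.166",
    "60zxl02": "10.98.2.167",
    "610wl02": "10.98.2.173",
    "618yl02": "10.98.2.214",
    "615yl02": "10.98.2.216",
}
MON_PORT = 6789


def can_connect(host, port=MON_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


def peers_of(local_name, addrs=MON_ADDRS):
    """Return (name, address) pairs for every monitor except the local one."""
    return [(name, ip) for name, ip in sorted(addrs.items()) if name != local_name]
```

To be meaningful this would have to run on each monitor host in turn, e.g. looping over peers_of("60z0m02") on 60z0m02 and calling can_connect(ip) for each peer, since reachability can differ per direction.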

I then stopped monitor 0 (60z0m02) and started monitor 1 (60zxl02), but
the 3 monitors left (1 - 60zxl02, 2 - 610wl02, 4 - 615yl02) were still
having problems reaching quorum. That led me to believe monitor 4 was
the problem because I had a quorum before with monitors 0, 1, 2.

So I stopped monitor 4 and started monitor 0 again, but this time
monitors 0, 1, 2 failed to reach a quorum, which is rather puzzling.

All hosts are pretty much idle all the time so I can't see why monitors
would be getting stuck.

Grab 'ceph daemon mon.<ID> mon_status' from all monitors, set 'debug mon 
= 10', 'debug paxos = 10', 'debug ms = 1', grab logs from the monitors 
for the periods before, during and after election (ideally for at least 
two election cycles). Put them up somewhere on the interwebs and send us 
a link.

If you don't feel comfortable putting that link in the list, send me an 
email directly with the url.
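
A rough sketch of the collection steps above, assuming the debug levels are raised at runtime through the admin socket's 'config set' command rather than by editing ceph.conf and restarting. The monitor IDs are the ones from this thread; commands_for and render are hypothetical helper names.

```python
import shlex

# Monitor IDs from this thread; 618yl02 is powered off, so it is left out.
MON_IDS = ["60z0m02", "60zxl02", "610wl02", "615yl02"]

# Debug levels suggested above, written with underscores for 'config set'.
DEBUG_LEVELS = {"debug_mon": "10", "debug_paxos": "10", "debug_ms": "1"}


def commands_for(mon_id):
    """Return the admin-socket commands to run on the host of one monitor."""
    daemon = f"mon.{mon_id}"
    cmds = [["ceph", "daemon", daemon, "mon_status"]]
    for option, level in DEBUG_LEVELS.items():
        cmds.append(["ceph", "daemon", daemon, "config", "set", option, level])
    return cmds


def render(cmd):
    """Format one command as a shell-safe string for copy and paste."""
    return shlex.join(cmd)
```

Each command has to be run on the host where that monitor lives, because 'ceph daemon' talks to the local admin socket. Printing render(cmd) for every cmd in commands_for(mon_id) gives the lines to paste; remember to lower the debug levels again once the logs are collected.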

I'll be happy to take a look later tonight.

  -Joao

On Mon, Jul 25, 2016 at 5:18 PM, Joao Eduardo Luis <joao@xxxxxxx> wrote:

    On 07/25/2016 04:34 PM, Sergio A. de Carvalho Jr. wrote:

        Thanks, Joao.

        All monitors have the exact same monmap.

        I suspect you're right that there might be some communication
        problem though. I stopped monitor 1 (60zxl02), but the other 3
        monitors still failed to reach a quorum. I could see monitor 0 was
        still declaring victory but the others were always calling for new
        elections:

        2016-07-25 15:18:59.775144 7f8760af7700  0 log_channel(cluster) log [INF] : mon.60z0m02@0 won leader election with quorum 0,2,4

        2016-07-25 15:18:54.702176 7fc1b357d700  1 mon.610wl02@2(electing) e5 handle_timecheck drop unexpected msg
        2016-07-25 15:18:54.704526 7fc1b357d700  1 mon.610wl02@2(electing).data_health(11626) service_dispatch not in quorum -- drop message
        2016-07-25 15:19:09.792511 7fc1b3f7e700  1 mon.610wl02@2(peon).paxos(paxos recovering c 1318755..1319322) lease_timeout -- calling new election
        2016-07-25 15:19:09.792825 7fc1b357d700  0 log_channel(cluster) log [INF] : mon.610wl02 calling new monitor election

        I'm curious about the "handle_timecheck drop unexpected msg" message.

    Timechecks (i.e., checking for clock skew), as well as the
    data_health service (which makes sure you have enough disk space in
    the mon data dir), are only run when you have a quorum. If such a
    message is received by a monitor that is not in a quorum, regardless
    of its state, it will be dropped.

    Assuming you now took one of the self-appointed leaders out (by
    shutting it down, for instance), you should now check what's
    causing elections not to hold.

    In these cases, assuming your 3 monitors do form a quorum, the
    traditional issue tends to be 'lease timeouts'. I.e., the leader
    fails to provide a lease extension on paxos for the peons, and the
    peons assume the leader failed in some form (unresponsive, down,
    whatever).

    Above it does seem a lease timeout was triggered on a peon. This may
    have happened because:

    1. leader did not extend the lease
    2. leader did extend the lease but lease was in the past - usually
    indication of a clock skew on the leader, on the peons, or both.
    3. leader did extend the lease, it was with the correct time but
    peon failed to dispatch the message on time.
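
To make cases 2 and 3 concrete, here is a toy model of the lease check. It is a simplified illustration and not the monitor's actual code: the 5 second lease length, the timestamps and the function names are all assumptions made for the example.

```python
def lease_expiry(leader_clock, lease_seconds):
    """Expiry timestamp the leader stamps on a lease, using its own clock."""
    return leader_clock + lease_seconds


def peon_sees_valid_lease(expiry, peon_clock):
    """A peon treats the lease as valid only while its own clock is before expiry."""
    return peon_clock < expiry


TRUE_TIME = 1000.0     # arbitrary reference instant, in seconds
LEASE_SECONDS = 5.0    # assumed lease length for this example

# Clocks in sync: the lease expires at 1005.0 and is valid on arrival.
synced = lease_expiry(TRUE_TIME, LEASE_SECONDS)

# Case 2: the leader's clock runs 6 s behind, so the expiry it stamps
# (999.0) is already in the past from the peon's point of view.
skewed = lease_expiry(TRUE_TIME - 6.0, LEASE_SECONDS)

# Case 3: the lease itself is fine, but the peon only gets around to
# processing the message 7 s later, after the lease has run out.
late_peon_clock = TRUE_TIME + 7.0
```

In this model peon_sees_valid_lease(synced, TRUE_TIME) holds, while both peon_sees_valid_lease(skewed, TRUE_TIME) and peon_sees_valid_lease(synced, late_peon_clock) fail, which is the point at which a peon would call a new election.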

    Both 1. and 2. may be due to several factors, but most commonly it's
    because the monitor was stuck doing something. This something, more
    often than not, is leveldb. If this is the case, check the size of
    your leveldb. If it is over 5 or 6GB in size, you may need to
    manually compact the store (mon compact on start = true, iirc).
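
A small sketch for checking the store size mentioned above. The 5 GB threshold comes from the rule of thumb in the previous paragraph; the helper names and the store path in the note below are assumptions, so adjust them to your 'mon data' setting.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files below path, without following symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # A file may vanish mid-walk while leveldb is compacting.
                pass
    return total


def needs_compaction(path, threshold_gb=5.0):
    """Return True if the store under path is larger than threshold_gb."""
    return dir_size_bytes(path) > threshold_gb * 1024 ** 3
```

With the default data layout the store would typically sit under something like /var/lib/ceph/mon/ceph-60z0m02/store.db (an assumed path, not confirmed for this cluster), so needs_compaction() on that directory, run on each monitor host, gives a quick yes/no.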

    HTH

       -Joao

        On Mon, Jul 25, 2016 at 4:10 PM, Joao Eduardo Luis <joao@xxxxxxx> wrote:

             On 07/25/2016 03:41 PM, Sergio A. de Carvalho Jr. wrote:

                 In the logs, 2 monitors are constantly reporting that they
                 won the leader election:

                 60z0m02 (monitor 0):
                 2016-07-25 14:31:11.644335 7f8760af7700  0 log_channel(cluster) log [INF] : mon.60z0m02@0 won leader election with quorum 0,2,4
                 2016-07-25 14:31:44.521552 7f8760af7700  1 mon.60z0m02@0(leader).paxos(paxos recovering c 1318755..1319320) collect timeout, calling fresh election

                 60zxl02 (monitor 1):
                 2016-07-25 14:31:59.542346 7fefdeaed700  1 mon.60zxl02@1(electing).elector(11441) init, last seen epoch 11441
                 2016-07-25 14:32:04.583929 7fefdf4ee700  0 log_channel(cluster) log [INF] : mon.60zxl02@1 won leader election with quorum 1,2,4
                 2016-07-25 14:32:33.440103 7fefdf4ee700  1 mon.60zxl02@1(leader).paxos(paxos recovering c 1318755..1319319) collect timeout, calling fresh election
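
One way to spot this pattern across several monitor logs is to pull out every "won leader election" line and group the claims by monitor. The sketch below assumes each log entry sits on a single line, as in the monitor log files themselves; the regular expression is fitted to the excerpts in this thread and the function name is made up.

```python
import re
from collections import defaultdict

# Matches lines such as:
# 2016-07-25 14:31:11.644335 ... mon.60z0m02@0 won leader election with quorum 0,2,4
WON_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) .*"
    r"mon\.(?P<name>\S+)@(?P<rank>\d+) won leader election "
    r"with quorum (?P<quorum>[\d,]+)"
)

SAMPLE = [
    "2016-07-25 14:31:11.644335 7f8760af7700  0 log_channel(cluster) log "
    "[INF] : mon.60z0m02@0 won leader election with quorum 0,2,4",
    "2016-07-25 14:31:44.521552 7f8760af7700  1 mon.60z0m02@0(leader)"
    ".paxos(paxos recovering c 1318755..1319320) collect timeout, "
    "calling fresh election",
    "2016-07-25 14:32:04.583929 7fefdf4ee700  0 log_channel(cluster) log "
    "[INF] : mon.60zxl02@1 won leader election with quorum 1,2,4",
]


def victory_claims(lines):
    """Map monitor name -> list of (timestamp, quorum ranks) victory claims."""
    claims = defaultdict(list)
    for line in lines:
        match = WON_RE.search(line)
        if match:
            quorum = tuple(int(r) for r in match.group("quorum").split(","))
            claims[match.group("name")].append((match.group("ts"), quorum))
    return dict(claims)
```

Feeding it the concatenated logs of all monitors and finding more than one key in the result, as with SAMPLE here, means more than one monitor has been declaring itself leader over that period.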

             There are two likely scenarios to explain this:

             1. The monitors have different monitors in their monmaps - this
             could happen if you didn't add the new monitor via 'ceph mon add'.
             You can check this by running 'ceph daemon mon.<ID> mon_status'
             for each of the monitors in the cluster.

             2. Some of the monitors are unable to communicate with each
             other, thus will never acknowledge the same leader. This does
             not mean you have two leaders for the same cluster, but it does
             mean that you will end up having two monitors declaring victory
             and becoming the self-proclaimed leader in the cluster. The
             peons should still only belong to one quorum.
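
For the first scenario, the mon_status outputs can be compared mechanically. This is only a sketch: it assumes each monitor's output was saved and parsed as JSON, and it relies on the field names visible in the mon_status dump elsewhere in this thread (monmap, epoch, fsid, mons, rank, name, addr). The function names are hypothetical.

```python
def monmap_signature(mon_status):
    """Reduce one parsed mon_status to the parts of the monmap worth comparing."""
    monmap = mon_status["monmap"]
    members = tuple(
        sorted((m["rank"], m["name"], m["addr"]) for m in monmap["mons"])
    )
    return (monmap["epoch"], monmap["fsid"], members)


def monmaps_agree(statuses):
    """statuses maps monitor name -> parsed mon_status.

    Returns (agree, signatures) so that a mismatch can be inspected.
    """
    signatures = {name: monmap_signature(s) for name, s in statuses.items()}
    return len(set(signatures.values())) <= 1, signatures
```

If monmaps_agree() returns False, printing the signatures side by side shows which monitor has a different epoch or member list.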

             If this does not help you, try setting 'debug mon = 10' and
             'debug ms = 1' on the monitors and check the logs, making sure
             the monitors get the probes and follow the election process. If
             you need further assistance, put those logs online somewhere we
             can access them and we'll try to help you out.

                -Joao

                 On Mon, Jul 25, 2016 at 3:27 PM, Sergio A. de Carvalho Jr.
                 <scarvalhojr@xxxxxxxxx> wrote:

                      Hi,

                      I have a cluster of 5 hosts running Ceph 0.94.6 on CentOS
                      6.5. On each host, there is 1 monitor and 13 OSDs. We had
                      an issue with the network and, for some reason I still
                      don't know, the servers were restarted. One host is
                      still down, but the monitors on the 4 remaining servers
                      are failing to enter a quorum.

                      I managed to get a quorum of 3 monitors by stopping all
                      Ceph monitors and OSDs across all machines, and bringing
                      up the top 3 ranked monitors in order of rank. After a
                      few minutes, the 60z0m02 monitor (the top ranked one)
                      became the leader:

                      {
                           "name": "60z0m02",
                           "rank": 0,
                           "state": "leader",
                           "election_epoch": 11328,
                           "quorum": [
                               0,
                               1,
                               2
                           ],
                           "outside_quorum": [],
                           "extra_probe_peers": [],
                           "sync_provider": [],
                           "monmap": {
                               "epoch": 5,
                               "fsid": "2f51a247-3155-4bcf-9aee-c6f6b2c5e2af",
                               "modified": "2016-04-28 22:26:48.604393",
                               "created": "0.000000",
                               "mons": [
                                   {
                                       "rank": 0,
                                       "name": "60z0m02",
                                       "addr": "10.98.2.166:6789\/0"
                                   },
                                   {
                                       "rank": 1,
                                       "name": "60zxl02",
                                       "addr": "10.98.2.167:6789\/0"
                                   },
                                   {
                                       "rank": 2,
                                       "name": "610wl02",
                                       "addr": "10.98.2.173:6789\/0"
                                   },
                                   {
                                       "rank": 3,
                                       "name": "618yl02",
                                       "addr": "10.98.2.214:6789\/0"
                                   },
                                   {
                                       "rank": 4,
                                       "name": "615yl02",
                                       "addr": "10.98.2.216:6789\/0"
                                   }
                               ]
                           }
                      }

                      The other 2 monitors became peons:

                      "name": "60zxl02",
                           "rank": 1,
                           "state": "peon",
                           "election_epoch": 11328,
                           "quorum": [
                               0,
                               1,
                               2
                           ],

                      "name": "610wl02",
                           "rank": 2,
                           "state": "peon",
                           "election_epoch": 11328,
                           "quorum": [
                               0,
                               1,
                               2
                           ],
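
As a side note on the numbers in these outputs: a quorum needs a strict majority of the monitors listed in the monmap, and a powered-off monitor still counts towards that total. A minimal sketch of the arithmetic, using the "quorum" and "monmap" fields shown above (the function names are made up):

```python
def majority(monmap_size):
    """Smallest number of monitors that forms a strict majority."""
    return monmap_size // 2 + 1


def quorum_summary(mon_status):
    """Summarise quorum membership from one parsed mon_status output."""
    total = len(mon_status["monmap"]["mons"])
    in_quorum = len(mon_status["quorum"])
    needed = majority(total)
    return {
        "monmap_size": total,
        "needed": needed,
        "in_quorum": in_quorum,
        "has_quorum": in_quorum >= needed,
    }
```

For the leader's output above this gives a monmap size of 5, 3 monitors needed and 3 in quorum, so with 618yl02 down the cluster can tolerate only one more monitor being unavailable.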

                      I then proceeded to start the fourth monitor, 615yl02
                      (618yl02 is powered off), but after more than 2 hours and
                      several election rounds, the monitors still haven't
                      reached a quorum. The monitors alternate mostly between
                      the "electing" and "probing" states, but they often seem
                      to be in different election epochs.

                      Is this normal?

                      Is there anything I can do to help the monitors elect a
                      leader? Should I manually remove the dead host's monitor
                      from the monitor map?

                      I left all OSD daemons stopped on purpose while the
                      election is going on. Is this the best thing to do? Would
                      bringing the OSDs up help or complicate matters even
                      more? Or doesn't it make any difference?

                      I don't see anything obviously wrong in the monitor logs.
                      They're mostly filled with messages like the following:

                      2016-07-25 14:17:57.806148 7fc1b3f7e700  1 mon.610wl02@2(electing).elector(11411) init, last seen epoch 11411
                      2016-07-25 14:17:57.829198 7fc1b7caf700  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
                      2016-07-25 14:17:57.829200 7fc1b7caf700  0 log_channel(audit) do_log log to syslog
                      2016-07-25 14:17:57.829254 7fc1b7caf700  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished

                      Any help would be hugely appreciated.

                      Thanks,

                      Sergio

                 _______________________________________________
                 ceph-users mailing list
                 ceph-users@xxxxxxxxxxxxxx
                 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
