On 07/25/2016 04:34 PM, Sergio A. de Carvalho Jr. wrote:
Thanks, Joao.
All monitors have the exact same monmap.
I suspect you're right that there might be some communication problem, though. I stopped monitor 1 (60zxl02), but the other 3 monitors still failed to reach a quorum. I could see monitor 0 was still declaring victory, but the others kept calling for new elections:
2016-07-25 15:18:59.775144 7f8760af7700 0 log_channel(cluster) log [INF] : mon.60z0m02@0 won leader election with quorum 0,2,4

2016-07-25 15:18:54.702176 7fc1b357d700 1 mon.610wl02@2(electing) e5 handle_timecheck drop unexpected msg
2016-07-25 15:18:54.704526 7fc1b357d700 1 mon.610wl02@2(electing).data_health(11626) service_dispatch not in quorum -- drop message
2016-07-25 15:19:09.792511 7fc1b3f7e700 1 mon.610wl02@2(peon).paxos(paxos recovering c 1318755..1319322) lease_timeout -- calling new election
2016-07-25 15:19:09.792825 7fc1b357d700 0 log_channel(cluster) log [INF] : mon.610wl02 calling new monitor election
I'm curious about the "handle_timecheck drop unexpected msg" message.
Timechecks (i.e., checking for clock skew), as well as the data_health service (which makes sure you have enough disk space in the mon data dir), are only run when you have a quorum. If such a message is received by a monitor not in a quorum, regardless of its state, it will be dropped.
Assuming you now took one of the self-appointed leaders out - by shutting it down, for instance - you should next check what's causing elections not to hold.
In these cases, assuming your 3 monitors do form a quorum, the usual culprit tends to be 'lease timeouts': the leader fails to provide a lease extension on paxos for the peons, and the peons assume the leader has failed in some form (unresponsive, down, whatever).
Above, it does seem a lease timeout was triggered on a peon. This may have happened because:

1. the leader did not extend the lease;
2. the leader did extend the lease, but the lease was in the past - usually an indication of clock skew on the leader, on the peons, or both (a quick way to check this is sketched right after this list);
3. the leader did extend the lease with the correct time, but the peon failed to dispatch the message on time.
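For the clock-skew case (2.), note that 'ceph health detail' needs a quorum to answer, so it's easier to compare the clocks on the monitor hosts directly. A minimal sketch, assuming you can ssh between the hosts (the hostnames are the reachable ones from your monmap, adjust as needed):

    # rough comparison of the wall clocks on all monitor hosts
    for h in 60z0m02 60zxl02 610wl02 615yl02; do
      ssh $h 'echo -n "$(hostname): "; date +%s.%N'
    done

    # on each host, confirm ntpd is actually synchronized
    ntpq -p

If the clocks differ by more than a fraction of a second, fix NTP first before chasing the other causes.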
Both 1. and 2. may be due to several factors, but most commonly it's because the monitor was stuck doing something - and that something, more often than not, is leveldb. If this is the case, check the size of your leveldb store. If it is over 5 or 6 GB in size, you may need to have the monitor compact the store ('mon compact on start = true', iirc).
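In practice that boils down to something like the below; the paths assume the default mon data dir layout and the default cluster name, so adjust the mon id to yours:

    # size of this monitor's leveldb store
    du -sh /var/lib/ceph/mon/ceph-60z0m02/store.db

    # in ceph.conf on the monitor host, to compact on the next restart
    [mon]
        mon compact on start = true

Then restart the monitor and give it some time; compacting a multi-GB store can take a few minutes, during which the monitor will look unresponsive.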
HTH
-Joao
On Mon, Jul 25, 2016 at 4:10 PM, Joao Eduardo Luis <joao@xxxxxxx> wrote:
On 07/25/2016 03:41 PM, Sergio A. de Carvalho Jr. wrote:
In the logs, 2 monitors are constantly reporting that they won the leader election:
60z0m02 (monitor 0):
2016-07-25 14:31:11.644335 7f8760af7700 0 log_channel(cluster) log [INF] : mon.60z0m02@0 won leader election with quorum 0,2,4
2016-07-25 14:31:44.521552 7f8760af7700 1 mon.60z0m02@0(leader).paxos(paxos recovering c 1318755..1319320) collect timeout, calling fresh election
60zxl02 (monitor 1):
2016-07-25 14:31:59.542346 7fefdeaed700 1 mon.60zxl02@1(electing).elector(11441) init, last seen epoch 11441
2016-07-25 14:32:04.583929 7fefdf4ee700 0 log_channel(cluster) log [INF] : mon.60zxl02@1 won leader election with quorum 1,2,4
2016-07-25 14:32:33.440103 7fefdf4ee700 1 mon.60zxl02@1(leader).paxos(paxos recovering c 1318755..1319319) collect timeout, calling fresh election
There are two likely scenarios to explain this:
1. The monitors have different monitors in their monmaps - this could happen if you didn't add the new monitor via 'ceph mon add'. You can check this by running 'ceph daemon mon.<ID> mon_status' for each of the monitors in the cluster (see the sketch after this list).
2. Some of the monitors are unable to communicate with each other, and thus will never acknowledge the same leader. This does not mean you have two leaders for the same cluster, but it does mean you will end up with two monitors declaring victory and becoming the self-proclaimed leader of the cluster. The peons should still only belong to one quorum.
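For 1., given the monitors are not in quorum, the cluster-wide 'ceph' commands won't be of much use; query each monitor through its local admin socket instead. A minimal sketch, assuming the default admin socket location and the mon ids from your monmap:

    # on each monitor host, dump that monitor's view of the cluster
    ceph daemon mon.60z0m02 mon_status

    # equivalently, pointing at the socket directly
    ceph --admin-daemon /var/run/ceph/ceph-mon.60z0m02.asok mon_status

The "mons" array under "monmap" should list the same five names, ranks and addresses on every monitor; any mismatch there would explain the diverging elections.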
If this does not help you, try setting 'debug mon = 10' and 'debug
ms = 1' on the monitors and check the logs, making sure the monitors
get the probes and follow the election process. If you need further
assistance, put those logs online somewhere we can access them and
we'll try to help you out.
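Note that 'ceph tell ... injectargs' also needs a working quorum, so raise the debug levels either through the admin socket or in ceph.conf before restarting. A sketch, assuming the default socket naming:

    # on a running monitor, via its admin socket
    ceph daemon mon.60z0m02 config set debug_mon 10
    ceph daemon mon.60z0m02 config set debug_ms 1

    # or persistently, in ceph.conf on the monitor hosts
    [mon]
        debug mon = 10
        debug ms = 1

Remember to turn these back down afterwards; 'debug ms = 1' in particular makes the logs grow quickly.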
-Joao
On Mon, Jul 25, 2016 at 3:27 PM, Sergio A. de Carvalho Jr. <scarvalhojr@xxxxxxxxx> wrote:
Hi,
I have a cluster of 5 hosts running Ceph 0.94.6 on CentOS 6.5. On each host, there is 1 monitor and 13 OSDs. We had an issue with the network and, for some reason I still don't know, the servers were restarted. One host is still down, but the monitors on the 4 remaining servers are failing to enter a quorum.
I managed to get a quorum of 3 monitors by stopping all Ceph monitors and OSDs across all machines, and bringing up the top 3 ranked monitors in order of rank. After a few minutes, the 60z0m02 monitor (the top ranked one) became the leader:
{
    "name": "60z0m02",
    "rank": 0,
    "state": "leader",
    "election_epoch": 11328,
    "quorum": [
        0,
        1,
        2
    ],
    "outside_quorum": [],
    "extra_probe_peers": [],
    "sync_provider": [],
    "monmap": {
        "epoch": 5,
        "fsid": "2f51a247-3155-4bcf-9aee-c6f6b2c5e2af",
        "modified": "2016-04-28 22:26:48.604393",
        "created": "0.000000",
        "mons": [
            {
                "rank": 0,
                "name": "60z0m02",
                "addr": "10.98.2.166:6789\/0"
            },
            {
                "rank": 1,
                "name": "60zxl02",
                "addr": "10.98.2.167:6789\/0"
            },
            {
                "rank": 2,
                "name": "610wl02",
                "addr": "10.98.2.173:6789\/0"
            },
            {
                "rank": 3,
                "name": "618yl02",
                "addr": "10.98.2.214:6789\/0"
            },
            {
                "rank": 4,
                "name": "615yl02",
                "addr": "10.98.2.216:6789\/0"
            }
        ]
    }
}
The other 2 monitors became peons:
"name": "60zxl02",
"rank": 1,
"state": "peon",
"election_epoch": 11328,
"quorum": [
0,
1,
2
],
"name": "610wl02",
"rank": 2,
"state": "peon",
"election_epoch": 11328,
"quorum": [
0,
1,
2
],
I then proceeded to start the fourth monitor, 615yl02 (618yl02 is powered off), but after more than 2 hours and several election rounds, the monitors still haven't reached a quorum. The monitors mostly alternate between the "electing" and "probing" states, but they often seem to be in different election epochs.
Is this normal? Is there anything I can do to help the monitors elect a leader? Should I manually remove the dead host's monitor from the monitor map?

I purposely left all OSD daemons stopped while the election is going on. Is this the best thing to do? Would bringing the OSDs up help or complicate matters even more, or doesn't it make any difference?
I don't see anything obviously wrong in the monitor logs. They're mostly filled with messages like the following:
2016-07-25 14:17:57.806148 7fc1b3f7e700 1 mon.610wl02@2(electing).elector(11411) init, last seen epoch 11411
2016-07-25 14:17:57.829198 7fc1b7caf700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
2016-07-25 14:17:57.829200 7fc1b7caf700 0 log_channel(audit) do_log log to syslog
2016-07-25 14:17:57.829254 7fc1b7caf700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
Any help would be hugely appreciated.
Thanks,
Sergio
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com