Re: Monitors not reaching quorum

On 07/26/2016 06:27 PM, Sergio A. de Carvalho Jr. wrote:
(Just realised I originally replied to Sean directly, so reposting here
for posterity).

Bingo!

wow. This didn't even cross my mind. D:

Thanks for sharing.


I turned off syslog and the monitors quickly reached quorum and
everything seems back to normal. Thanks so much, Sean.

Luckily this is a test cluster. I wonder how I could catch this in a
production cluster before our support engineers spend a day trying to
track the problem down.

The only way I can see to deal with this sort of thing would be to log to syslog on a separate thread, and have that thread monitor the latency when writing to syslog.

I don't think there's currently any support for that. I'll try to get something concocted this week, mostly for the fun of it.
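A minimal sketch of that idea in plain Python (not Ceph code; the class and threshold names are made up for illustration): log writes go through a queue to a dedicated thread, which times each write, so a stalled syslog destination shows up as a latency counter instead of silently blocking the caller.

```python
# Sketch only: queue-based async writer that tracks syslog write latency.
import queue
import threading
import time


class AsyncSyslogWriter:
    def __init__(self, write_fn, warn_after=1.0):
        self.write_fn = write_fn      # e.g. syslog.syslog in real use
        self.warn_after = warn_after  # seconds before a write counts as slow
        self.slow_writes = 0          # real code would raise a health warning
        self._q = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def log(self, msg):
        # The caller never blocks on syslog; it only enqueues.
        self._q.put(msg)

    def _run(self):
        while True:
            msg = self._q.get()
            if msg is None:  # sentinel from close()
                return
            start = time.monotonic()
            self.write_fn(msg)
            if time.monotonic() - start > self.warn_after:
                self.slow_writes += 1

    def close(self):
        # Flush pending messages, then stop the worker.
        self._q.put(None)
        self._thread.join()
```

In real code the worker would feed a health metric rather than a bare counter, so a slow central log server surfaces before the monitors grind to a halt.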

  -Joao


Any ideas?

On Tue, Jul 26, 2016 at 12:28 PM, Sean Crosby
<richardnixonshead@xxxxxxxxx> wrote:

    Hi Sergio,

    You don't happen to have rsyslog forwarding to a central log server
    by any chance? I've seen this behaviour before when my central log
    server is not keeping up with messages.

    Cheers,
    Sean

    On 26 July 2016 at 21:13, Sergio A. de Carvalho Jr.
    <scarvalhojr@xxxxxxxxx> wrote:

        I left the 4 nodes running overnight and they just crawled to their knees... to the point that nothing has been written to the logs in the last 11 hours. So I stopped all monitors this morning and started them one by one again, but they're still extremely slow. Here are their logs:

        https://gist.github.com/anonymous/85213467f701c5a69c7fdb4e54bc7406
        https://gist.github.com/anonymous/f30a8903e701423825fd4d5aaa651e6a
        https://gist.github.com/anonymous/42a1856cc819de5b110d9f887e9859d2
        https://gist.github.com/anonymous/652bc41197e83a9d76cf5b2e6a211aa2

        I'm still puzzled to see logs being written with a timestamp that is several minutes behind the system clock. As time passes, the gap widens, and quickly the logs are over 10 minutes behind the actual time, which explains why the logs above don't seem to overlap.
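That widening gap is itself a usable production signal (per the earlier question about catching this before a day of debugging). A hypothetical watchdog, not part of Ceph, could parse the timestamp of the newest log line and compare it with the wall clock:

```python
# Hypothetical helper: measure how far a ceph-mon log line's timestamp
# lags behind the wall clock. A steadily growing lag indicates log writes
# backing up behind a slow target such as an overloaded syslog server.
from datetime import datetime


def log_lag_seconds(log_line, now=None):
    """Seconds between a mon log line's timestamp and 'now'."""
    # Mon log lines start with e.g. "2016-07-25 15:19:09.792511 ..."
    stamp = " ".join(log_line.split()[:2])
    logged = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")
    if now is None:
        now = datetime.now()
    return (now - logged).total_seconds()
```

Tailing the newest line and alerting when the lag exceeds, say, a minute would flag a drift like the 28 minutes described below long before quorum falls apart.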



        On Mon, Jul 25, 2016 at 9:37 PM, Sergio A. de Carvalho Jr.
        <scarvalhojr@xxxxxxxxx> wrote:

            Awesome, thanks so much, Joao.

            Here's the mon_status:
            https://gist.github.com/anonymous/2b80a9a75d134d9e539dfbc81615c055

            I'm still trying to collect the logs, but while doing that I noticed that the log records are severely delayed compared to the system clock. For example, watching the logs with tail -f, I see records with a timestamp that is up to 28 minutes behind the system clock!

            Also, while trying to set the debug level, the monitors sometimes hung for several minutes, so there's obviously something wrong with them.


            On Mon, Jul 25, 2016 at 6:16 PM, Joao Eduardo Luis
            <joao@xxxxxxx> wrote:

                On 07/25/2016 05:55 PM, Sergio A. de Carvalho Jr. wrote:

                    I just forced an NTP update on all hosts to be sure it's down to clock skew. I also checked that hosts can reach all other hosts on port 6789.

                    I then stopped monitor 0 (60z0m02) and started monitor 1 (60zxl02), but the 3 monitors left (1 - 60zxl02, 2 - 610wl02, 4 - 615yl02) were still having problems reaching quorum. That led me to believe monitor 4 was the problem, because I had a quorum before with monitors 0, 1, 2.

                    So I stopped monitor 4 and started monitor 0 again, but this time monitors 0, 1, 2 failed to reach a quorum, which is rather puzzling.

                    All hosts are pretty much idle all the time, so I can't see why the monitors would be getting stuck.


                Grab 'ceph daemon mon.<ID> mon_status' from all monitors, set 'debug mon = 10', 'debug paxos = 10', and 'debug ms = 1', and grab logs from the monitors for the periods before, during, and after election (ideally for at least two election cycles). Put them up somewhere on the interwebs and send us a link.
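For completeness, a sketch of how those settings can be applied on a Hammer-era cluster (command forms as I recall them; verify against your version). The admin socket responds even when the monitors have no quorum, which is why 'ceph daemon' on each host is used rather than a cluster-wide 'ceph tell':

```shell
# Run on each monitor host; replace <ID> with that host's mon id
# (e.g. 60z0m02). The admin socket works without quorum.
ceph daemon mon.<ID> mon_status

# Raise the debug levels for the subsystems mentioned above:
ceph daemon mon.<ID> config set debug_mon 10
ceph daemon mon.<ID> config set debug_paxos 10
ceph daemon mon.<ID> config set debug_ms 1
```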

                If you don't feel comfortable putting that link in the
                list, send me an email directly with the url.

                I'll be happy to take a look later tonight.

                   -Joao




                    On Mon, Jul 25, 2016 at 5:18 PM, Joao Eduardo Luis
                    <joao@xxxxxxx> wrote:

                         On 07/25/2016 04:34 PM, Sergio A. de Carvalho Jr. wrote:

                             Thanks, Joao.

                             All monitors have the exact same mon map.

                             I suspect you're right that there might be some communication problem, though. I stopped monitor 1 (60zxl02), but the other 3 monitors still failed to reach a quorum. I could see monitor 0 was still declaring victory, but the others were always calling for new elections:

                             2016-07-25 15:18:59.775144 7f8760af7700  0 log_channel(cluster) log [INF] : mon.60z0m02@0 won leader election with quorum 0,2,4

                             2016-07-25 15:18:54.702176 7fc1b357d700  1 mon.610wl02@2(electing) e5 handle_timecheck drop unexpected msg
                             2016-07-25 15:18:54.704526 7fc1b357d700  1 mon.610wl02@2(electing).data_health(11626) service_dispatch not in quorum -- drop message
                             2016-07-25 15:19:09.792511 7fc1b3f7e700  1 mon.610wl02@2(peon).paxos(paxos recovering c 1318755..1319322) lease_timeout -- calling new election
                             2016-07-25 15:19:09.792825 7fc1b357d700  0 log_channel(cluster) log [INF] : mon.610wl02 calling new monitor election

                             I'm curious about the "handle_timecheck drop unexpected msg" message.


                         timechecks (i.e., checking for clock skew), as well as the data_health service (which makes sure you have enough disk space in the mon data dir), are only run when you have a quorum. If a message is received by a monitor not in a quorum, regardless of state, it will be dropped.

                         Assuming you now took one of the self-appointed leaders out - by shutting it down, for instance - you should check what's causing elections not to hold.

                         In these cases, assuming your 3 monitors do form a quorum, the traditional issue tends to be 'lease timeouts'. I.e., the leader fails to provide a lease extension on paxos for the peons, and the peons assume the leader failed in some form (unresponsive, down, whatever).

                         Above, it does seem a lease timeout was triggered on a peon. This may have happened because:

                         1. the leader did not extend the lease
                         2. the leader did extend the lease, but the lease was in the past - usually an indication of clock skew on the leader, on the peons, or both.
                         3. the leader did extend the lease with the correct time, but the peon failed to dispatch the message on time.

                         Both 1. and 2. may be due to several factors, but most commonly it's because the monitor was stuck doing something. This something, more often than not, is leveldb. If this is the case, check the size of your leveldb. If it is over 5 or 6 GB in size, you may need to manually compact the store (mon compact on start = true, iirc).
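A sketch of how one might check and act on that (path and option name assume a default Hammer-era deployment; verify against your setup):

```shell
# Check the size of each monitor's leveldb store (default path shown;
# adjust to your mon data dir if it differs)
du -sh /var/lib/ceph/mon/*/store.db

# If it is over ~5-6 GB, enable compaction at startup in ceph.conf and
# restart the monitor:
#
#   [mon]
#   mon compact on start = true
```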

                         HTH

                            -Joao




                             On Mon, Jul 25, 2016 at 4:10 PM, Joao Eduardo Luis
                             <joao@xxxxxxx> wrote:

                                  On 07/25/2016 03:41 PM, Sergio A. de Carvalho Jr. wrote:

                                      In the logs, these 2 monitors are constantly reporting that they won the leader election:

                                      60z0m02 (monitor 0):
                                      2016-07-25 14:31:11.644335 7f8760af7700  0 log_channel(cluster) log [INF] : mon.60z0m02@0 won leader election with quorum 0,2,4
                                      2016-07-25 14:31:44.521552 7f8760af7700  1 mon.60z0m02@0(leader).paxos(paxos recovering c 1318755..1319320) collect timeout, calling fresh election

                                      60zxl02 (monitor 1):
                                      2016-07-25 14:31:59.542346 7fefdeaed700  1 mon.60zxl02@1(electing).elector(11441) init, last seen epoch 11441
                                      2016-07-25 14:32:04.583929 7fefdf4ee700  0 log_channel(cluster) log [INF] : mon.60zxl02@1 won leader election with quorum 1,2,4
                                      2016-07-25 14:32:33.440103 7fefdf4ee700  1 mon.60zxl02@1(leader).paxos(paxos recovering c 1318755..1319319) collect timeout, calling fresh election



                                  There are two likely scenarios to explain this:

                                  1. The monitors have different monitors in their monmaps - this could happen if you didn't add the new monitor via 'ceph mon add'. You can check this by running 'ceph daemon mon.<ID> mon_status' for each of the monitors in the cluster.

                                  2. Some of the monitors are unable to communicate with each other, and thus will never acknowledge the same leader. This does not mean you have two leaders for the same cluster, but it does mean that you will end up with two monitors declaring victory and becoming the self-proclaimed leader in the cluster. The peons should still only belong to one quorum.

                                  If this does not help you, try setting 'debug mon = 10' and 'debug ms = 1' on the monitors and check the logs, making sure the monitors get the probes and follow the election process. If you need further assistance, put those logs online somewhere we can access them and we'll try to help you out.

                                     -Joao



                                      On Mon, Jul 25, 2016 at 3:27 PM, Sergio A. de Carvalho Jr.
                                      <scarvalhojr@xxxxxxxxx> wrote:

                                           Hi,

                                           I have a cluster of 5 hosts running Ceph 0.94.6 on CentOS 6.5. On each host, there is 1 monitor and 13 OSDs. We had an issue with the network and, for a reason I still don't know, the servers were restarted. One host is still down, but the monitors on the 4 remaining servers are failing to enter a quorum.

                                           I managed to get a quorum of 3 monitors by stopping all Ceph monitors and OSDs across all machines, and bringing up the top 3 ranked monitors in order of rank. After a few minutes, the 60z0m02 monitor (the top ranked one) became the leader:

                                           {
                                                "name": "60z0m02",
                                                "rank": 0,
                                                "state": "leader",
                                                "election_epoch": 11328,
                                                "quorum": [
                                                    0,
                                                    1,
                                                    2
                                                ],
                                                "outside_quorum": [],
                                                "extra_probe_peers": [],
                                                "sync_provider": [],
                                                "monmap": {
                                                    "epoch": 5,
                                                    "fsid": "2f51a247-3155-4bcf-9aee-c6f6b2c5e2af",
                                                    "modified": "2016-04-28 22:26:48.604393",
                                                    "created": "0.000000",
                                                    "mons": [
                                                        {
                                                            "rank": 0,
                                                            "name": "60z0m02",
                                                            "addr": "10.98.2.166:6789\/0"
                                                        },
                                                        {
                                                            "rank": 1,
                                                            "name": "60zxl02",
                                                            "addr": "10.98.2.167:6789\/0"
                                                        },
                                                        {
                                                            "rank": 2,
                                                            "name": "610wl02",
                                                            "addr": "10.98.2.173:6789\/0"
                                                        },
                                                        {
                                                            "rank": 3,
                                                            "name": "618yl02",
                                                            "addr": "10.98.2.214:6789\/0"
                                                        },
                                                        {
                                                            "rank": 4,
                                                            "name": "615yl02",
                                                            "addr": "10.98.2.216:6789\/0"
                                                        }
                                                    ]
                                                }
                                           }

                                           The other 2 monitors became peons:

                                           "name": "60zxl02",
                                                "rank": 1,
                                                "state": "peon",
                                                "election_epoch": 11328,
                                                "quorum": [
                                                    0,
                                                    1,
                                                    2
                                                ],

                                           "name": "610wl02",
                                                "rank": 2,
                                                "state": "peon",
                                                "election_epoch": 11328,
                                                "quorum": [
                                                    0,
                                                    1,
                                                    2
                                                ],

                                           I then proceeded to start the fourth monitor, 615yl02 (618yl02 is powered off), but after more than 2 hours and several election rounds, the monitors still haven't reached a quorum. The monitors alternate mostly between "election" and "probing" states, but they often seem to be in different election epochs.

                                           Is this normal?

                                           Is there anything I can do to help the monitors elect a leader? Should I manually remove the dead host's monitor from the monitor map?

                                           I left all OSD daemons stopped while the election is going on, on purpose. Is this the best thing to do? Would bringing the OSDs up help or complicate matters even more? Or doesn't it make any difference?

                                           I don't see anything obviously wrong in the monitor logs. They're mostly filled with messages like the following:

                                           2016-07-25 14:17:57.806148 7fc1b3f7e700  1 mon.610wl02@2(electing).elector(11411) init, last seen epoch 11411
                                           2016-07-25 14:17:57.829198 7fc1b7caf700  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
                                           2016-07-25 14:17:57.829200 7fc1b7caf700  0 log_channel(audit) do_log log to syslog
                                           2016-07-25 14:17:57.829254 7fc1b7caf700  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished

                                           Any help would be hugely appreciated.

                                           Thanks,

                                           Sergio





                      _______________________________________________
                      ceph-users mailing list
                      ceph-users@xxxxxxxxxxxxxx
                      http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




















