Re: Monitors not reaching quorum

On 07/26/2016 12:13 PM, Sergio A. de Carvalho Jr. wrote:
I left the 4 nodes running overnight and they slowed to a crawl... to the
point that nothing has been written to the logs in the last 11 hours. So I
stopped all monitors this morning and started them one by one again, but
they're still extremely slow. Here are their logs:

https://gist.github.com/anonymous/85213467f701c5a69c7fdb4e54bc7406
https://gist.github.com/anonymous/f30a8903e701423825fd4d5aaa651e6a
https://gist.github.com/anonymous/42a1856cc819de5b110d9f887e9859d2
https://gist.github.com/anonymous/652bc41197e83a9d76cf5b2e6a211aa2

I'm still puzzled to see logs being written with a timestamp that is
several minutes behind the system clock. As time passes, the gap widens
and quickly the logs are over 10 minutes behind the actual time, which
explains why the logs above don't seem to overlap.

There's something fishy going on with the elected leader (60zxl02).

As per the excerpt below (if it doesn't get too mangled by wrapping), you can see that it ends up spending a lot of time establishing a connection with a client.

Only after spending roughly 26 seconds does it initiate paxos recovery. By the time the peons receive the collect messages from the leader, their timeouts would have expired roughly 16 seconds prior (give or take).


2016-07-26 10:38:11.808417 7fc499a9c700 1 -- 10.98.2.167:6789/0 <== client.? 10.98.3.15:0/4187059021 1 ==== auth(proto 0 31 bytes epoch 0) v1 ==== 61+0+0 (535495083 0 0) 0x388e300 con 0x33f40a0
2016-07-26 10:38:28.708416 7fc497b98700 0 -- 10.98.2.167:6789/0 >> 10.98.2.214:6789/0 pipe(0x37a8000 sd=20 :0 s=1 pgs=0 cs=0 l=0 c=0x33f1ce0).fault
2016-07-26 10:38:32.810085 7fc497794700 1 -- 10.98.2.167:6789/0 >> :/0 pipe(0x3624800 sd=20 :6789 s=0 pgs=0 cs=0 l=0 c=0x33f4780).accept sd=20 10.98.3.15:44824/0
2016-07-26 10:38:32.810163 7fc497794700 10 mon.60zxl02@1(leader) e5 ms_verify_authorizer 10.98.3.15:0/4187059021 client protocol 0
2016-07-26 10:38:37.711864 7fc49a49d700 1 -- 10.98.2.167:6789/0 --> 10.98.2.167:6789/0 -- log(1 entries) v1 -- ?+0 0x388de80 con 0x33f11e0
2016-07-26 10:38:37.711876 7fc49a49d700 10 mon.60zxl02@1(leader).paxos(paxos recovering c 1318755..1319324) leader_init -- starting paxos recovery
2016-07-26 10:38:37.739541 7fc49a49d700 10 mon.60zxl02@1(leader).paxos(paxos recovering c 1318755..1319324) get_new_proposal_number = 470401
2016-07-26 10:38:37.739548 7fc49a49d700 10 mon.60zxl02@1(leader).paxos(paxos recovering c 1318755..1319324) collect with pn 470401
2016-07-26 10:38:37.739554 7fc49a49d700 1 -- 10.98.2.167:6789/0 --> mon.2 10.98.2.173:6789/0 -- paxos(collect lc 1319324 fc 1318755 pn 470401 opn 0) v3 -- ?+0 0x34fa800
2016-07-26 10:38:37.739566 7fc49a49d700 1 -- 10.98.2.167:6789/0 --> mon.4 10.98.2.216:6789/0 -- paxos(collect lc 1319324 fc 1318755 pn 470401 opn 0) v3 -- ?+0 0x34fcd80


Also, we don't have enough logs to ascertain what's going on with your other monitor, 60z0m02, but it never seems to participate in elections or answer probes, so I'm inclined to assume it was either shut down when these elections happened or is simply unable to communicate with the other monitors (even though you mentioned you tested communication between nodes on port 6789).

My suggestion at this point is to try to form a quorum with just mon.610wl02 and mon.615yl02, as they seem to be the ones behaving properly. Then add a new monitor from the remaining set and debug as you go.

To accomplish this you will have to modify the monmaps. I recommend taking a look at [1], but put simply: shut down all your monitors, extract the monmap from one of them, modify it so that only the two intended monitors remain in the map, and then reinject that map into both monitors. Roughly along these lines:
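
A rough sketch of the procedure (monitor names taken from your monmap,
paths here are just examples; all monitors must be stopped first):

  # on one of the two surviving monitors, extract the current map
  ceph-mon -i 610wl02 --extract-monmap /tmp/monmap

  # drop everyone except 610wl02 and 615yl02
  monmaptool /tmp/monmap --rm 60z0m02
  monmaptool /tmp/monmap --rm 60zxl02
  monmaptool /tmp/monmap --rm 618yl02
  monmaptool /tmp/monmap --print        # sanity check: only the two remain

  # inject the trimmed map into both remaining monitors
  # (copy /tmp/monmap to the other host first)
  ceph-mon -i 610wl02 --inject-monmap /tmp/monmap
  ceph-mon -i 615yl02 --inject-monmap /tmp/monmap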

Once they form quorum, you can add new monitors to the cluster.

You won't be able to add the other monitors you currently have in their present form (they have marked themselves as having been part of a quorum, and there are safeguards preventing them from joining). You will need to recreate those monitors; a rough sketch follows.
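
A rough sketch for recreating one of them later (assumes the default data
dir layout, the sysvinit scripts on your CentOS boxes, and that the
two-monitor quorum is already up; adjust names, addresses and paths):

  # with the old daemon stopped, move its stale data aside
  mv /var/lib/ceph/mon/ceph-60zxl02 /var/lib/ceph/mon/ceph-60zxl02.old

  # fetch the current monmap and mon. keyring from the healthy quorum
  ceph mon getmap -o /tmp/monmap
  ceph auth get mon. -o /tmp/mon.keyring

  # rebuild the data dir, add the monitor back to the map, and start it;
  # it will then sync from the existing quorum
  ceph-mon -i 60zxl02 --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
  ceph mon add 60zxl02 10.98.2.167:6789
  service ceph start mon.60zxl02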

Then again, my bet is on something weird happening in your network; I just can't really tell what.

  -Joao

[1] http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/#removing-monitors-from-an-unhealthy-cluster



On Mon, Jul 25, 2016 at 9:37 PM, Sergio A. de Carvalho Jr.
<scarvalhojr@xxxxxxxxx> wrote:

    Awesome, thanks so much, Joao.

    Here's the mon_status:
    https://gist.github.com/anonymous/2b80a9a75d134d9e539dfbc81615c055

    I'm still trying to collect the logs, but while doing that I noticed
    that the log records are severely delayed compared to the system
    clock. For example, watching the logs with tail -f, I see records
    with a timestamp that is up to 28 minutes behind the system clock!

    Also, while trying to set the debug level, the monitors sometimes hung
    for several minutes, so there's obviously something wrong with them.


    On Mon, Jul 25, 2016 at 6:16 PM, Joao Eduardo Luis <joao@xxxxxxx> wrote:

        On 07/25/2016 05:55 PM, Sergio A. de Carvalho Jr. wrote:

            I just forced an NTP update on all hosts to rule out clock
            skew. I also checked that all hosts can reach each other on
            port 6789.
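
            (For example, checks along these lines from each host -- a
            sketch, not necessarily the exact commands used:)

            ntpq -p                                   # is NTP actually in sync?
            nc -z -w 3 10.98.2.167 6789 && echo "mon port reachable"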

            I then stopped monitor 0 (60z0m02) and started monitor 1
            (60zxl02), but
            the 3 monitors left (1 - 60zxl02, 2 - 610wl02, 4 - 615yl02)
            were still
            having problems reaching quorum. That led me to believe
            monitor 4 was
            the problem because I had a quorum before with monitors 0, 1, 2.

            So I stopped monitor 4 and started monitor 0 again, but this
            time
            monitors 0, 1, 2 failed to reach a quorum, which is rather
            puzzling.

            All hosts are pretty much idle all the time so I can't see
            why monitors
            would be getting stuck.


        Grab 'ceph daemon mon.<ID> mon_status' from all monitors, set
        'debug mon = 10', 'debug paxos = 10', 'debug ms = 1', grab logs
        from the monitors for the periods before, during and after
        election (ideally for at least two election cycles). Put them up
        somewhere on the interwebs and send us a link.
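
        For example, to bump the debug levels at runtime (a sketch; the
        admin socket responds even when a monitor is out of quorum):

        ceph daemon mon.<ID> config set debug_mon 10
        ceph daemon mon.<ID> config set debug_paxos 10
        ceph daemon mon.<ID> config set debug_ms 1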

        If you don't feel comfortable putting that link in the list,
        send me an email directly with the url.

        I'll be happy to take a look later tonight.

           -Joao




            On Mon, Jul 25, 2016 at 5:18 PM, Joao Eduardo Luis
            <joao@xxxxxxx> wrote:

                 On 07/25/2016 04:34 PM, Sergio A. de Carvalho Jr. wrote:

                     Thanks, Joao.

                     All monitors have the exact same monmap.

                     I suspect you're right that there might be some
            communication
                     problem
                     though. I stopped monitor 1 (60zxl02), but the
            other 3 monitors
                     still
                     failed to reach a quorum. I could see monitor 0 was
            still declaring
                     victory but the others were always calling for new
            elections:

                     2016-07-25 15:18:59.775144 7f8760af7700  0
            log_channel(cluster) log
                     [INF] : mon.60z0m02@0 won leader election with
            quorum 0,2,4


                     2016-07-25 15:18:54.702176 7fc1b357d700  1
                     mon.610wl02@2(electing) e5
                     handle_timecheck drop unexpected msg
                     2016-07-25 15:18:54.704526 7fc1b357d700  1
                     mon.610wl02@2(electing).data_health(11626)
            service_dispatch not in
                     quorum -- drop message
                      2016-07-25 15:19:09.792511 7fc1b3f7e700  1
                      mon.610wl02@2(peon).paxos(paxos recovering c 1318755..1319322)
                      lease_timeout -- calling new election
                      2016-07-25 15:19:09.792825 7fc1b357d700  0
                      log_channel(cluster) log
                      [INF] : mon.610wl02 calling new monitor election


                     I'm curious about the "handle_timecheck drop
            unexpected msg"
                     message.


                 timechecks (i.e., checking for clock skew), as well as the
                 data_health service (which makes sure you have enough
            disk space in
                 the mon data dir) are only run when you have a quorum.
            If a message
                 is received by a monitor not in a quorum, regardless of
            state, it
                 will be dropped.

                 Assuming you now took one of the self-appointed leaders
                 out - by shutting it down, for instance - you should
                 check what's causing elections not to hold.

                 In these cases, assuming your 3 monitors do form a
            quorum, the
                 traditional issue tends to be 'lease timeouts'. I.e.,
            the leader
                 fails to provide a lease extension on paxos for the
            peons, and the
                 peons assume the leader failed in some form
            (unresponsive, down,
                 whatever).

                 Above it does seem a lease timeout was triggered on a
            peon. This may
                 have happened because:

                 1. leader did not extend the lease
                 2. leader did extend the lease but lease was in the
            past - usually
                 indication of a clock skew on the leader, on the peons,
            or both.
                 3. leader did extend the lease with the correct time,
                 but the peon failed to dispatch the message on time.

                 Both 1. and 2. may be due to several factors, but most
            commonly it's
                 because the monitor was stuck doing something. This
            something, more
                 often than not, is leveldb. If this is the case, check
            the size of
                 your leveldb. If it is over 5 or 6GB in size, you may
            need to
                 manually compact the store (mon compact on start =
            true, iirc).
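
                 For instance (a sketch assuming the default data dir
                 location; adjust the path to your setup):

                 # check the size of the monitor's leveldb store
                 du -sh /var/lib/ceph/mon/ceph-60zxl02/store.db

                 # if it is several GB, add to ceph.conf under [mon]:
                 #   mon compact on start = true
                 # and restart the monitor so it compacts on startup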

                 HTH

                    -Joao




                      On Mon, Jul 25, 2016 at 4:10 PM, Joao Eduardo Luis
                      <joao@xxxxxxx> wrote:

                          On 07/25/2016 03:41 PM, Sergio A. de Carvalho
            Jr. wrote:

                              In the logs, these 2 monitors are constantly
                              reporting that they won the leader election:

                              60z0m02 (monitor 0):
                              2016-07-25 14:31:11.644335 7f8760af7700  0
                     log_channel(cluster) log
                              [INF] : mon.60z0m02@0 won leader election
            with quorum 0,2,4
                              2016-07-25 14:31:44.521552 7f8760af7700  1
                              mon.60z0m02@0(leader).paxos(paxos recovering c
                     1318755..1319320)
                              collect
                              timeout, calling fresh election

                              60zxl02 (monitor 1):
                              2016-07-25 14:31:59.542346 7fefdeaed700  1
                              mon.60zxl02@1(electing).elector(11441)
            init, last seen
                     epoch 11441
                              2016-07-25 14:32:04.583929 7fefdf4ee700  0
                     log_channel(cluster) log
                              [INF] : mon.60zxl02@1 won leader election
            with quorum 1,2,4
                              2016-07-25 14:32:33.440103 7fefdf4ee700  1
                              mon.60zxl02@1(leader).paxos(paxos recovering c
                     1318755..1319319)
                              collect
                              timeout, calling fresh election



                          There are two likely scenarios to explain this:

                          1. The monitors have different monitors in
            their monmaps - this
                          could happen if you didn't add the new monitor
            via 'ceph
                     mon add'.
                          You can check this by running 'ceph daemon
                          mon.<ID> mon_status' for each of the monitors in
                          the cluster (see the sketch after scenario 2).

                          2. some of the monitors are unable to
            communicate with each
                     other,
                          thus will never acknowledge the same leader.
            This does not
                     mean you
                          have two leaders for the same cluster, but it
            does mean
                     that you
                          will end up having two monitors declaring
            victory and
                     become the
                          self-proclaimed leader in the cluster. The
            peons should
                     still only
                          belong to one quorum.
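
                          A sketch of the mon_status check from scenario 1
                          (run it on each monitor host; the admin socket
                          works even without quorum):

                          ceph daemon mon.60z0m02 mon_status > /tmp/mon_status.60z0m02.json
                          # repeat for the other monitors, then compare the
                          # "monmap" section across the resulting files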

                          If this does not help you, try setting 'debug
            mon = 10' and
                     'debug
                          ms = 1' on the monitors and check the logs,
            making sure the
                     monitors
                          get the probes and follow the election
            process. If you need
                     further
                          assistance, put those logs online somewhere we
            can access
                     them and
                          we'll try to help you out.

                             -Joao



                              On Mon, Jul 25, 2016 at 3:27 PM, Sergio A. de Carvalho Jr.
                              <scarvalhojr@xxxxxxxxx> wrote:

                                   Hi,

                                   I have a cluster of 5 hosts running
            Ceph 0.94.6 on
                     CentOS
                              6.5. On
                                   each host, there is 1 monitor and 13
            OSDs. We had
                     an issue
                              with the
                                    network and for some reason (which I
                                    still don't know), the
                                    servers were restarted. One host is
            still down,
                     but the
                              monitors on
                                   the 4 remaining servers are failing
            to enter a quorum.

                                   I managed to get a quorum of 3
            monitors by
                     stopping all Ceph
                                   monitors and OSDs across all
            machines, and
                     bringing up the
                              top 3
                                   ranked monitors in order of rank.
            After a few
                     minutes, the
                              60z0m02
                                   monitor (the top ranked one) became
            the leader:

                                   {
                                        "name": "60z0m02",
                                        "rank": 0,
                                        "state": "leader",
                                        "election_epoch": 11328,
                                        "quorum": [
                                            0,
                                            1,
                                            2
                                        ],
                                        "outside_quorum": [],
                                        "extra_probe_peers": [],
                                        "sync_provider": [],
                                        "monmap": {
                                            "epoch": 5,
                                            "fsid":
                     "2f51a247-3155-4bcf-9aee-c6f6b2c5e2af",
                                            "modified": "2016-04-28
            22:26:48.604393",
                                            "created": "0.000000",
                                            "mons": [
                                                {
                                                    "rank": 0,
                                                    "name": "60z0m02",
                                                    "addr":
            "10.98.2.166:6789 <http://10.98.2.166:6789>
                     <http://10.98.2.166:6789>
                              <http://10.98.2.166:6789>
            <http://10.98.2.166:6789>\/0"
                                                },
                                                {
                                                    "rank": 1,
                                                    "name": "60zxl02",
                                                    "addr":
            "10.98.2.167:6789 <http://10.98.2.167:6789>
                     <http://10.98.2.167:6789>
                              <http://10.98.2.167:6789>
            <http://10.98.2.167:6789>\/0"
                                                },
                                                {
                                                    "rank": 2,
                                                    "name": "610wl02",
                                                    "addr":
            "10.98.2.173:6789 <http://10.98.2.173:6789>
                     <http://10.98.2.173:6789>
                              <http://10.98.2.173:6789>
            <http://10.98.2.173:6789>\/0"
                                                },
                                                {
                                                    "rank": 3,
                                                    "name": "618yl02",
                                                    "addr":
            "10.98.2.214:6789 <http://10.98.2.214:6789>
                     <http://10.98.2.214:6789>
                              <http://10.98.2.214:6789>
            <http://10.98.2.214:6789>\/0"
                                                },
                                                {
                                                    "rank": 4,
                                                    "name": "615yl02",
                                                    "addr":
            "10.98.2.216:6789 <http://10.98.2.216:6789>
                     <http://10.98.2.216:6789>
                              <http://10.98.2.216:6789>
            <http://10.98.2.216:6789>\/0"


                                                }
                                            ]
                                        }
                                   }

                                   The other 2 monitors became peons:

                                   "name": "60zxl02",
                                        "rank": 1,
                                        "state": "peon",
                                        "election_epoch": 11328,
                                        "quorum": [
                                            0,
                                            1,
                                            2
                                        ],

                                   "name": "610wl02",
                                        "rank": 2,
                                        "state": "peon",
                                        "election_epoch": 11328,
                                        "quorum": [
                                            0,
                                            1,
                                            2
                                        ],

                                   I then proceeded to start the fourth
            monitor, 615yl02
                              (618yl02 is
                                   powered off), but after more than 2
            hours and
                     several election
                                   rounds, the monitors still haven't
            reached a
                     quorum. The
                              monitors
                                   alternate mostly between "election",
            "probing"
                     states but
                              they often
                                   seem to be in different election epochs.

                                   Is this normal?

                                   Is there anything I can do to help
            the monitors
                     elect a leader?
                                   Should I manually remove the dead
            host's monitor
                     from the
                              monitor map?

                                    I left all OSD daemons stopped on purpose
                                    while the election is going on. Is this the best thing to
            do? Would
                     bringing the
                              OSDs up
                                   help or complicate matters even more?
            Or doesn't
                     it make
                              any difference?

                                   I don't see anything obviously wrong
            in the
                     monitor logs.
                              They're
                                   mostly filled with messages like the
            following:

                                   2016-07-25 14:17:57.806148
            7fc1b3f7e700  1

            mon.610wl02@2(electing).elector(11411) init, last seen
                              epoch 11411
                                   2016-07-25 14:17:57.829198
            7fc1b7caf700  0
                              log_channel(audit) log
                                   [DBG] : from='admin socket'
            entity='admin socket'
                              cmd='mon_status'
                                   args=[]: dispatch
                                   2016-07-25 14:17:57.829200
            7fc1b7caf700  0
                              log_channel(audit) do_log
                                   log to syslog
                                   2016-07-25 14:17:57.829254
            7fc1b7caf700  0
                              log_channel(audit) log
                                   [DBG] : from='admin socket'
            entity='admin socket'
                              cmd=mon_status
                                   args=[]: finished

                                   Any help would be hugely appreciated.

                                   Thanks,

                                   Sergio













_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


