Re: Monitors not reaching quorum

"Sergio A. de Carvalho Jr." <scarvalhojr@xxxxxxxxx> · Tue, 26 Jul 2016 18:27:56 +0100

(Just realised I originally replied to Sean directly, so reposting here for posterity).

Bingo!
I turned off syslog and the monitors quickly reached quorum and everything seems back to normal. Thanks so much, Sean.

Luckily this is a test cluster. I wonder how I could catch this in a production cluster before our support engineers spend a day trying to track the problem down.

Any ideas?

On Tue, Jul 26, 2016 at 12:28 PM, Sean Crosby <richardnixonshead@xxxxxxxxx> wrote:
Hi Sergio,
You don't happen to have rsyslog forwarding to a central log server by any chance? I've seen this behaviour before when my central log server is not keeping up with messages.

Cheers,
Sean

On 26 July 2016 at 21:13, Sergio A. de Carvalho Jr. <scarvalhojr@xxxxxxxxx> wrote:
I left the 4 nodes running overnight and they just crawled to their knees... to the point that nothing has been written to the logs in the last 11 hours. So I stopped all monitors this morning and started them one by one again, but they're are still being extremely slow. Here are their logs:
https://gist.github.com/anonymous/85213467f701c5a69c7fdb4e54bc7406
https://gist.github.com/anonymous/f30a8903e701423825fd4d5aaa651e6a
https://gist.github.com/anonymous/42a1856cc819de5b110d9f887e9859d2
https://gist.github.com/anonymous/652bc41197e83a9d76cf5b2e6a211aa2

I'm still puzzled to see logs being written with a timestamp that is several minutes behind the system clock. As time passes, the gap widens and quickly the logs are over 10 minutes behind the actual time, which explains why the logs  above don't seem to overlap.

On Mon, Jul 25, 2016 at 9:37 PM, Sergio A. de Carvalho Jr. <scarvalhojr@xxxxxxxxx> wrote:
Awesome, thanks so much, Joao.
Here's the mon_status:
https://gist.github.com/anonymous/2b80a9a75d134d9e539dfbc81615c055

I'm still trying to collect the logs, but while doing that I noticed that the log records are severely delayed compared to the system clock. For example, watching the logs with tail -f, I see records with a timestamp that is up to 28 minutes behind the system clock!

Also, while trying to set debug level, the monitors sometimes hung for several minutes, so there's obviously something wrong with them.

On Mon, Jul 25, 2016 at 6:16 PM, Joao Eduardo Luis <joao@xxxxxxx> wrote:
On 07/25/2016 05:55 PM, Sergio A. de Carvalho Jr. wrote:

I just forced an NTP updated on all hosts to be sure it's down to clock

skew. I also checked that hosts can reach all other hosts on port 6789.

I then stopped monitor 0 (60z0m02) and started monitor 1 (60zxl02), but

the 3 monitors left (1 - 60zxl02, 2 - 610wl02, 4 - 615yl02) were still

having problems to reach quorum. That led me to believe monitor 4 was

the problem because I had a quorum before with monitors 0, 1, 2.

So I stopped monitor 4 and started monitor 0 again, but this time

monitors 0, 1, 2 failed to reach a quorum, which is rather puzzling.

All hosts are pretty much idle all the time so I can't see why monitors

would be getting stuck.

Grab 'ceph daemon mon.<ID> mon_status' from all monitors, set 'debug mon = 10', 'debug paxos = 10', 'debug ms = 1', grab logs from the monitors for the periods before, during and after election (ideally for at least two election cycles). Put them up somewhere on the interwebs and send us a link.

If you don't feel comfortable putting that link in the list, send me an email directly with the url.

I'll be happy to take a look later tonight.

  -Joao

On Mon, Jul 25, 2016 at 5:18 PM, Joao Eduardo Luis <joao@xxxxxxx

<mailto:joao@xxxxxxx>> wrote:

    On 07/25/2016 04:34 PM, Sergio A. de Carvalho Jr. wrote:

        Thanks, Joao.

        All monitors have the exact same mom map.

        I suspect you're right that there might be some communication

        problem

        though. I stopped monitor 1 (60zxl02), but the other 3 monitors

        still

        failed to reach a quorum. I could see monitor 0 was still declaring

        victory but the others were always calling for new elections:

        2016-07-25 15:18:59.775144 7f8760af7700  0 log_channel(cluster) log

        [INF] : mon.60z0m02@0 won leader election with quorum 0,2,4

        2016-07-25 15:18:54.702176 7fc1b357d700  1

        mon.610wl02@2(electing) e5

        handle_timecheck drop unexpected msg

        2016-07-25 15:18:54.704526 7fc1b357d700  1

        mon.610wl02@2(electing).data_health(11626) service_dispatch not in

        quorum -- drop message

        2016-07-25 15:19:09.792511 <tel:09.792511> 7fc1b3f7e700  1

        mon.610wl02@2(peon).paxos(paxos recovering c 1318755..1319322)

        lease_timeout -- calling new election

        2016-07-25 15:19:09.792825 <tel:09.792825> 7fc1b357d700  0

        log_channel(cluster) log

        [INF] : mon.610wl02 calling new monitor election

        I'm curious about the "handle_timecheck drop unexpected msg"

        message.

    timechecks (i.e., checking for clock skew), as well as the

    data_health service (which makes sure you have enough disk space in

    the mon data dir) are only run when you have a quorum. If a message

    is received by a monitor not in a quorum, regardless of state, it

    will be dropped.

    Assuming you know took one of the self-appointed leaders out - by

    shutting it down, for instance -, you should now check what's

    causing elections not to hold.

    In these cases, assuming your 3 monitors do form a quorum, the

    traditional issue tends to be 'lease timeouts'. I.e., the leader

    fails to provide a lease extension on paxos for the peons, and the

    peons assume the leader failed in some form (unresponsive, down,

    whatever).

    Above it does seem a lease timeout was triggered on a peon. This may

    have happened because:

    1. leader did not extend the lease

    2. leader did extend the lease but lease was in the past - usually

    indication of a clock skew on the leader, on the peons, or both.

    3. leader did extend the lease, it was with the correct time but

    peon failed to dispatch the message on time.

    Both 1. and 2. may be due to several factors, but most commonly it's

    because the monitor was stuck doing something. This something, more

    often than not, is leveldb. If this is the case, check the size of

    your leveldb. If it is over 5 or 6GB in size, you may need to

    manually compact the store (mon compact on start = true, iirc).

    HTH

       -Joao

        On Mon, Jul 25, 2016 at 4:10 PM, Joao Eduardo Luis <joao@xxxxxxx

        <mailto:joao@xxxxxxx>

        <mailto:joao@xxxxxxx <mailto:joao@xxxxxxx>>> wrote:

             On 07/25/2016 03:41 PM, Sergio A. de Carvalho Jr. wrote:

                 In the logs, there 2 monitors are constantly reporting

        that they

                 won the

                 leader election:

                 60z0m02 (monitor 0):

                 2016-07-25 14:31:11.644335 7f8760af7700  0

        log_channel(cluster) log

                 [INF] : mon.60z0m02@0 won leader election with quorum 0,2,4

                 2016-07-25 14:31:44.521552 7f8760af7700  1

                 mon.60z0m02@0(leader).paxos(paxos recovering c

        1318755..1319320)

                 collect

                 timeout, calling fresh election

                 60zxl02 (monitor 1):

                 2016-07-25 14:31:59.542346 7fefdeaed700  1

                 mon.60zxl02@1(electing).elector(11441) init, last seen

        epoch 11441

                 2016-07-25 14:32:04.583929 7fefdf4ee700  0

        log_channel(cluster) log

                 [INF] : mon.60zxl02@1 won leader election with quorum 1,2,4

                 2016-07-25 14:32:33.440103 7fefdf4ee700  1

                 mon.60zxl02@1(leader).paxos(paxos recovering c

        1318755..1319319)

                 collect

                 timeout, calling fresh election

             There are two likely scenarios to explain this:

             1. The monitors have different monitors in their monmaps - this

             could happen if you didn't add the new monitor via 'ceph

        mon add'.

             You can check this by running 'ceph daemon mon.<ID>

        mon_status' for

             each of the monitors in the cluster.

             2. some of the monitors are unable to communicate with each

        other,

             thus will never acknowledge the same leader. This does not

        mean you

             have two leaders for the same cluster, but it does mean

        that you

             will end up having two monitors declaring victory and

        become the

             self-proclaimed leader in the cluster. The peons should

        still only

             belong to one quorum.

             If this does not help you, try setting 'debug mon = 10' and

        'debug

             ms = 1' on the monitors and check the logs, making sure the

        monitors

             get the probes and follow the election process. If you need

        further

             assistance, put those logs online somewhere we can access

        them and

             we'll try to help you out.

                -Joao

                 On Mon, Jul 25, 2016 at 3:27 PM, Sergio A. de Carvalho Jr.

                 <scarvalhojr@xxxxxxxxx <mailto:scarvalhojr@xxxxxxxxx>

        <mailto:scarvalhojr@xxxxxxxxx <mailto:scarvalhojr@xxxxxxxxx>>

                 <mailto:scarvalhojr@xxxxxxxxx

        <mailto:scarvalhojr@xxxxxxxxx> <mailto:scarvalhojr@xxxxxxxxx

        <mailto:scarvalhojr@xxxxxxxxx>>>>

                 wrote:

                      Hi,

                      I have a cluster of 5 hosts running Ceph 0.94.6 on

        CentOS

                 6.5. On

                      each host, there is 1 monitor and 13 OSDs. We had

        an issue

                 with the

                      network and for some reason (which I still don't

        know why), the

                      servers were restarted. One host is still down,

        but the

                 monitors on

                      the 4 remaining servers are failing to enter a quorum.

                      I managed to get a quorum of 3 monitors by

        stopping all Ceph

                      monitors and OSDs across all machines, and

        bringing up the

                 top 3

                      ranked monitors in order of rank. After a few

        minutes, the

                 60z0m02

                      monitor (the top ranked one) became the leader:

                      {

                           "name": "60z0m02",

                           "rank": 0,

                           "state": "leader",

                           "election_epoch": 11328,

                           "quorum": [

                               0,

                               1,

                               2

                           ],

                           "outside_quorum": [],

                           "extra_probe_peers": [],

                           "sync_provider": [],

                           "monmap": {

                               "epoch": 5,

                               "fsid":

        "2f51a247-3155-4bcf-9aee-c6f6b2c5e2af",

                               "modified": "2016-04-28 22:26:48.604393",

                               "created": "0.000000",

                               "mons": [

                                   {

                                       "rank": 0,

                                       "name": "60z0m02",

                                       "addr": "10.98.2.166:6789

        <http://10.98.2.166:6789>

                 <http://10.98.2.166:6789> <http://10.98.2.166:6789>\/0"

                                   },

                                   {

                                       "rank": 1,

                                       "name": "60zxl02",

                                       "addr": "10.98.2.167:6789

        <http://10.98.2.167:6789>

                 <http://10.98.2.167:6789> <http://10.98.2.167:6789>\/0"

                                   },

                                   {

                                       "rank": 2,

                                       "name": "610wl02",

                                       "addr": "10.98.2.173:6789

        <http://10.98.2.173:6789>

                 <http://10.98.2.173:6789> <http://10.98.2.173:6789>\/0"

                                   },

                                   {

                                       "rank": 3,

                                       "name": "618yl02",

                                       "addr": "10.98.2.214:6789

        <http://10.98.2.214:6789>

                 <http://10.98.2.214:6789> <http://10.98.2.214:6789>\/0"

                                   },

                                   {

                                       "rank": 4,

                                       "name": "615yl02",

                                       "addr": "10.98.2.216:6789

        <http://10.98.2.216:6789>

                 <http://10.98.2.216:6789> <http://10.98.2.216:6789>\/0"

                                   }

                               ]

                           }

                      }

                      The other 2 monitors became peons:

                      "name": "60zxl02",

                           "rank": 1,

                           "state": "peon",

                           "election_epoch": 11328,

                           "quorum": [

                               0,

                               1,

                               2

                           ],

                      "name": "610wl02",

                           "rank": 2,

                           "state": "peon",

                           "election_epoch": 11328,

                           "quorum": [

                               0,

                               1,

                               2

                           ],

                      I then proceeded to start the fourth monitor, 615yl02

                 (618yl02 is

                      powered off), but after more than 2 hours and

        several election

                      rounds, the monitors still haven't reached a

        quorum. The

                 monitors

                      alternate mostly between "election", "probing"

        states but

                 they often

                      seem to be in different election epochs.

                      Is this normal?

                      Is there anything I can do to help the monitors

        elect a leader?

                      Should I manually remove the dead host's monitor

        from the

                 monitor map?

                      I left all OSD daemons stopped while the election

        is going on

                      purpose. Is this the best thing to do? Would

        bringing the

                 OSDs up

                      help or complicate matters even more? Or doesn't

        it make

                 any difference?

                      I don't see anything obviously wrong in the

        monitor logs.

                 They're

                      mostly filled with messages like the following:

                      2016-07-25 14:17:57.806148 7fc1b3f7e700  1

                      mon.610wl02@2(electing).elector(11411) init, last seen

                 epoch 11411

                      2016-07-25 14:17:57.829198 7fc1b7caf700  0

                 log_channel(audit) log

                      [DBG] : from='admin socket' entity='admin socket'

                 cmd='mon_status'

                      args=[]: dispatch

                      2016-07-25 14:17:57.829200 7fc1b7caf700  0

                 log_channel(audit) do_log

                      log to syslog

                      2016-07-25 14:17:57.829254 7fc1b7caf700  0

                 log_channel(audit) log

                      [DBG] : from='admin socket' entity='admin socket'

                 cmd=mon_status

                      args=[]: finished

                      Any help would be hugely appreciated.

                      Thanks,

                      Sergio

                 _______________________________________________

                 ceph-users mailing list

        ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx>

        <mailto:ceph-users@xxxxxxxxxxxxxx

        <mailto:ceph-users@xxxxxxxxxxxxxx>>

        http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

             _______________________________________________

             ceph-users mailing list

        ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx>

        <mailto:ceph-users@xxxxxxxxxxxxxx

        <mailto:ceph-users@xxxxxxxxxxxxxx>>

        http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________

ceph-users mailing list

ceph-users@xxxxxxxxxxxxxx

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com