Re: Monitors not reaching quorum

Sean Crosby <richardnixonshead@xxxxxxxxx> · Wed, 27 Jul 2016 21:04:29 +1000

Oh, my problems weren't on Ceph nodes. I've seen this problem on non-Ceph nodes. The symptoms you had of unexplained weirdness with services (in your case, Ceph), and syslog lagging 10mins behind just reminded me of symptoms I've seen before where the sending of syslog messages to a central syslog server got stuck, and caused unusual problems on the host.
Cheers,
Sean

On 27 July 2016 at 20:57, Sergio A. de Carvalho Jr. <scarvalhojr@xxxxxxxxx> wrote:
In my case, everything else running on the host seems to be okay. I'm wondering if the other problems you see aren't a side-effect of Ceph services running slow?
What do you do to get around the problem when it happens? Disable syslog in Ceph?

What version of Ceph and OS are you using?

On Wed, Jul 27, 2016 at 12:47 AM, Sean Crosby <scrosby@xxxxxxxxxxxxxx> wrote:
Agreed. When I first had these problems, random stuff would just not work. SSH would take a while to log in, DNS server would process requests slow, our Batch system would freeze and not run jobs. It's now one of my first things to check when services are running weirdly.
My failsafe check is to do

# logger "sean test"

and see if it appears in syslog. If it doesn't do it immediately, I have a problem

Cheers,
Sean

On 27 July 2016 at 04:01, Sergio A. de Carvalho Jr. <scarvalhojr@xxxxxxxxx> wrote:
The funny thing is that I just restarted the rsyslog daemon on the Ceph hosts and I can now re-enable syslog for Ceph without any issues. It just looks like the rsyslog service had a hiccup, possibly related to problem on one of the central syslog servers, and this in turn prevent the monitors to operate normally.
It's just scary to think that your logging daemon can cause so much damage!

On Tue, Jul 26, 2016 at 6:48 PM, Joao Eduardo Luis <joao@xxxxxxx> wrote:
On 07/26/2016 06:27 PM, Sergio A. de Carvalho Jr. wrote:

(Just realised I originally replied to Sean directly, so reposting here

for posterity).

Bingo!

wow. This didn't even cross my mind. D:

Thanks for sharing.

I turned off syslog and the monitors quickly reached quorum and

everything seems back to normal. Thanks so much, Sean.

Luckily this is a test cluster. I wonder how I could catch this in a

production cluster before our support engineers spend a day trying to

track the problem down.

Only way I can see to deal with this sort of thing would be to log to syslog on a separate thread and have said thread monitoring the latency when writing to syslog.

I don't think currently there's any support for that. I'll try to get something concocted this week, mostly for the fun of it.

  -Joao

Any ideas?

On Tue, Jul 26, 2016 at 12:28 PM, Sean Crosby

<richardnixonshead@xxxxxxxxx <mailto:richardnixonshead@xxxxxxxxx>> wrote:

    Hi Sergio,

    You don't happen to have rsyslog forwarding to a central log server

    by any chance? I've seen this behaviour before when my central log

    server is not keeping up with messages.

    Cheers,

    Sean

    On 26 July 2016 at 21:13, Sergio A. de Carvalho Jr.

    <scarvalhojr@xxxxxxxxx <mailto:scarvalhojr@xxxxxxxxx>> wrote:

        I left the 4 nodes running overnight and they just crawled to

        their knees... to the point that nothing has been written to the

        logs in the last 11 hours. So I stopped all monitors this

        morning and started them one by one again, but they're are still

        being extremely slow. Here are their logs:

        <https://gist.github.com/anonymous/85213467f701c5a69c7fdb4e54bc7406><https://gist.github.com/anonymous/85213467f701c5a69c7fdb4e54bc7406><https://gist.github.com/anonymous/85213467f701c5a69c7fdb4e54bc7406>https://gist.github.com/anonymous/85213467f701c5a69c7fdb4e54bc7406

        https://gist.github.com/anonymous/f30a8903e701423825fd4d5aaa651e6a

        https://gist.github.com/anonymous/42a1856cc819de5b110d9f887e9859d2

        https://gist.github.com/anonymous/652bc41197e83a9d76cf5b2e6a211aa2

        I'm still puzzled to see logs being written with a timestamp

        that is several minutes behind the system clock. As time passes,

        the gap widens and quickly the logs are over 10 minutes behind

        the actual time, which explains why the logs  above don't seem

        to overlap.

        On Mon, Jul 25, 2016 at 9:37 PM, Sergio A. de Carvalho Jr.

        <scarvalhojr@xxxxxxxxx <mailto:scarvalhojr@xxxxxxxxx>> wrote:

            Awesome, thanks so much, Joao.

            Here's the mon_status:

            https://gist.github.com/anonymous/2b80a9a75d134d9e539dfbc81615c055

            I'm still trying to collect the logs, but while doing that I

            noticed that the log records are severely delayed compared

            to the system clock. For example, watching the logs with

            tail -f, I see records with a timestamp that is up to 28

            minutes behind the system clock!

            Also, while trying to set debug level, the monitors

            sometimes hung for several minutes, so there's obviously

            something wrong with them.

            On Mon, Jul 25, 2016 at 6:16 PM, Joao Eduardo Luis

            <joao@xxxxxxx <mailto:joao@xxxxxxx>> wrote:

                On 07/25/2016 05:55 PM, Sergio A. de Carvalho Jr. wrote:

                    I just forced an NTP updated on all hosts to be sure

                    it's down to clock

                    skew. I also checked that hosts can reach all other

                    hosts on port 6789.

                    I then stopped monitor 0 (60z0m02) and started

                    monitor 1 (60zxl02), but

                    the 3 monitors left (1 - 60zxl02, 2 - 610wl02, 4 -

                    615yl02) were still

                    having problems to reach quorum. That led me to

                    believe monitor 4 was

                    the problem because I had a quorum before with

                    monitors 0, 1, 2.

                    So I stopped monitor 4 and started monitor 0 again,

                    but this time

                    monitors 0, 1, 2 failed to reach a quorum, which is

                    rather puzzling.

                    All hosts are pretty much idle all the time so I

                    can't see why monitors

                    would be getting stuck.

                Grab 'ceph daemon mon.<ID> mon_status' from all

                monitors, set 'debug mon = 10', 'debug paxos = 10',

                'debug ms = 1', grab logs from the monitors for the

                periods before, during and after election (ideally for

                at least two election cycles). Put them up somewhere on

                the interwebs and send us a link.

                If you don't feel comfortable putting that link in the

                list, send me an email directly with the url.

                I'll be happy to take a look later tonight.

                   -Joao

                    On Mon, Jul 25, 2016 at 5:18 PM, Joao Eduardo Luis

                    <joao@xxxxxxx <mailto:joao@xxxxxxx>

                    <mailto:joao@xxxxxxx <mailto:joao@xxxxxxx>>> wrote:

                         On 07/25/2016 04:34 PM, Sergio A. de Carvalho

                    Jr. wrote:

                             Thanks, Joao.

                             All monitors have the exact same mom map.

                             I suspect you're right that there might be

                    some communication

                             problem

                             though. I stopped monitor 1 (60zxl02), but

                    the other 3 monitors

                             still

                             failed to reach a quorum. I could see

                    monitor 0 was still declaring

                             victory but the others were always calling

                    for new elections:

                             2016-07-25 15:18:59.775144 7f8760af7700  0

                    log_channel(cluster) log

                             [INF] : mon.60z0m02@0 won leader election

                    with quorum 0,2,4

                             2016-07-25 15:18:54.702176 7fc1b357d700  1

                             mon.610wl02@2(electing) e5

                             handle_timecheck drop unexpected msg

                             2016-07-25 15:18:54.704526 7fc1b357d700  1

                             mon.610wl02@2(electing).data_health(11626)

                    service_dispatch not in

                             quorum -- drop message

                             2016-07-25 15:19:09.792511 <tel:09.792511>

                    <tel:09.792511 <tel:09.792511>> 7fc1b3f7e700  1

                             mon.610wl02@2(peon).paxos(paxos recovering

                    c 1318755..1319322)

                             lease_timeout -- calling new election

                             2016-07-25 15:19:09.792825 <tel:09.792825>

                    <tel:09.792825 <tel:09.792825>> 7fc1b357d700  0

                             log_channel(cluster) log

                             [INF] : mon.610wl02 calling new monitor

                    election

                             I'm curious about the "handle_timecheck

                    drop unexpected msg"

                             message.

                         timechecks (i.e., checking for clock skew), as

                    well as the

                         data_health service (which makes sure you have

                    enough disk space in

                         the mon data dir) are only run when you have a

                    quorum. If a message

                         is received by a monitor not in a quorum,

                    regardless of state, it

                         will be dropped.

                         Assuming you know took one of the

                    self-appointed leaders out - by

                         shutting it down, for instance -, you should

                    now check what's

                         causing elections not to hold.

                         In these cases, assuming your 3 monitors do

                    form a quorum, the

                         traditional issue tends to be 'lease timeouts'.

                    I.e., the leader

                         fails to provide a lease extension on paxos for

                    the peons, and the

                         peons assume the leader failed in some form

                    (unresponsive, down,

                         whatever).

                         Above it does seem a lease timeout was

                    triggered on a peon. This may

                         have happened because:

                         1. leader did not extend the lease

                         2. leader did extend the lease but lease was in

                    the past - usually

                         indication of a clock skew on the leader, on

                    the peons, or both.

                         3. leader did extend the lease, it was with the

                    correct time but

                         peon failed to dispatch the message on time.

                         Both 1. and 2. may be due to several factors,

                    but most commonly it's

                         because the monitor was stuck doing something.

                    This something, more

                         often than not, is leveldb. If this is the

                    case, check the size of

                         your leveldb. If it is over 5 or 6GB in size,

                    you may need to

                         manually compact the store (mon compact on

                    start = true, iirc).

                         HTH

                            -Joao

                             On Mon, Jul 25, 2016 at 4:10 PM, Joao

                    Eduardo Luis <joao@xxxxxxx <mailto:joao@xxxxxxx>

                             <mailto:joao@xxxxxxx <mailto:joao@xxxxxxx>>

                             <mailto:joao@xxxxxxx <mailto:joao@xxxxxxx>

                    <mailto:joao@xxxxxxx <mailto:joao@xxxxxxx>>>> wrote:

                                  On 07/25/2016 03:41 PM, Sergio A. de

                    Carvalho Jr. wrote:

                                      In the logs, there 2 monitors are

                    constantly reporting

                             that they

                                      won the

                                      leader election:

                                      60z0m02 (monitor 0):

                                      2016-07-25 14:31:11.644335

                    7f8760af7700  0

                             log_channel(cluster) log

                                      [INF] : mon.60z0m02@0 won leader

                    election with quorum 0,2,4

                                      2016-07-25 14:31:44.521552

                    7f8760af7700  1

                                      mon.60z0m02@0(leader).paxos(paxos

                    recovering c

                             1318755..1319320)

                                      collect

                                      timeout, calling fresh election

                                      60zxl02 (monitor 1):

                                      2016-07-25 14:31:59.542346

                    7fefdeaed700  1

                      mon.60zxl02@1(electing).elector(11441) init, last seen

                             epoch 11441

                                      2016-07-25 14:32:04.583929

                    7fefdf4ee700  0

                             log_channel(cluster) log

                                      [INF] : mon.60zxl02@1 won leader

                    election with quorum 1,2,4

                                      2016-07-25 14:32:33.440103

                    7fefdf4ee700  1

                                      mon.60zxl02@1(leader).paxos(paxos

                    recovering c

                             1318755..1319319)

                                      collect

                                      timeout, calling fresh election

                                  There are two likely scenarios to

                    explain this:

                                  1. The monitors have different

                    monitors in their monmaps - this

                                  could happen if you didn't add the new

                    monitor via 'ceph

                             mon add'.

                                  You can check this by running 'ceph

                    daemon mon.<ID>

                             mon_status' for

                                  each of the monitors in the cluster.

                                  2. some of the monitors are unable to

                    communicate with each

                             other,

                                  thus will never acknowledge the same

                    leader. This does not

                             mean you

                                  have two leaders for the same cluster,

                    but it does mean

                             that you

                                  will end up having two monitors

                    declaring victory and

                             become the

                                  self-proclaimed leader in the cluster.

                    The peons should

                             still only

                                  belong to one quorum.

                                  If this does not help you, try setting

                    'debug mon = 10' and

                             'debug

                                  ms = 1' on the monitors and check the

                    logs, making sure the

                             monitors

                                  get the probes and follow the election

                    process. If you need

                             further

                                  assistance, put those logs online

                    somewhere we can access

                             them and

                                  we'll try to help you out.

                                     -Joao

                                      On Mon, Jul 25, 2016 at 3:27 PM,

                    Sergio A. de Carvalho Jr.

                                      <scarvalhojr@xxxxxxxxx

                    <mailto:scarvalhojr@xxxxxxxxx>

                    <mailto:scarvalhojr@xxxxxxxxx

                    <mailto:scarvalhojr@xxxxxxxxx>>

                             <mailto:scarvalhojr@xxxxxxxxx

                    <mailto:scarvalhojr@xxxxxxxxx>

                    <mailto:scarvalhojr@xxxxxxxxx

                    <mailto:scarvalhojr@xxxxxxxxx>>>

                                      <mailto:scarvalhojr@xxxxxxxxx

                    <mailto:scarvalhojr@xxxxxxxxx>

                             <mailto:scarvalhojr@xxxxxxxxx

                    <mailto:scarvalhojr@xxxxxxxxx>>

                    <mailto:scarvalhojr@xxxxxxxxx

                    <mailto:scarvalhojr@xxxxxxxxx>

                             <mailto:scarvalhojr@xxxxxxxxx

                    <mailto:scarvalhojr@xxxxxxxxx>>>>>

                                      wrote:

                                           Hi,

                                           I have a cluster of 5 hosts

                    running Ceph 0.94.6 on

                             CentOS

                                      6.5. On

                                           each host, there is 1 monitor

                    and 13 OSDs. We had

                             an issue

                                      with the

                                           network and for some reason

                    (which I still don't

                             know why), the

                                           servers were restarted. One

                    host is still down,

                             but the

                                      monitors on

                                           the 4 remaining servers are

                    failing to enter a quorum.

                                           I managed to get a quorum of

                    3 monitors by

                             stopping all Ceph

                                           monitors and OSDs across all

                    machines, and

                             bringing up the

                                      top 3

                                           ranked monitors in order of

                    rank. After a few

                             minutes, the

                                      60z0m02

                                           monitor (the top ranked one)

                    became the leader:

                                           {

                                                "name": "60z0m02",

                                                "rank": 0,

                                                "state": "leader",

                                                "election_epoch": 11328,

                                                "quorum": [

                                                    0,

                                                    1,

                                                    2

                                                ],

                                                "outside_quorum": [],

                                                "extra_probe_peers": [],

                                                "sync_provider": [],

                                                "monmap": {

                                                    "epoch": 5,

                                                    "fsid":

                             "2f51a247-3155-4bcf-9aee-c6f6b2c5e2af",

                                                    "modified":

                    "2016-04-28 22:26:48.604393",

                                                    "created": "0.000000",

                                                    "mons": [

                                                        {

                                                            "rank": 0,

                                                            "name":

                    "60z0m02",

                                                            "addr":

                    "10.98.2.166:6789 <http://10.98.2.166:6789>

                             <http://10.98.2.166:6789>

                                      <http://10.98.2.166:6789>

                    <http://10.98.2.166:6789>\/0"

                                                        },

                                                        {

                                                            "rank": 1,

                                                            "name":

                    "60zxl02",

                                                            "addr":

                    "10.98.2.167:6789 <http://10.98.2.167:6789>

                             <http://10.98.2.167:6789>

                                      <http://10.98.2.167:6789>

                    <http://10.98.2.167:6789>\/0"

                                                        },

                                                        {

                                                            "rank": 2,

                                                            "name":

                    "610wl02",

                                                            "addr":

                    "10.98.2.173:6789 <http://10.98.2.173:6789>

                             <http://10.98.2.173:6789>

                                      <http://10.98.2.173:6789>

                    <http://10.98.2.173:6789>\/0"

                                                        },

                                                        {

                                                            "rank": 3,

                                                            "name":

                    "618yl02",

                                                            "addr":

                    "10.98.2.214:6789 <http://10.98.2.214:6789>

                             <http://10.98.2.214:6789>

                                      <http://10.98.2.214:6789>

                    <http://10.98.2.214:6789>\/0"

                                                        },

                                                        {

                                                            "rank": 4,

                                                            "name":

                    "615yl02",

                                                            "addr":

                    "10.98.2.216:6789 <http://10.98.2.216:6789>

                             <http://10.98.2.216:6789>

                                      <http://10.98.2.216:6789>

                    <http://10.98.2.216:6789>\/0"

                                                        }

                                                    ]

                                                }

                                           }

                                           The other 2 monitors became

                    peons:

                                           "name": "60zxl02",

                                                "rank": 1,

                                                "state": "peon",

                                                "election_epoch": 11328,

                                                "quorum": [

                                                    0,

                                                    1,

                                                    2

                                                ],

                                           "name": "610wl02",

                                                "rank": 2,

                                                "state": "peon",

                                                "election_epoch": 11328,

                                                "quorum": [

                                                    0,

                                                    1,

                                                    2

                                                ],

                                           I then proceeded to start the

                    fourth monitor, 615yl02

                                      (618yl02 is

                                           powered off), but after more

                    than 2 hours and

                             several election

                                           rounds, the monitors still

                    haven't reached a

                             quorum. The

                                      monitors

                                           alternate mostly between

                    "election", "probing"

                             states but

                                      they often

                                           seem to be in different

                    election epochs.

                                           Is this normal?

                                           Is there anything I can do to

                    help the monitors

                             elect a leader?

                                           Should I manually remove the

                    dead host's monitor

                             from the

                                      monitor map?

                                           I left all OSD daemons

                    stopped while the election

                             is going on

                                           purpose. Is this the best

                    thing to do? Would

                             bringing the

                                      OSDs up

                                           help or complicate matters

                    even more? Or doesn't

                             it make

                                      any difference?

                                           I don't see anything

                    obviously wrong in the

                             monitor logs.

                                      They're

                                           mostly filled with messages

                    like the following:

                                           2016-07-25 14:17:57.806148

                    7fc1b3f7e700  1

                    mon.610wl02@2(electing).elector(11411) init, last seen

                                      epoch 11411

                                           2016-07-25 14:17:57.829198

                    7fc1b7caf700  0

                                      log_channel(audit) log

                                           [DBG] : from='admin socket'

                    entity='admin socket'

                                      cmd='mon_status'

                                           args=[]: dispatch

                                           2016-07-25 14:17:57.829200

                    7fc1b7caf700  0

                                      log_channel(audit) do_log

                                           log to syslog

                                           2016-07-25 14:17:57.829254

                    7fc1b7caf700  0

                                      log_channel(audit) log

                                           [DBG] : from='admin socket'

                    entity='admin socket'

                                      cmd=mon_status

                                           args=[]: finished

                                           Any help would be hugely

                    appreciated.

                                           Thanks,

                                           Sergio

                      _______________________________________________

                                      ceph-users mailing list

                    ceph-users@xxxxxxxxxxxxxx

                    <mailto:ceph-users@xxxxxxxxxxxxxx>

                    <mailto:ceph-users@xxxxxxxxxxxxxx

                    <mailto:ceph-users@xxxxxxxxxxxxxx>>

                             <mailto:ceph-users@xxxxxxxxxxxxxx

                    <mailto:ceph-users@xxxxxxxxxxxxxx>

                             <mailto:ceph-users@xxxxxxxxxxxxxx

                    <mailto:ceph-users@xxxxxxxxxxxxxx>>>

                    http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

                      _______________________________________________

                                  ceph-users mailing list

                    ceph-users@xxxxxxxxxxxxxx

                    <mailto:ceph-users@xxxxxxxxxxxxxx>

                    <mailto:ceph-users@xxxxxxxxxxxxxx

                    <mailto:ceph-users@xxxxxxxxxxxxxx>>

                             <mailto:ceph-users@xxxxxxxxxxxxxx

                    <mailto:ceph-users@xxxxxxxxxxxxxx>

                             <mailto:ceph-users@xxxxxxxxxxxxxx

                    <mailto:ceph-users@xxxxxxxxxxxxxx>>>

                    http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

        _______________________________________________

        ceph-users mailing list

        ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx>

        http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________

ceph-users mailing list

ceph-users@xxxxxxxxxxxxxx

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________

ceph-users mailing list

ceph-users@xxxxxxxxxxxxxx

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________

ceph-users mailing list

ceph-users@xxxxxxxxxxxxxx

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Sean Crosby
Research Computing

ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics | University of Melbourne VIC 3010 Australia

_______________________________________________

ceph-users mailing list

ceph-users@xxxxxxxxxxxxxx

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com