Re: Luminous cluster stuck when adding monitor

Nico Schottelius <nico.schottelius@xxxxxxxxxxx> · Sat, 07 Oct 2017 22:52:32 +0200

Good evening Joao,

we double-checked our MTUs: they are all 9200 on the servers and 9212 on
the switches. We also have no problems transferring big files in general
(OpenNebula copies images around for importing, so we do this quite a
lot).
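
One way to confirm the MTU end to end is a ping with fragmentation
prohibited, sized to fill exactly one packet. Large TCP transfers alone may
not prove it, since TCP can adapt its segment size. The sketch below only
prints the commands. It assumes iputils ping (`-M do`, `-s`, `-c`; older
versions ship the IPv6 variant as a separate `ping6`), and the peer names
server2 and server3 are placeholders.

```shell
#!/bin/sh
# Largest ICMPv6 echo payload that fits in one packet of the given MTU:
# MTU minus the 40-byte IPv6 header minus the 8-byte ICMPv6 header.
icmp6_payload() {
    echo $(( $1 - 40 - 8 ))
}

# Print (not run) a don't-fragment ping command for one peer.
mtu_probe_cmd() {
    mtu="$1"; peer="$2"
    echo "ping -6 -M do -c 3 -s $(icmp6_payload "$mtu") $peer"
}

# server2 and server3 are placeholder peer names (assumption).
for peer in server2 server3; do
    mtu_probe_cmd 9200 "$peer"
done
```

If the 9152-byte probe succeeds between all monitor hosts and a probe one
byte larger is rejected, the 9200 MTU holds along the path.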

So if you could have a look, it would be much appreciated.

If we should collect other logs, just let us know.
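
For example, the state and debug logs of all three monitors could be
gathered for the same time window. A minimal sketch, assuming the admin
socket is available on each monitor host and that the third monitor is
called server3 (a placeholder name). The admin socket is queried locally, so
it should answer even while `ceph -s` hangs. The script only prints the
commands to run on each host.

```shell
#!/bin/sh
# Print (not run) the admin-socket commands for one monitor id:
# query its current state, then raise monitor debug logging.
mon_debug_cmds() {
    id="$1"
    echo "ceph daemon mon.$id mon_status"
    echo "ceph daemon mon.$id config set debug_mon 20/20"
}

# server3 is a placeholder for the third monitor (assumption).
for id in server1 server2 server3; do
    mon_debug_cmds "$id"
done
```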

Best,

Nico

Joao Eduardo Luis <joao@xxxxxxx> writes:

> On 10/04/2017 09:19 PM, Gregory Farnum wrote:
>> Oh, hmm, you're right. I see synchronization starts but it seems to
>> progress very slowly, and it certainly doesn't complete in that 2.5
>> minute logging window. I don't see any clear reason why it's so
>> slow; it might be clearer if you could provide logs of the other
>> monitors at the same time (especially since you now say they are getting
>> stuck in the electing state during that period). Perhaps Kefu or
>> Joao will have some clearer idea what the problem is.
>> -Greg
>
> I haven't gone through logs yet (maybe Friday, it's late today and
> it's a holiday tomorrow), but not so long ago I seem to recall someone
> having a similar issue with the monitors that was solely related to a
> switch's MTU being too small.
>
> Maybe that could be the case? If not, I'll take a look at the logs as
> soon as possible.
>
>   -Joao
>
>>
>> On Wed, Oct 4, 2017 at 1:04 PM Nico Schottelius
>> <nico.schottelius@xxxxxxxxxxx> wrote:
>>
>>
>>     Some more detail:
>>
>>     when restarting the monitor on server1, it stays in synchronizing state
>>     forever.
>>
>>     However the other two monitors change into electing state.
>>
>>     I have double-checked that there are no (host) firewalls active and
>>     that the hosts' clocks are within 1 second of each other (they all have
>>     ntpd running).
>>
>>     We are running everything on IPv6, but this should not be a problem,
>>     should it?
>>
>>     Best,
>>
>>     Nico
>>
>>
>>     Nico Schottelius <nico.schottelius@xxxxxxxxxxx> writes:
>>
>>      > Hello Gregory,
>>      >
>>      > the logfile I produced already has debug mon = 20 set:
>>      >
>>      > [21:03:51] server1:~# grep "debug mon" /etc/ceph/ceph.conf
>>      > debug mon = 20
>>      >
>>      > It is clear that server1 is out of quorum; however, how do we make it
>>      > part of the quorum again?
>>      >
>>      > I expected that the quorum finding process is triggered automatically
>>      > after restarting the monitor, or is that incorrect?
>>      >
>>      > Best,
>>      >
>>      > Nico
>>      >
>>      >
>>      > Gregory Farnum <gfarnum@xxxxxxxxxx> writes:
>>      >
>>      >> You'll need to change the config so that it's running "debug mon = 20"
>>      >> for the log to be very useful here. It does say that it's dropping
>>      >> client connections because it's been out of quorum for too long, which
>>      >> is the correct behavior in general. I'd imagine that you've got clients
>>      >> trying to connect to the new monitor instead of the ones already in the
>>      >> quorum and not passing around correctly; this is all configurable.
>>      >>
>>      >> On Wed, Oct 4, 2017 at 4:09 AM Nico Schottelius
>>      >> <nico.schottelius@xxxxxxxxxxx> wrote:
>>      >>
>>      >>>
>>      >>> Good morning,
>>      >>>
>>      >>> we have recently upgraded our kraken cluster to luminous and since
>>      >>> then noticed an odd behaviour: we cannot add a monitor anymore.
>>      >>>
>>      >>> As soon as we start a new monitor (server2), ceph -s and ceph -w
>>      >>> start to hang.
>>      >>>
>>      >>> The situation became worse when one of our staff stopped an existing
>>      >>> monitor (server1): restarting that monitor results in the same
>>      >>> situation, and ceph -s hangs until we stop the monitor again.
>>      >>>
>>      >>> We kept the monitor running for some minutes, but the situation never
>>      >>> clears up.
>>      >>>
>>      >>> The network does not have any firewall in between the nodes and there
>>      >>> are no host firewalls.
>>      >>>
>>      >>> I have attached the output of the monitor on server1, running in
>>      >>> foreground using
>>      >>>
>>      >>> root@server1:~# ceph-mon -i server1 --pid-file
>>      >>> /var/lib/ceph/run/mon.server1.pid -c /etc/ceph/ceph.conf --cluster ceph
>>      >>> --setuser ceph --setgroup ceph -d 2>&1 | tee cephmonlog
>>      >>>
>>      >>> Does anyone see any obvious problem in the attached log?
>>      >>>
>>      >>> Any input or hint would be appreciated!
>>      >>>
>>      >>> Best,
>>      >>>
>>      >>> Nico
>>      >>>
>>      >>>
>>      >>>
>>      >>> --
>>      >>> Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
>>      >>> _______________________________________________
>>      >>> ceph-users mailing list
>>      >>> ceph-users@xxxxxxxxxxxxxx
>>      >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>      >>>
>>
