Re: problem returning mon back to cluster

On Mon, Oct 14, 2019 at 11:52:55PM +0200, Paul Emmerich wrote:
> How big is the mon's DB? As in, just the total size of the directory you copied?
> 
> FWIW I recently had to perform mon surgery on a 14.2.4 (or was it
> 14.2.2?) cluster with 8 GB mon size and I encountered no such problems
> while syncing a new mon which took 10 minutes or so.
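
(For reference, assuming the default data path and that the mon id is the short
hostname, the store size can be checked on the mon host with something like:

    du -sh /var/lib/ceph/mon/ceph-$(hostname -s)/store.db
)
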
Hi Paul,

yup, I forgot to mention this. It doesn't seem to be too big, just about
100 MB. I also noticed that while the third monitor tries to join the cluster,
the leader starts flapping between "leader" and "electing", so I suppose it's
a quorum-forming problem. I tried bumping debug_ms and debug_paxos but
couldn't make heads or tails of it. I can paste the logs somewhere if that
helps.
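
In case it helps anyone trying to reproduce this, the debug levels can be
bumped at runtime without restarting the mons with something like (the levels
here are just examples):

    # raise messenger and paxos debugging on all monitors
    ceph tell mon.* injectargs '--debug_ms 1 --debug_paxos 20'

    # or via the admin socket of a single mon (id is usually the hostname)
    ceph daemon mon.$(hostname -s) config set debug_paxos 20

and reverted the same way once the logs have been captured.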

BR

nik



> 
> Paul
> 
> -- 
> Paul Emmerich
> 
> Looking for help with your Ceph cluster? Contact us at https://croit.io
> 
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
> 
> On Mon, Oct 14, 2019 at 9:41 PM Nikola Ciprich
> <nikola.ciprich@xxxxxxxxxxx> wrote:
> >
> > On Mon, Oct 14, 2019 at 04:31:22PM +0200, Nikola Ciprich wrote:
> > > On Mon, Oct 14, 2019 at 01:40:19PM +0200, Harald Staub wrote:
> > > > Probably the same problem here. When I try to add another MON, "ceph
> > > > health" becomes mostly unresponsive. One of the existing ceph-mon
> > > > processes uses 100% CPU for several minutes. Tried it on 2 test
> > > > clusters (14.2.4, 3 MONs, 5 storage nodes with around 2 HDD OSDs
> > > > each). To avoid errors like "lease timeout", I temporarily increase
> > > > "mon lease" from 5 to 50 seconds.
> > > >
> > > > Not sure how bad it is from a customer PoV, but it is a problem in
> > > > itself to be without "ceph health" for several minutes when there is an
> > > > increased risk of losing the quorum ...
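
(On a 14.x cluster, a temporary bump like that can be applied with something
along the lines of:

    ceph config set mon mon_lease 50
    # or, without touching the stored config:
    ceph tell mon.* injectargs '--mon_lease 50'

and reverted to the default of 5 seconds once the new mon has joined.)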
> > >
> > > Hi Harry,
> > >
> > > thanks a lot for your reply! Not sure we're experiencing the same issue;
> > > I don't have it on any other cluster. When this is happening to you, does
> > > only "ceph health" stop working, or does it also block all client IO?
> > >
> > > BR
> > >
> > > nik
> > >
> > >
> > > >
> > > >  Harry
> > > >
> > > > On 13.10.19 20:26, Nikola Ciprich wrote:
> > > > >dear ceph users and developers,
> > > > >
> > > > >on one of our production clusters, we got into pretty unpleasant situation.
> > > > >
> > > > >After rebooting one of the nodes, when trying to start its monitor, the whole cluster
> > > > >seems to hang, including IO, ceph -s etc. When this mon is stopped again,
> > > > >everything seems to continue. Trying to spawn a new monitor leads to the same problem
> > > > >(even on a different node).
> > > > >
> > > > >I had to give up after minutes of outage, since that's unacceptable. I think we had this
> > > > >problem once in the past on this cluster, but after some (much shorter) time the monitor
> > > > >joined and it has worked fine since then.
> > > > >
> > > > >All cluster nodes are CentOS 7 machines, I have 3 monitors (so 2 are now running), and I'm
> > > > >using Ceph 13.2.6.
> > > > >
> > > > >Network connection seems to be fine.
> > > > >
> > > > >Has anyone seen a similar problem? I'd be very grateful for tips on how to debug and solve this.
> > > > >
> > > > >For those interested, here's a log from one of the running monitors with debug_mon set to 10/10:
> > > > >
> > > > >https://storage.lbox.cz/public/d258d0
> > > > >
> > > > >If I can provide more info, please let me know.
> > > > >
> > > > >with best regards
> > > > >
> > > > >nikola ciprich
> >
> > Just to add a quick update: I was able to reproduce the issue by transferring the monitor
> > directories to a test environment with the same IP addressing, so I can safely play with that
> > now.
> >
> > Increasing the lease timeout didn't fix the problem,
> > but at least I seem to be able to use ceph -s now.
> >
> > A few things I noticed in the meantime:
> >
> > - when I start the problematic monitor, monitor slow ops start to appear for
> > the quorum leader and the count slowly increases:
> >
> >             44 slow ops, oldest one blocked for 130 sec, mon.nodev1c has slow ops
> >
> > - removing and recreating the monitor didn't help
> >
> > - checking mon_status of the problematic monitor shows it remains in the "synchronizing" state
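
(The state of the joining mon can be checked directly via its admin socket,
independent of quorum, with something like:

    ceph daemon mon.<id> mon_status

where <id> is the mon's name; the "state" field shows e.g. "synchronizing".)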
> >
> > I tried increasing debug_ms and debug_paxos but didn't see anything useful there.
> >
> > I will report further when I have something. If anyone has any idea in the meantime, please
> > let me know.
> >
> > BR
> >
> > nik
> >
> >
> >
> >
> > --
> > -------------------------------------
> > Ing. Nikola CIPRICH
> > LinuxBox.cz, s.r.o.
> > 28. rijna 168, 709 00 Ostrava
> >
> > tel.:   +420 591 166 214
> > fax:    +420 596 621 273
> > mobil:  +420 777 093 799
> >
> > www.linuxbox.cz
> >
> > mobil servis: +420 737 238 656
> > email servis: servis@xxxxxxxxxxx
> > -------------------------------------
> 

-- 
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava

tel.:   +420 591 166 214
fax:    +420 596 621 273
mobil:  +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: servis@xxxxxxxxxxx
-------------------------------------
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



