Re: problem returning mon back to cluster

Nikola Ciprich <nikola.ciprich@xxxxxxxxxxx> · Tue, 15 Oct 2019 07:17:38 +0200

On Tue, Oct 15, 2019 at 06:50:31AM +0200, Nikola Ciprich wrote:
> 
> 
> On Mon, Oct 14, 2019 at 11:52:55PM +0200, Paul Emmerich wrote:
> > How big is the mon's DB?  As in just the total size of the directory you copied
> > 
> > FWIW I recently had to perform mon surgery on a 14.2.4 (or was it
> > 14.2.2?) cluster with 8 GB mon size and I encountered no such problems
> > while syncing a new mon which took 10 minutes or so.
> Hi Paul,
> 
> yup I forgot to mention this.. It doesn't seem to be too big, just about
> 100MB. I also noticed that while the third monitor tries to join the cluster,
> the leader starts flapping between "leader" and "electing", so I suppose it's
> a quorum forming problem.. I tried bumping debug_ms and debug_paxos but
> couldn't make heads or tails of it.. I can paste the logs somewhere if that
> would help

btw I just noticed that on the test cluster, the third mon finally managed to join
the cluster and quorum got formed.. after more than 6 hours.. knowing that client
IO is blocked during that time, it's pretty scary

now I can stop/start monitors without problems on it.. so it somehow got "fixed"

still dunno what to do with this production cluster though, so I'll just prepare
the test environment again and try digging more into it
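
For the digging, a small script that records how long a mon spends in each state (and how often the leader flaps between "leader" and "electing") might make the logs easier to correlate. The following is only a rough sketch: it assumes `ceph daemon mon.<id> mon_status` prints JSON with a top-level "state" field, and the helper names, polling interval and duration are made up for illustration.

```python
import json
import subprocess
import time


def parse_state(raw):
    """Extract the "state" field from mon_status JSON (assumed layout)."""
    return json.loads(raw).get("state", "unknown")


def read_mon_state(mon_id):
    """Query the local mon's admin socket and return its state string."""
    out = subprocess.check_output(
        ["ceph", "daemon", "mon.%s" % mon_id, "mon_status"]
    )
    return parse_state(out)


def summarize(samples):
    """Summarize a time-ordered list of (timestamp_seconds, state) samples.

    Returns (transitions, seconds_per_state). The time between two
    consecutive samples is credited to the earlier sample's state.
    """
    transitions = 0
    per_state = {}
    for (t0, s0), (t1, s1) in zip(samples, samples[1:]):
        per_state[s0] = per_state.get(s0, 0) + (t1 - t0)
        if s1 != s0:
            transitions += 1
    return transitions, per_state


def watch(mon_id, interval=5.0, duration=600.0):
    """Poll one mon for `duration` seconds and summarize what was seen."""
    samples = []
    end = time.time() + duration
    while time.time() < end:
        samples.append((time.time(), read_mon_state(mon_id)))
        time.sleep(interval)
    return summarize(samples)
```

Running `watch("nodev1c")` on the leader's node while the third mon is started should then give a count of state changes plus the seconds spent per state; it has to run on the host where that mon's admin socket lives.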

BR

nik

> 
> BR
> 
> nik
> 
> 
> 
> > 
> > Paul
> > 
> > -- 
> > Paul Emmerich
> > 
> > Looking for help with your Ceph cluster? Contact us at https://croit.io
> > 
> > croit GmbH
> > Freseniusstr. 31h
> > 81247 München
> > www.croit.io
> > Tel: +49 89 1896585 90
> > 
> > On Mon, Oct 14, 2019 at 9:41 PM Nikola Ciprich
> > <nikola.ciprich@xxxxxxxxxxx> wrote:
> > >
> > > On Mon, Oct 14, 2019 at 04:31:22PM +0200, Nikola Ciprich wrote:
> > > > On Mon, Oct 14, 2019 at 01:40:19PM +0200, Harald Staub wrote:
> > > > > Probably same problem here. When I try to add another MON, "ceph
> > > > > health" becomes mostly unresponsive. One of the existing ceph-mon
> > > > > processes uses 100% CPU for several minutes. Tried it on 2 test
> > > > > clusters (14.2.4, 3 MONs, 5 storage nodes with around 2 hdd osds
> > > > > each). To avoid errors like "lease timeout", I temporarily increase
> > > > > "mon lease", from 5 to 50 seconds.
> > > > >
> > > > > Not sure how bad it is from a customer PoV. But it is a problem by
> > > > > itself to be several minutes without "ceph health", when there is an
> > > > > increased risk of losing the quorum ...
> > > >
> > > > Hi Harry,
> > > >
> > > > thanks a lot for your reply! not sure we're experiencing the same issue,
> > > > I don't have it on any other cluster.. when this is happening to you, does
> > > > only ceph health stop working, or does it also block all client IO?
> > > >
> > > > BR
> > > >
> > > > nik
> > > >
> > > >
> > > > >
> > > > >  Harry
> > > > >
> > > > > On 13.10.19 20:26, Nikola Ciprich wrote:
> > > > > >dear ceph users and developers,
> > > > > >
> > > > > > >on one of our production clusters, we got into a pretty unpleasant situation.
> > > > > >
> > > > > >After rebooting one of the nodes, when trying to start monitor, whole cluster
> > > > > >seems to hang, including IO, ceph -s etc. When this mon is stopped again,
> > > > > > >everything seems to continue. Trying to spawn a new monitor leads to the same problem
> > > > > >(even on different node).
> > > > > >
> > > > > >I had to give up after minutes of outage, since it's unacceptable. I think we had this
> > > > > >problem once in the past on this cluster, but after some (but much shorter) time, monitor
> > > > > >joined and it worked fine since then.
> > > > > >
> > > > > >All cluster nodes are centos 7 machines, I have 3 monitors (so 2 are now running), I'm
> > > > > >using ceph 13.2.6
> > > > > >
> > > > > >Network connection seems to be fine.
> > > > > >
> > > > > >Anyone seen similar problem? I'd be very grateful for tips on how to debug and solve this..
> > > > > >
> > > > > >for those interested, here's log of one of running monitors with debug_mon set to 10/10:
> > > > > >
> > > > > >https://storage.lbox.cz/public/d258d0
> > > > > >
> > > > > >if I could provide more info, please let me know
> > > > > >
> > > > > >with best regards
> > > > > >
> > > > > >nikola ciprich
> > >
> > > just to add a quick update, I was able to reproduce the issue by transferring the monitor
> > > directories to a test environment with the same IP addressing, so I can safely play with that
> > > now..
> > >
> > > increasing lease timeout didn't help me to fix the problem,
> > > but at least I seem to be able to use ceph -s now.
> > >
> > > few things I noticed in the meantime:
> > >
> > > - when I start problematic monitor, monitor slow ops start to appear for
> > > quorum leader and the count is slowly increasing:
> > >
> > >             44 slow ops, oldest one blocked for 130 sec, mon.nodev1c has slow ops
> > >
> > > - removing and recreating monitor didn't help
> > >
> > > - checking mon_status of problematic monitor shows it remains in the "synchronizing" state
> > >
> > > I tried increasing debug_ms and debug_paxos but didn't see anything useful there..
> > >
> > > will report further when I've got something. If anyone has any idea in the meantime, please
> > > let me know.
> > >
> > > BR
> > >
> > > nik
> > >
> > >
> > >
> > >
> > > --
> > > -------------------------------------
> > > Ing. Nikola CIPRICH
> > > LinuxBox.cz, s.r.o.
> > > 28. rijna 168, 709 00 Ostrava
> > >
> > > tel.:   +420 591 166 214
> > > fax:    +420 596 621 273
> > > mobil:  +420 777 093 799
> > >
> > > www.linuxbox.cz
> > >
> > > mobil servis: +420 737 238 656
> > > email servis: servis@xxxxxxxxxxx
> > > -------------------------------------
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-users@xxxxxxxxxxxxxx
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> 
