Re: OSD state flipping when cluster-network in high utilization

Hi Tim,

On Wed, May 15, 2013 at 09:25:35PM +0200, Tim Mohlmann wrote:
> Just for my information / learning process:
> 
> This ping checking, is it a process in itself, or is it part of the OSD
> process (maybe as a sub-process)? In that case one could play with nice
> settings, e.g. marking the process as realtime, or giving it a very low nice
> value. Then, at least, it would not bail on you when the CPU is at 100%
> (preferring the pings over the OSD itself).
> 

I believe it is a thread of the OSD process.

So it could in theory be that something in the process, like a lock, or the
kernel scheduler, is blocking the ping thread from being scheduled.
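
For what it's worth, if you want to see whether that thread is being starved,
you can watch the individual threads of the OSD process under load with
something like (substitute the actual OSD pid):

    top -H -p <pid-of-ceph-osd>

If the heartbeat thread gets no CPU time while others are pegged, that would
point at scheduling rather than the network.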

> I am actually new to ceph (exploring possibilities), so I am not familiar
> with its internals. So consider it an "if, then" suggestion.
> 

That is fine, we all have to learn it somehow. :-)

> Regards,
> 
> Tim
> 
> On Tuesday 14 May 2013 16:22:43 Sage Weil wrote:
> > On Tue, 14 May 2013, Chen, Xiaoxi wrote: 
> > > I like the idea of leaving the ping in the cluster network because it can
> > > help us detect switch/NIC failure.
> > > 
> > > What confuses me is that I keep pinging every ceph node's cluster IP, and
> > > it is OK during the whole run with less than 1 ms latency, so why does the
> > > heartbeat still suffer? TOP shows my CPU is not 100% utilized (with ~30%
> > > io wait). Enabling jumbo frames **seems** to make things worse (just a
> > > feeling, no data to support it).
> > 
> > I say "ping" in the general sense.. it's not using ICMP, but sending
> > small messages over a TCP session, doing some minimal processing on the
> > other end, and sending them back.  If the machine is heavily loaded and
> > that thread doesn't get scheduled or somehow blocks, it may be
> > problematic.
> > 
> > How responsive generally is the machine under load?  Is there available
> > CPU?
> > 
> > We can try to enable debugging to see what is going on.. 'debug ms = 1'
> > and 'debug osd = 20' is everything we would need, but it will incur
> > additional load itself and may spoil the experiment...
> > 
> > sage
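
(Side note: if I recall correctly those settings can also be injected into
running OSDs without a restart, something along the lines of:

    ceph tell osd.* injectargs '--debug-ms 1 --debug-osd 20'

but that is from memory, so double-check the syntax for your version, and set
the values back afterwards since level-20 logging is heavy.)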
> > 
> > > Sent from my iPhone
> > > 
> > > On 2013-5-14, at 23:36, "Mark Nelson" <mark.nelson@xxxxxxxxxxx> wrote:
> > > 
> > > > On 05/14/2013 10:30 AM, Sage Weil wrote:
> > > >> On Tue, 14 May 2013, Chen, Xiaoxi wrote:
> > > >>> Hi
> > > >>> 
> > > >>> We are suffering from our OSDs flipping between up and down (OSD X is
> > > >>> voted down due to 3 missed pings, and after a while it tells the
> > > >>> monitor "map xxx wrongly marked me down"). We are running a sequential
> > > >>> write performance test on top of RBDs, and the cluster network NICs
> > > >>> are really at high utilization (8Gb/s+ on a 10Gb network).
> > > >>> 
> > > >>> Is this expected behavior, or how can I prevent it from happening?
> > > >> 
> > > >> You can increase the heartbeat grace period.  The pings are handled by
> > > >> a separate thread on the backside interface (if there is one).  If you
> > > >> are missing pings then the network or scheduler is preventing those
> > > >> (small) messages from being processed (there is almost no lock
> > > >> contention in that path).  Which means it really is taking ~20 seconds
> > > >> or whatever to handle those messages.  It's really a question of how
> > > >> unresponsive you want to permit the OSDs to be before you consider it a
> > > >> failure..
> > > >> 
> > > >> sage
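
(For reference, the grace period Sage mentions would be set with something
like the following in ceph.conf on the OSD hosts; 30 is only an example
value, the default is around 20 seconds:

    [osd]
        osd heartbeat grace = 30

Check the documentation for your release for the exact option name.)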
> > > > 
> > > > It might be worth testing out how long pings or other network traffic
> > > > are taking during these tests.  There may be some tcp tuning you can
> > > > do here, or even consider using a separate network for the mons.
> > > > 
> > > > Mark
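
If you do go down the tcp tuning route, the usual starting points on 10GbE
are the socket buffer limits, for example (values are illustrative only,
measure before and after):

    sysctl -w net.core.rmem_max=16777216
    sysctl -w net.core.wmem_max=16777216
    sysctl -w net.ipv4.tcp_rmem='4096 87380 16777216'
    sysctl -w net.ipv4.tcp_wmem='4096 65536 16777216'

Though if the heartbeats really take ~20 seconds to get through, scheduling
seems a more likely culprit than socket buffering.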
> > 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



