Just for my information / learning process: this ping checking, is it a
process itself, or is it part of the OSD process (maybe as a sub-process)?
In that case one could play with nice settings, e.g. marking the process as
realtime, or just giving it a very low nice value. Then, at least, it would
not bail on you when the CPU is at 100% (preferring the pings over the OSD
itself). I have put a very rough sketch of what I mean at the bottom of this
mail.

I am actually new to Ceph (exploring possibilities), so I am not familiar
with its internals. So consider it an "if, then" suggestion.

Regards,

Tim

On Tuesday 14 May 2013 16:22:43 Sage Weil wrote:
> On Tue, 14 May 2013, Chen, Xiaoxi wrote:
> > I like the idea of leaving the ping on the cluster network because it
> > can help us detect switch/NIC failure.
> >
> > What confuses me is that I keep pinging every Ceph node's cluster IP,
> > and it is OK during the whole run with less than 1 ms latency, so why
> > does the heartbeat still suffer? top shows my CPU is not 100% utilized,
> > with ~30% iowait. Enabling jumbo frames **seems** to make things worse
> > (just a feeling, no data to support it).
>
> I say "ping" in the general sense.. it's not using ICMP, but sending
> small messages over a TCP session, doing some minimal processing on the
> other end, and sending them back. If the machine is heavily loaded and
> that thread doesn't get scheduled or somehow blocks, it may be
> problematic.
>
> How responsive generally is the machine under load? Is there available
> CPU?
>
> We can try to enable debugging to see what is going on.. 'debug ms = 1'
> and 'debug osd = 20' is everything we would need, but will incur
> additional load itself and may spoil the experiment...
>
> sage
>
> > Sent from my iPhone
> >
> > On 2013-5-14, at 23:36, "Mark Nelson" <mark.nelson@xxxxxxxxxxx> wrote:
> >
> > > On 05/14/2013 10:30 AM, Sage Weil wrote:
> > >> On Tue, 14 May 2013, Chen, Xiaoxi wrote:
> > >>> Hi
> > >>>
> > >>> We are suffering from our OSDs flipping between up and down (OSD X
> > >>> gets voted down due to 3 missed pings, and after a while it tells
> > >>> the monitor "map xxx wrongly marked me down"), because we are
> > >>> running a sequential write performance test on top of RBDs and the
> > >>> cluster network NICs are really highly utilized (8 Gb/s+ on a 10 Gb
> > >>> network).
> > >>>
> > >>> Is this expected behavior? Or how can I prevent it from happening?
> > >>
> > >> You can increase the heartbeat grace period. The pings are handled
> > >> by a separate thread on the backside interface (if there is one).
> > >> If you are missing pings then the network or scheduler is preventing
> > >> those (small) messages from being processed (there is almost no lock
> > >> contention in that path). Which means it really is taking ~20
> > >> seconds or whatever to handle those messages. It's really a
> > >> question of how unresponsive you want to permit the OSDs to be
> > >> before you consider it a failure..
> > >>
> > >> sage
> > >
> > > It might be worth testing out how long pings or other network
> > > traffic are taking during these tests. There may be some TCP tuning
> > > you can do here, or even consider using a separate network for the
> > > mons.
> > >
> > > Mark
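
For anyone following along, here is roughly what Sage's suggestions would
look like in ceph.conf. This is only a sketch from my side: the 40-second
value is an arbitrary example, and I believe the grace option is read by
both the OSDs and the monitors (so [global] seems the safest place for it),
but please check the documentation for your release:

    [global]
        # allow more missed heartbeats before an OSD is reported down
        # (the default grace is ~20 seconds, as Sage mentions)
        osd heartbeat grace = 40

    [osd]
        # the debug settings Sage asked for; note they add load of their own
        debug ms = 1
        debug osd = 20

As far as I know the debug settings can also be injected into a running OSD
without a restart, e.g. for osd.0:

    ceph tell osd.0 injectargs '--debug-ms 1 --debug-osd 20'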
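
Along the lines of Mark's suggestion, one rough way to record how long
traffic on the cluster network takes while the benchmark is running is
below. Since, as Sage points out, the heartbeats are small TCP messages
rather than ICMP, this is only a coarse proxy, but the timestamped output
makes it easy to line up latency spikes with the "wrongly marked me down"
events in the OSD logs. The peer IP and the 100 ms threshold are just
placeholders:

    # timestamped pings to a peer's cluster-network IP for the whole run
    ping -D -i 1 192.168.100.12 | tee /tmp/cluster-ping.log

    # afterwards, print the lines where a reply took more than 100 ms
    awk -F'time=' 'NF > 1 && $2 + 0 > 100' /tmp/cluster-ping.log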
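
And the rough sketch of my own nice/realtime idea, purely as an experiment
and not something I have verified. From Sage's description the heartbeat
handling is a thread inside the ceph-osd process rather than a separate
process, so the crude version is to raise the priority of the whole OSD.
The -5 and 50 values below are arbitrary, and both commands need root:

    # lower the nice value of every ceph-osd; on Linux nice is a per-thread
    # attribute, so renice on the main PID may only change the main thread --
    # hitting the heartbeat thread specifically would need the TIDs from
    # /proc/<pid>/task
    for pid in $(pgrep ceph-osd); do
        renice -n -5 -p "$pid"
    done

    # or, more aggressively, move the OSDs into the SCHED_FIFO real-time
    # class (-a asks chrt to apply it to all threads of the process, if your
    # chrt supports it); a busy real-time OSD can starve the rest of the box
    for pid in $(pgrep ceph-osd); do
        chrt -a -f -p 50 "$pid"
    done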