Just for my information / learning process: this ping checking, is it a
process itself, or is it part of the OSD process (maybe as a sub-process)?
In that case one could play with nice settings, e.g. marking the process as
realtime, or just giving it a very low nice value. Then, at least, it would
not bail on you when the CPU is at 100% (preferring the pings over the OSD
itself). I have put a very rough sketch of what I mean at the bottom of this
mail.

I am actually new to Ceph (exploring possibilities), so I am not familiar
with its internals. So consider it an "if, then" suggestion.

Regards,

Tim

On Tuesday 14 May 2013 16:22:43 Sage Weil wrote:
> On Tue, 14 May 2013, Chen, Xiaoxi wrote:
> > I like the idea of leaving the ping on the cluster network because it
> > can help us detect switch/NIC failure.
> >
> > What confuses me is that I keep pinging every Ceph node's cluster IP,
> > and it is OK during the whole run with less than 1 ms latency, so why
> > does the heartbeat still suffer? top shows my CPU is not 100% utilized,
> > with ~30% iowait. Enabling jumbo frames **seems** to make things worse
> > (just a feeling, no data to support it).
>
> I say "ping" in the general sense.. it's not using ICMP, but sending
> small messages over a TCP session, doing some minimal processing on the
> other end, and sending them back. If the machine is heavily loaded and
> that thread doesn't get scheduled or somehow blocks, it may be
> problematic.
>
> How responsive generally is the machine under load? Is there available
> CPU?
>
> We can try to enable debugging to see what is going on.. 'debug ms = 1'
> and 'debug osd = 20' is everything we would need, but will incur
> additional load itself and may spoil the experiment...
>
> sage
>
> > Sent from my iPhone
> >
> > On 2013-5-14, at 23:36, "Mark Nelson" <mark.nelson@xxxxxxxxxxx> wrote:
> >
> > > On 05/14/2013 10:30 AM, Sage Weil wrote:
> > >> On Tue, 14 May 2013, Chen, Xiaoxi wrote:
> > >>> Hi
> > >>>
> > >>> We are suffering from our OSDs flipping between up and down (OSD X
> > >>> gets voted down due to 3 missed pings, and after a while it tells
> > >>> the monitor "map xxx wrongly marked me down"), because we are
> > >>> running a sequential write performance test on top of RBDs and the
> > >>> cluster network NICs are really highly utilized (8 Gb/s+ on a 10 Gb
> > >>> network).
> > >>>
> > >>> Is this expected behavior? Or how can I prevent it from happening?
> > >>
> > >> You can increase the heartbeat grace period. The pings are handled
> > >> by a separate thread on the backside interface (if there is one).
> > >> If you are missing pings then the network or scheduler is preventing
> > >> those (small) messages from being processed (there is almost no lock
> > >> contention in that path). Which means it really is taking ~20
> > >> seconds or whatever to handle those messages. It's really a
> > >> question of how unresponsive you want to permit the OSDs to be
> > >> before you consider it a failure..
> > >>
> > >> sage
> > >
> > > It might be worth testing out how long pings or other network
> > > traffic are taking during these tests. There may be some TCP tuning
> > > you can do here, or even consider using a separate network for the
> > > mons.
> > >
> > > Mark
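
For anyone following along, here is roughly what Sage's suggestions would
look like in ceph.conf. This is only a sketch from my side: the 40-second
value is an arbitrary example, and I believe the grace option is read by
both the OSDs and the monitors (so [global] seems the safest place for it),
but please check the documentation for your release:

    [global]
        # allow more missed heartbeats before an OSD is reported down
        # (the default grace is ~20 seconds, as Sage mentions)
        osd heartbeat grace = 40

    [osd]
        # the debug settings Sage asked for; note they add load of their own
        debug ms = 1
        debug osd = 20

As far as I know the debug settings can also be injected into a running OSD
without a restart, e.g. for osd.0:

    ceph tell osd.0 injectargs '--debug-ms 1 --debug-osd 20'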
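
Along the lines of Mark's suggestion, one rough way to record how long
traffic on the cluster network takes while the benchmark is running is
below. Since, as Sage points out, the heartbeats are small TCP messages
rather than ICMP, this is only a coarse proxy, but the timestamped output
makes it easy to line up latency spikes with the "wrongly marked me down"
events in the OSD logs. The peer IP and the 100 ms threshold are just
placeholders:

    # timestamped pings to a peer's cluster-network IP for the whole run
    ping -D -i 1 192.168.100.12 | tee /tmp/cluster-ping.log

    # afterwards, print the lines where a reply took more than 100 ms
    awk -F'time=' 'NF > 1 && $2 + 0 > 100' /tmp/cluster-ping.log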
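
And the rough sketch of my own nice/realtime idea, purely as an experiment
and not something I have verified. From Sage's description the heartbeat
handling is a thread inside the ceph-osd process rather than a separate
process, so the crude version is to raise the priority of the whole OSD.
The -5 and 50 values below are arbitrary, and both commands need root:

    # lower the nice value of every ceph-osd; on Linux nice is a per-thread
    # attribute, so renice on the main PID may only change the main thread --
    # hitting the heartbeat thread specifically would need the TIDs from
    # /proc/<pid>/task
    for pid in $(pgrep ceph-osd); do
        renice -n -5 -p "$pid"
    done

    # or, more aggressively, move the OSDs into the SCHED_FIFO real-time
    # class (-a asks chrt to apply it to all threads of the process, if your
    # chrt supports it); a busy real-time OSD can starve the rest of the box
    for pid in $(pgrep ceph-osd); do
        chrt -a -f -p 50 "$pid"
    done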