16 osds: 11 up, 16 in

hell@xxxxxxxxxxx (Sergey Malinin) · Wed, 7 May 2014 23:40:34 +0300

Check dmesg and SMART data on both nodes. This behaviour is similar to that of a failing HDD.
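
As a rough aid for the dmesg part, here is a minimal Python sketch that filters kernel log text for lines that often accompany a failing disk. The pattern list and the name `find_disk_errors` are my own assumptions for illustration. The list is not exhaustive, and a clean result does not prove the disk is healthy. SMART data would still need to be checked separately, for example with `smartctl -a` on each device.

```python
import re

# Illustrative patterns only (an assumption, not an exhaustive list of
# kernel messages that indicate disk trouble).
DISK_ERROR_PATTERNS = [
    re.compile(r"I/O error", re.IGNORECASE),
    re.compile(r"medium error", re.IGNORECASE),
    re.compile(r"failed command", re.IGNORECASE),
]


def find_disk_errors(dmesg_text):
    """Return the lines of dmesg_text that match any pattern above."""
    return [
        line
        for line in dmesg_text.splitlines()
        if any(p.search(line) for p in DISK_ERROR_PATTERNS)
    ]


if __name__ == "__main__":
    import subprocess

    # Reading the kernel ring buffer may require root on some systems.
    out = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    for line in find_disk_errors(out):
        print(line)
```

Run it on each OSD host; any hits are only a starting point for a closer look at the affected device.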

On Wednesday, May 7, 2014 at 23:28, Craig Lewis wrote:

> On 5/7/14 13:15 , Sergey Malinin wrote:
> > Is there anything unusual in dmesg at osd.5? 
> 
> Nothing in dmesg, but ceph-osd.5.log has plenty.  I've attached the log after the restart.  Logging levels are normal.
> 
> What jumps out at me is:
> 2014-05-07 12:48:02.640164 7ff65d439700 -1 osd.5 38870 heartbeat_check: no reply from osd.8 ever on either front or back, first ping sent 2014-05-07 12:47:42.335591 (cutoff 2014-05-07 12:47:42.640163)
> 2014-05-07 12:48:02.640174 7ff65d439700 -1 osd.5 38870 heartbeat_check: no reply from osd.11 ever on either front or back, first ping sent 2014-05-07 12:47:42.335591 (cutoff 2014-05-07 12:47:42.640163)
> 2014-05-07 12:48:02.640180 7ff65d439700 -1 osd.5 38870 heartbeat_check: no reply from osd.12 ever on either front or back, first ping sent 2014-05-07 12:47:42.335591 (cutoff 2014-05-07 12:47:42.640163)
> 2014-05-07 12:48:02.640186 7ff65d439700 -1 osd.5 38870 heartbeat_check: no reply from osd.13 ever on either front or back, first ping sent 2014-05-07 12:47:42.335591 (cutoff 2014-05-07 12:47:42.640163)
> 
> osd.5 is on host ceph0c.
> osd.8, osd.11, osd.12, and osd.13 are all on ceph1c (along with 4 other OSDs).  Both the front and back networks are working fine, and I can connect to osd.8 from host ceph0 just fine.  These 4 OSDs are not reporting problems.
> 
> The other OSDs that are flapping are osd.6 and osd.15.
> 
> osd.15 says:
> 2014-05-07 13:25:44.626840 7fe312c9d700 -1 osd.15 38891 heartbeat_check: no reply from osd.5 since back 2014-05-07 13:10:01.239883 front 2014-05-07 13:10:01.239883 (cutoff 2014-05-07 13:25:24.626838)
> 2014-05-07 13:25:44.626849 7fe312c9d700 -1 osd.15 38891 heartbeat_check: no reply from osd.11 since back 2014-05-07 13:22:48.592121 front 2014-05-07 13:22:48.592121 (cutoff 2014-05-07 13:25:24.626838)
> 
> osd.6 says:
> 2014-05-07 13:26:15.409217 7f35e50a7700 -1 osd.6 38891 heartbeat_check: no reply from osd.5 since back 2014-05-07 13:09:57.440713 front 2014-05-07 13:09:57.440713 (cutoff 2014-05-07 13:25:55.409216)
> 2014-05-07 13:26:15.409227 7f35e50a7700 -1 osd.6 38891 heartbeat_check: no reply from osd.11 since back 2014-05-07 13:22:50.353671 front 2014-05-07 13:22:50.353671 (cutoff 2014-05-07 13:25:55.409216)
> 2014-05-07 13:26:15.409235 7f35e50a7700 -1 osd.6 38891 heartbeat_check: no reply from osd.13 since back 2014-05-07 13:11:26.959761 front 2014-05-07 13:11:26.959761 (cutoff 2014-05-07 13:25:55.409216)
> 2014-05-07 13:26:15.409306 7f35e13e5700  0 -- 10.194.0.6:0/17586 >> 10.194.0.7:6803/19641 pipe(0x1c4d7500 sd=79 :56788 s=1 pgs=0 cs=0 l=1 c=0x1c646840).connect claims to be 10.194.0.7:6803/1019705 not 10.194.0.7:6803/19641 - wrong node!
> 
> osd.11 and osd.13 have been kicked out for being unresponsive, but they don't have any heartbeat_check entries in their logs.
> 
> Attachments: 
> - ceph-osd.5.log
> 
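
To see at a glance which OSDs are being reported as silent, and by whom, the heartbeat_check lines quoted above can be tallied with a short script. This is only a sketch: the regular expression is derived from the log lines in this thread, and the name `tally_missed_heartbeats` is hypothetical. It assumes the log format matches what is shown here.

```python
import re
from collections import defaultdict

# Matches e.g. "osd.5 38870 heartbeat_check: no reply from osd.8 ..."
# Group 1 is the reporting OSD id, group 2 is the silent peer's id.
HEARTBEAT_RE = re.compile(
    r"osd\.(\d+) \d+ heartbeat_check: no reply from osd\.(\d+)"
)


def tally_missed_heartbeats(lines):
    """Map each silent peer OSD id to the set of OSD ids that reported it."""
    reporters_by_peer = defaultdict(set)
    for line in lines:
        m = HEARTBEAT_RE.search(line)
        if m:
            reporter, peer = int(m.group(1)), int(m.group(2))
            reporters_by_peer[peer].add(reporter)
    return dict(reporters_by_peer)


if __name__ == "__main__":
    import sys

    # Usage: python tally.py ceph-osd.5.log ceph-osd.6.log ...
    all_lines = []
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            all_lines.extend(fh)
    for peer, reporters in sorted(tally_missed_heartbeats(all_lines).items()):
        print(f"osd.{peer} reported silent by: {sorted(reporters)}")
```

A peer that several unrelated OSDs report as silent is the more likely culprit; a single OSD complaining about many peers points back at the reporter, its disk, or its host.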
