Check dmesg and SMART data on both nodes. This behaviour is similar to that of a failing HDD.

On Wednesday, May 7, 2014 at 23:28, Craig Lewis wrote:
> On 5/7/14 13:15 , Sergey Malinin wrote:
> > Is there anything unusual in dmesg at osd.5?
>
> Nothing in dmesg, but ceph-osd.5.log has plenty. I've attached the log after the restart. Logging levels are normal.
>
> What jumps out at me is:
> 2014-05-07 12:48:02.640164 7ff65d439700 -1 osd.5 38870 heartbeat_check: no reply from osd.8 ever on either front or back, first ping sent 2014-05-07 12:47:42.335591 (cutoff 2014-05-07 12:47:42.640163)
> 2014-05-07 12:48:02.640174 7ff65d439700 -1 osd.5 38870 heartbeat_check: no reply from osd.11 ever on either front or back, first ping sent 2014-05-07 12:47:42.335591 (cutoff 2014-05-07 12:47:42.640163)
> 2014-05-07 12:48:02.640180 7ff65d439700 -1 osd.5 38870 heartbeat_check: no reply from osd.12 ever on either front or back, first ping sent 2014-05-07 12:47:42.335591 (cutoff 2014-05-07 12:47:42.640163)
> 2014-05-07 12:48:02.640186 7ff65d439700 -1 osd.5 38870 heartbeat_check: no reply from osd.13 ever on either front or back, first ping sent 2014-05-07 12:47:42.335591 (cutoff 2014-05-07 12:47:42.640163)
>
> osd.5 is on host ceph0c.
> osd.8, osd.11, osd.12, and osd.13 are all on ceph1c (along with 4 other OSDs). Both the front and back network are working fine, and I can connect to osd.8 from host ceph0 just fine. These 4 OSDs are not reporting problems.
>
> The other OSDs that are flapping are osd.6 and osd.15.
>
> osd.15 says:
> 2014-05-07 13:25:44.626840 7fe312c9d700 -1 osd.15 38891 heartbeat_check: no reply from osd.5 since back 2014-05-07 13:10:01.239883 front 2014-05-07 13:10:01.239883 (cutoff 2014-05-07 13:25:24.626838)
> 2014-05-07 13:25:44.626849 7fe312c9d700 -1 osd.15 38891 heartbeat_check: no reply from osd.11 since back 2014-05-07 13:22:48.592121 front 2014-05-07 13:22:48.592121 (cutoff 2014-05-07 13:25:24.626838)
>
> osd.6 says:
> 2014-05-07 13:26:15.409217 7f35e50a7700 -1 osd.6 38891 heartbeat_check: no reply from osd.5 since back 2014-05-07 13:09:57.440713 front 2014-05-07 13:09:57.440713 (cutoff 2014-05-07 13:25:55.409216)
> 2014-05-07 13:26:15.409227 7f35e50a7700 -1 osd.6 38891 heartbeat_check: no reply from osd.11 since back 2014-05-07 13:22:50.353671 front 2014-05-07 13:22:50.353671 (cutoff 2014-05-07 13:25:55.409216)
> 2014-05-07 13:26:15.409235 7f35e50a7700 -1 osd.6 38891 heartbeat_check: no reply from osd.13 since back 2014-05-07 13:11:26.959761 front 2014-05-07 13:11:26.959761 (cutoff 2014-05-07 13:25:55.409216)
> 2014-05-07 13:26:15.409306 7f35e13e5700 0 -- 10.194.0.6:0/17586 >> 10.194.0.7:6803/19641 pipe(0x1c4d7500 sd=79 :56788 s=1 pgs=0 cs=0 l=1 c=0x1c646840).connect claims to be 10.194.0.7:6803/1019705 not 10.194.0.7:6803/19641 - wrong node!
>
> osd.11 and osd.13 have been kicked out for being unresponsive, but they don't have any heartbeat_check entries in their logs.
>
> --
> Craig Lewis
> Senior Systems Engineer
> Office +1.714.602.1309
> Email clewis at centraldesktop.com
> Central Desktop. Work together in ways you never thought possible.
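When several OSDs complain about each other, it can help to tally who reports whom before deciding which disk or host to inspect first. The following is only a rough Python sketch: the regular expression is written against the heartbeat_check lines quoted in this thread, and the function name summarize_heartbeat_failures and the SAMPLE list are made up for illustration, not part of Ceph. A peer that several independent OSDs report as silent is a reasonable first candidate for the dmesg and SMART checks, although that is a heuristic and not a diagnosis.

```python
import re
from collections import defaultdict

# Pattern based on the heartbeat_check lines quoted in this thread, e.g.
# "... -1 osd.15 38891 heartbeat_check: no reply from osd.5 since back ..."
# Other Ceph versions may format these lines differently (assumption).
HEARTBEAT_RE = re.compile(
    r"(?P<reporter>osd\.\d+)\s+\d+\s+heartbeat_check: no reply from (?P<peer>osd\.\d+)"
)


def summarize_heartbeat_failures(lines):
    """Return {silent_peer: {reporting_osd: count}} for heartbeat_check lines."""
    summary = defaultdict(lambda: defaultdict(int))
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            summary[match["peer"]][match["reporter"]] += 1
    # Convert nested defaultdicts to plain dicts for easier printing/comparison.
    return {peer: dict(reporters) for peer, reporters in summary.items()}


# A few lines copied from the logs quoted above (trimmed after the peer name).
SAMPLE = [
    "2014-05-07 12:48:02.640164 7ff65d439700 -1 osd.5 38870 heartbeat_check: no reply from osd.8 ever on either front or back",
    "2014-05-07 13:25:44.626840 7fe312c9d700 -1 osd.15 38891 heartbeat_check: no reply from osd.5 since back 2014-05-07 13:10:01.239883",
    "2014-05-07 13:26:15.409217 7f35e50a7700 -1 osd.6 38891 heartbeat_check: no reply from osd.5 since back 2014-05-07 13:09:57.440713",
    "2014-05-07 13:26:15.409227 7f35e50a7700 -1 osd.6 38891 heartbeat_check: no reply from osd.11 since back 2014-05-07 13:22:50.353671",
]

if __name__ == "__main__":
    for peer, reporters in sorted(summarize_heartbeat_failures(SAMPLE).items()):
        print(peer, "reported silent by", reporters)
```

To run it against real logs, pass the lines of each ceph-osd.N.log file (for example via open(path)) instead of SAMPLE.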