----- On 29 Nov 2016 15:20, Nick Fisk nick@xxxxxxxxxx wrote:
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Trygve Vea
>> Sent: 29 November 2016 14:07
>> To: ceph-users <ceph-users@xxxxxxxx>
>> Subject: Regarding loss of heartbeats
>>
>> Since Jewel, we've seen quite a bit of funky behaviour in Ceph. I've
>> written about it a few times on this mailing list.
>>
>> Higher CPU utilization after the upgrade, and loss of heartbeats. We've
>> reviewed our network setup and optimized some potential bottlenecks in a
>> few places.
>>
>> One interesting thing about the lost heartbeats: we have observed OSDs
>> running on the same host losing heartbeats to each other. I'm not sure
>> why they are connected at all (we had some remapped/degraded placement
>> groups over the weekend, which may explain it), but I have a hard time
>> pointing the finger at our network when a heartbeat is lost between two
>> OSDs on the same server.
>>
>> I've been staring myself blind at this problem for a while, and only just
>> noticed a fairly new bug report that I want to believe is related to what
>> I am experiencing: http://tracker.ceph.com/issues/18042
>>
>> One OSD hit its suicide timeout and killed itself last night, and the log
>> shows that several of these missed heartbeats are between OSDs on the
>> same node (zgrep '10.22.9.21.*10.22.9.21' ceph-osd.2.gz):
>>
>> http://employee.tv.situla.bitbit.net/ceph-osd.2.gz
>>
>> Does anyone have any thoughts about this? Are we stumbling on a known,
>> or unknown, bug in Ceph?
>
> Hi Trygve,

Hi Nick!

> I was getting similar things to you after upgrading to 10.2.3, definitely
> seeing problems where OSDs on the same nodes were marking each other out
> while the cluster was fairly idle. I found that it seemed to be caused by
> kernel 4.7; nodes in the same cluster that were on 4.4 were unaffected.
> After downgrading all nodes to 4.4, everything has been really stable for
> me.

I'm not sure this applies to our setup. Our upgrade to Jewel didn't include
a kernel upgrade as far as I recall (and if it did, it was a minor release).
We're running 3.10.0-327.36.3.el7.x86_64, and we follow the latest stable
kernel provided by CentOS 7. We've added the latest hpsa module as provided
by HP to work around a known crash bug in that driver, but nothing special
other than that. The problems started as of Jewel.

--
Trygve
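
For anyone wanting to run the same triage on their own cluster, a minimal
sketch follows. The host address and OSD id are the examples from this
thread, the log name is the rotated gzipped log linked above, and the exact
wording of the heartbeat_check message varies between Ceph releases, so
treat the patterns as starting points rather than exact matches:

    #!/bin/sh
    # Hypothetical values -- substitute your own host address, OSD id
    # and log file.
    HOST_ADDR="10.22.9.21"
    LOG="ceph-osd.2.gz"

    # How many heartbeat failures does this OSD report? The exact message
    # text differs between releases; 'no reply' is the common fragment.
    zgrep -c 'heartbeat_check: no reply' "$LOG"

    # Heartbeat-related lines where both endpoints carry the same host
    # address, i.e. OSDs on one machine missing each other's pings (the
    # zgrep from the thread). Needs debug_osd raised high enough that
    # heartbeat traffic is logged at all.
    zgrep "$HOST_ADDR.*$HOST_ADDR" "$LOG" | grep -i heartbeat | tail -n 20

    # The grace period the OSD applies before reporting a peer dead,
    # queried over the admin socket (osd.2 is just an example id).
    ceph daemon osd.2 config get osd_heartbeat_grace

The point of the same-address grep is the one Trygve makes above: heartbeats
between two OSDs on one host never touch the physical network, so if those
are being missed, the bottleneck is more likely in the OSD process or the
host itself (CPU saturation, blocked threads) than in the switches.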