> -----Original Message-----
> From: Trygve Vea [mailto:trygve.vea@xxxxxxxxxxxxxxxxxx]
> Sent: 29 November 2016 14:36
> To: nick@xxxxxxxxxx
> Cc: ceph-users <ceph-users@xxxxxxxx>
> Subject: Re: Regarding loss of heartbeats
>
> ----- On 29 Nov 2016, at 15:20, Nick Fisk nick@xxxxxxxxxx wrote:
>
> >> -----Original Message-----
> >> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Trygve Vea
> >> Sent: 29 November 2016 14:07
> >> To: ceph-users <ceph-users@xxxxxxxx>
> >> Subject: Regarding loss of heartbeats
> >>
> >> Since Jewel, we've seen quite a bit of funky behaviour in Ceph, and
> >> I've written about it a few times to the mailing list: higher CPU
> >> utilization after the upgrade, and loss of heartbeats. We've looked
> >> at our network setup and optimized some potential bottlenecks in a
> >> few places.
> >>
> >> One interesting thing regarding the loss of heartbeats: we have
> >> observed OSDs running on the same host losing heartbeats against
> >> each other. I'm not sure why they are connected at all (we have had
> >> some remapped/degraded placement groups over the weekend, maybe
> >> that's why) - but I have a hard time pointing the finger at our
> >> network when the heartbeat is lost between two OSDs on the same
> >> server.
> >>
> >> I've been staring myself blind at this problem for a while, and just
> >> now noticed a pretty new bug report that I want to believe is
> >> related to what I am experiencing: http://tracker.ceph.com/issues/18042
> >>
> >> We had one OSD hit its suicide timeout and kill itself off last
> >> night, and one can see that several of these lost heartbeats are
> >> between OSDs on the same node (zgrep '10.22.9.21.*10.22.9.21'
> >> ceph-osd.2.gz):
> >>
> >> http://employee.tv.situla.bitbit.net/ceph-osd.2.gz
> >>
> >> Does anyone have any thoughts about this? Are we stumbling on a
> >> known or unknown bug in Ceph?
> >
> > Hi Trygve,
>
> Hi Nick!
>
> > I was getting similar things after upgrading to 10.2.3 - definitely
> > seeing problems where OSDs on the same nodes were marking each other
> > out while the cluster was fairly idle. I found that it seemed to be
> > caused by kernel 4.7; nodes in the same cluster that were on 4.4
> > were unaffected. After downgrading all nodes to 4.4, everything has
> > been really stable for me.
>
> I'm not sure this applies to our setup. Our upgrade to Jewel didn't
> include a kernel upgrade as far as I recall (and if it did, it was a
> minor release).
>
> We're running 3.10.0-327.36.3.el7.x86_64 and follow the latest stable
> kernel provided by CentOS 7. We've added the latest hpsa module as
> provided by HP to work around a known crash bug in that driver, but
> nothing special other than that.

Yeah, I guess you never know with the CentOS/RH kernels what's in them
compared to the vanilla kernel versions. Still, if you have the
possibility to roll back to a kernel from the start of the year, I
would be really interested to hear whether it has the same effect for
you.

> The problems started as of Jewel.
>
> --
> Trygve

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
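
For reference, the same-host check Trygve describes with zgrep can be run
across all OSD logs on a node along these lines. This is only a rough
sketch under assumptions: it takes the node's address to be 10.22.9.21 (as
in the zgrep above), assumes the logs live in /var/log/ceph, and assumes
failed heartbeats show up as "heartbeat_check" lines in the OSD log; adjust
the paths and patterns for your own environment.

    #!/bin/sh
    # Sketch: count heartbeat_check failures whose source and destination
    # addresses are both on this host (address assumed to be 10.22.9.21).
    HOST_IP=10.22.9.21
    for log in /var/log/ceph/ceph-osd.*.gz; do
        echo "== $log =="
        zgrep 'heartbeat_check' "$log" | grep -c "$HOST_IP.*$HOST_IP"
    done

A non-zero count for a given OSD log means that OSD reported missed
heartbeats from a peer on the same machine, which is the pattern that makes
a pure network explanation hard to sustain.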