> -----Original Message-----
> From: Trygve Vea [mailto:trygve.vea@xxxxxxxxxxxxxxxxxx]
> Sent: 29 November 2016 14:36
> To: nick@xxxxxxxxxx
> Cc: ceph-users <ceph-users@xxxxxxxx>
> Subject: Re: Regarding loss of heartbeats
>
> ----- On 29 Nov 2016, at 15:20, Nick Fisk nick@xxxxxxxxxx wrote:
>
> >> -----Original Message-----
> >> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Trygve Vea
> >> Sent: 29 November 2016 14:07
> >> To: ceph-users <ceph-users@xxxxxxxx>
> >> Subject: Regarding loss of heartbeats
> >>
> >> Since Jewel, we've seen quite a bit of funky behaviour in Ceph, and
> >> I've written about it a few times to the mailing list: higher CPU
> >> utilization after the upgrade, and loss of heartbeats. We've looked
> >> at our network setup and optimized some potential bottlenecks in a
> >> few places.
> >>
> >> One interesting thing regarding the loss of heartbeats: we have
> >> observed OSDs running on the same host losing heartbeats against
> >> each other. I'm not sure why they are connected at all (we have had
> >> some remapped/degraded placement groups over the weekend, maybe
> >> that's why) - but I have a hard time pointing the finger at our
> >> network when the heartbeat is lost between two OSDs on the same
> >> server.
> >>
> >> I've been staring myself blind at this problem for a while, and just
> >> now noticed a pretty new bug report that I want to believe is
> >> related to what I am experiencing: http://tracker.ceph.com/issues/18042
> >>
> >> We had one OSD hit its suicide timeout and kill itself off last
> >> night, and one can see that several of these lost heartbeats are
> >> between OSDs on the same node (zgrep '10.22.9.21.*10.22.9.21'
> >> ceph-osd.2.gz):
> >>
> >> http://employee.tv.situla.bitbit.net/ceph-osd.2.gz
> >>
> >> Does anyone have any thoughts about this? Are we stumbling on a
> >> known or unknown bug in Ceph?
> >
> > Hi Trygve,
>
> Hi Nick!
>
> > I was getting similar things after upgrading to 10.2.3 - definitely
> > seeing problems where OSDs on the same nodes were marking each other
> > out while the cluster was fairly idle. I found that it seemed to be
> > caused by kernel 4.7; nodes in the same cluster that were on 4.4
> > were unaffected. After downgrading all nodes to 4.4, everything has
> > been really stable for me.
>
> I'm not sure this applies to our setup. Our upgrade to Jewel didn't
> include a kernel upgrade as far as I recall (and if it did, it was a
> minor release).
>
> We're running 3.10.0-327.36.3.el7.x86_64 and follow the latest stable
> kernel provided by CentOS 7. We've added the latest hpsa module as
> provided by HP to work around a known crash bug in that driver, but
> nothing special other than that.

Yeah, I guess you never know with the CentOS/RH kernels what's in them
compared to the vanilla kernel versions. Still, if you have the
possibility to roll back to a kernel from the start of the year, I
would be really interested to hear whether it has the same effect for
you.

> The problems started as of Jewel.
>
> --
> Trygve

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
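
For reference, the same-host check Trygve describes with zgrep can be run
across all OSD logs on a node along these lines. This is only a rough
sketch under assumptions: it takes the node's address to be 10.22.9.21 (as
in the zgrep above), assumes the logs live in /var/log/ceph, and assumes
failed heartbeats show up as "heartbeat_check" lines in the OSD log; adjust
the paths and patterns for your own environment.

    #!/bin/sh
    # Sketch: count heartbeat_check failures whose source and destination
    # addresses are both on this host (address assumed to be 10.22.9.21).
    HOST_IP=10.22.9.21
    for log in /var/log/ceph/ceph-osd.*.gz; do
        echo "== $log =="
        zgrep 'heartbeat_check' "$log" | grep -c "$HOST_IP.*$HOST_IP"
    done

A non-zero count for a given OSD log means that OSD reported missed
heartbeats from a peer on the same machine, which is the pattern that makes
a pure network explanation hard to sustain.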