Re: Regarding loss of heartbeats

----- On 29 Nov 2016 at 15:20, Nick Fisk nick@xxxxxxxxxx wrote:
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Trygve
>> Vea
>> Sent: 29 November 2016 14:07
>> To: ceph-users <ceph-users@xxxxxxxx>
>> Subject:  Regarding loss of heartbeats
>> 
>> Since Jewel, we've seen quite a bit of funky behaviour in Ceph.  I've written
>> about it a few times on the mailing list.
>> 
>> Higher CPU utilization after the upgrade / loss of heartbeats.  We've looked at
>> our network setup and optimized some potential bottlenecks in a few places.
>> 
>> Interesting thing regarding loss of heartbeats: we have observed OSDs running
>> on the same host losing heartbeats against each other.  I'm not sure why they
>> are connected at all (we have had some remapped/degraded placement groups over
>> the weekend, maybe that's why) - but I have a hard time pointing the finger at
>> our network when the heartbeat is lost between two OSDs on the same server.
>> 
>> 
>> I've been staring myself blind at this problem for a while, and just now noticed
>> a pretty new bug report that I want to believe is related to what I am
>> experiencing: http://tracker.ceph.com/issues/18042
>> 
>> We had one OSD hit a suicide timeout value and kill itself off last night, and
>> one can see that several of these heartbeats are between OSDs on the same node.
>> (zgrep '10.22.9.21.*10.22.9.21' ceph-osd.2.gz)
>> 
>> http://employee.tv.situla.bitbit.net/ceph-osd.2.gz
>> 
>> 
>> Does anyone have any thoughts about this?  Are we stumbling on a known or
>> unknown bug in Ceph?
> 
> Hi Trygve,

Hi Nick!

> I was getting similar things to you after upgrading to 10.2.3, definitely seeing
> problems where OSDs on the same nodes were marking each other out even though the
> cluster was fairly idle. I found that it seemed to be caused by kernel 4.7; nodes
> in the same cluster that were on 4.4 were unaffected. After downgrading all nodes
> to 4.4, everything has been really stable for me.

I'm not sure if this can apply to our setup.  Our upgrade to Jewel didn't include a kernel upgrade as far as I recall (and if it did, it was a minor release).

We're running 3.10.0-327.36.3.el7.x86_64 and follow the latest stable kernel provided by CentOS 7.  We've added the latest hpsa module as provided by HP to work around a known crash bug in that driver, but nothing special other than that.
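
For what it's worth, a quick way to confirm that all OSD nodes are on the same kernel (the hostnames below are just placeholders, not our real node names):

  for h in osd-host-1 osd-host-2 osd-host-3; do ssh "$h" uname -r; done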

The problems started as of Jewel.
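
For reference, the log check I referred to above was roughly the following - just a rough sketch against the uploaded ceph-osd.2.gz; the exact message strings may differ slightly between releases:

  # heartbeat failures where both the reporting OSD and the peer sit on 10.22.9.21
  zgrep 'heartbeat_check: no reply' ceph-osd.2.gz | grep '10.22.9.21.*10.22.9.21'

  # the suicide timeout that took the OSD down last night
  zgrep 'had suicide timed out' ceph-osd.2.gz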


-- 
Trygve
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


