Re: how to troubleshoot "heartbeat_check: no reply" in OSD log

Tristan Le Toullec <tristan.letoullec@xxxxxxx> · Thu, 14 Dec 2017 11:31:23 +0100

    Hi Jared,

          did you find a solution to your problem? It appears that
      I have the same OSD problem, and tcpdump captures did not point
      to any solution.

    All OSD nodes produce log lines like:

      2017-12-14 11:25:11.756552 7f0cc5905700 -1 osd.49 29546
      heartbeat_check: no reply from 172.16.5.155:6817 osd.46 since back
      2017-12-14 11:24:44.252310 front 2017-12-14 11:24:44.252310
      (cutoff 2017-12-14 11:24:51.756201)

      2017-12-14 11:25:11.756558 7f0cc5905700 -1 osd.49 29546
      heartbeat_check: no reply from 172.16.5.155:6815 osd.48 since back
      2017-12-14 11:24:44.252310 front 2017-12-14 11:24:44.252310
      (cutoff 2017-12-14 11:24:51.756201)

      2017-12-14 11:25:11.756564 7f0cc5905700 -1 osd.49 29546
      heartbeat_check: no reply from 172.16.5.156:6805 osd.50 since back
      2017-12-14 11:24:44.252310 front 2017-12-14 11:24:44.252310
      (cutoff 2017-12-14 11:24:51.756201)

    Sometimes the OSD process was shut down and respawned, sometimes
    it just shut down.

    We use Ubuntu 14.04 (one node is on 16.04) and Ceph version
    10.2.10.

    Thanks

    Tristan

    On Fri, Jul 28, 2017 at 6:06 AM, Jared Watts <Jared.Watts at quantum.com> wrote:
> I’ve got a cluster where a bunch of OSDs are down/out (only 6/21 are up/in).
> ceph status and ceph osd tree output can be found at:
>
> https://gist.github.com/jbw976/24895f5c35ef0557421124f4b26f6a12
>
>
>
> In osd.4 log, I see many of these:
>
> 2017-07-27 19:38:53.468852 7f3855c1c700 -1 osd.4 120 heartbeat_check: no
> reply from 10.32.0.3:6807 osd.15 ever on either front or back, first ping
> sent 2017-07-27 19:37:40.857220 (cutoff 2017-07-27 19:38:33.468850)
>
> 2017-07-27 19:38:53.468881 7f3855c1c700 -1 osd.4 120 heartbeat_check: no
> reply from 10.32.0.3:6811 osd.16 ever on either front or back, first ping
> sent 2017-07-27 19:37:40.857220 (cutoff 2017-07-27 19:38:33.468850)
>
>
>
> From osd.4, those endpoints look reachable:
>
> / # nc -vz 10.32.0.3 6807
>
> 10.32.0.3 (10.32.0.3:6807) open
>
> / # nc -vz 10.32.0.3 6811
>
> 10.32.0.3 (10.32.0.3:6811) open
>
>
>
> What else can I look at to determine why most of the OSDs cannot
> communicate?  http://tracker.ceph.com/issues/16092 indicates this behavior
> is a networking or hardware issue; what else can I check there?  I can turn
> on extra logging as needed.  Thanks!

Do a packet capture on both machines at the same time and verify the
packets are arriving as expected.
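
To decide what to capture, the peer addresses and ports can be pulled out
of the heartbeat_check lines first. The following is only a minimal sketch,
assuming the log format quoted in this thread; the names HEARTBEAT_RE,
unresponsive_peers and capture_filter are hypothetical helpers, not part of
Ceph. The string it builds uses ordinary pcap-filter syntax.

```python
import re
from collections import defaultdict

# Matches both heartbeat_check variants quoted in this thread
# ("since back ... front ..." and "ever on either front or back").
HEARTBEAT_RE = re.compile(
    r"(?P<reporter>osd\.\d+) \d+ heartbeat_check: no reply from "
    r"(?P<ip>\d+\.\d+\.\d+\.\d+):(?P<port>\d+) (?P<peer>osd\.\d+)"
)


def unresponsive_peers(log_text):
    """Map each peer IP to the set of (port, osd name) pairs reported as silent."""
    # Log lines pasted into mail are wrapped, so collapse whitespace first.
    flat = " ".join(log_text.split())
    peers = defaultdict(set)
    for m in HEARTBEAT_RE.finditer(flat):
        peers[m.group("ip")].add((int(m.group("port")), m.group("peer")))
    return dict(peers)


def capture_filter(ip, ports):
    """Build a pcap-filter expression for one peer host and its reported ports."""
    port_expr = " or ".join(f"tcp port {p}" for p in sorted(ports))
    return f"host {ip} and ({port_expr})"


if __name__ == "__main__":
    sample = """
    2017-12-14 11:25:11.756552 7f0cc5905700 -1 osd.49 29546
    heartbeat_check: no reply from 172.16.5.155:6817 osd.46 since back
    2017-12-14 11:24:44.252310 front 2017-12-14 11:24:44.252310
    (cutoff 2017-12-14 11:24:51.756201)
    """
    for ip, entries in unresponsive_peers(sample).items():
        print(capture_filter(ip, {port for port, _ in entries}))
```

The printed expression can then be given to tcpdump on both hosts at the
same time (for example with -nn and -i for the relevant interface), so the
two captures can be compared packet by packet. Note that each log line shows
only one address per peer, so a filter built this way may not cover every
heartbeat connection between the two OSDs.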

>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
