Re: how to troubleshoot "heartbeat_check: no reply" in OSD log

Hi Jared,
    did you find a solution to your problem? It appears that I have the same OSD problem, and my tcpdump captures have not pointed to a cause.

All OSD nodes produce log lines like these:

2017-12-14 11:25:11.756552 7f0cc5905700 -1 osd.49 29546 heartbeat_check: no reply from 172.16.5.155:6817 osd.46 since back 2017-12-14 11:24:44.252310 front 2017-12-14 11:24:44.252310 (cutoff 2017-12-14 11:24:51.756201)
2017-12-14 11:25:11.756558 7f0cc5905700 -1 osd.49 29546 heartbeat_check: no reply from 172.16.5.155:6815 osd.48 since back 2017-12-14 11:24:44.252310 front 2017-12-14 11:24:44.252310 (cutoff 2017-12-14 11:24:51.756201)
2017-12-14 11:25:11.756564 7f0cc5905700 -1 osd.49 29546 heartbeat_check: no reply from 172.16.5.156:6805 osd.50 since back 2017-12-14 11:24:44.252310 front 2017-12-14 11:24:44.252310 (cutoff 2017-12-14 11:24:51.756201)
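
To make lines like these easier to compare across nodes, here is a minimal Python sketch that pulls the fields out of one "since back ... front ..." heartbeat_check line and computes two intervals: how long the peer has been silent, and the gap between the log timestamp and the cutoff. The function name, the regular expression, and the field names are my own assumptions for illustration, not anything shipped with Ceph. In the lines above the gap to the cutoff is about 20 seconds, which I believe corresponds to the default osd_heartbeat_grace.

```python
import re
from datetime import datetime

# Timestamp format used in the OSD log lines quoted above.
TS = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+"

# Hypothetical pattern, written only against the "since back ... front ..."
# form of the message shown in this thread.
LINE_RE = re.compile(
    rf"^(?P<logged>{TS}) \S+ -1 (?P<reporter>osd\.\d+) \d+ "
    rf"heartbeat_check: no reply from (?P<addr>\S+) (?P<peer>osd\.\d+) "
    rf"since back (?P<back>{TS}) front (?P<front>{TS}) "
    rf"\(cutoff (?P<cutoff>{TS})\)$"
)


def _ts(text):
    """Parse a log timestamp into a datetime."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")


def parse_heartbeat_line(line):
    """Return a dict of fields for one heartbeat_check line, or None."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    logged = _ts(m["logged"])
    oldest_reply = min(_ts(m["back"]), _ts(m["front"]))
    return {
        "reporter": m["reporter"],
        "peer": m["peer"],
        "addr": m["addr"],
        # Seconds since the older of the last back/front replies.
        "silent_s": (logged - oldest_reply).total_seconds(),
        # Seconds between the log timestamp and the cutoff.
        "grace_s": (logged - _ts(m["cutoff"])).total_seconds(),
    }


SAMPLE = (
    "2017-12-14 11:25:11.756552 7f0cc5905700 -1 osd.49 29546 heartbeat_check: "
    "no reply from 172.16.5.155:6817 osd.46 since back 2017-12-14 11:24:44.252310 "
    "front 2017-12-14 11:24:44.252310 (cutoff 2017-12-14 11:24:51.756201)"
)

if __name__ == "__main__":
    print(parse_heartbeat_line(SAMPLE))
```

Running this over the logs of every node and grouping by peer should show whether the silent peers cluster on particular hosts or addresses.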

Sometimes the OSD process shuts down and respawns; sometimes it just shuts down.

We use Ubuntu 14.04 (one node is on 16.04) and Ceph version 10.2.10.

Thanks
Tristan





On Fri, Jul 28, 2017 at 6:06 AM, Jared Watts <Jared.Watts at quantum.com> wrote:
> I’ve got a cluster where a bunch of OSDs are down/out (only 6/21 are up/in).
> ceph status and ceph osd tree output can be found at:
>
> https://gist.github.com/jbw976/24895f5c35ef0557421124f4b26f6a12
>
>
>
> In osd.4 log, I see many of these:
>
> 2017-07-27 19:38:53.468852 7f3855c1c700 -1 osd.4 120 heartbeat_check: no
> reply from 10.32.0.3:6807 osd.15 ever on either front or back, first ping
> sent 2017-07-27 19:37:40.857220 (cutoff 2017-07-27 19:38:33.468850)
>
> 2017-07-27 19:38:53.468881 7f3855c1c700 -1 osd.4 120 heartbeat_check: no
> reply from 10.32.0.3:6811 osd.16 ever on either front or back, first ping
> sent 2017-07-27 19:37:40.857220 (cutoff 2017-07-27 19:38:33.468850)
>
>
>
> From osd.4, those endpoints look reachable:
>
> / # nc -vz 10.32.0.3 6807
>
> 10.32.0.3 (10.32.0.3:6807) open
>
> / # nc -vz 10.32.0.3 6811
>
> 10.32.0.3 (10.32.0.3:6811) open
>
>
>
> What else can I look at to determine why most of the OSDs cannot
> communicate?  http://tracker.ceph.com/issues/16092 indicates this behavior
> is a networking or hardware issue, what else can I check there?  I can turn
> on extra logging as needed.  Thanks!

Do a packet capture on both machines at the same time and verify the
packets are arriving as expected.
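
Before setting up captures, the nc checks quoted above can be repeated from every node against every heartbeat endpoint. This is a rough Python sketch of such a probe; the function names are hypothetical, and the endpoints in the example are simply the addresses from the log lines earlier in this thread. A successful connect only shows that the port accepts TCP connections from the probing host. It does not show that heartbeat replies make it back, so it does not replace the capture on both machines.

```python
import socket


def tcp_probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def probe_all(endpoints, timeout=2.0):
    """Probe each (host, port) pair and map "host:port" to the result."""
    return {f"{host}:{port}": tcp_probe(host, port, timeout)
            for host, port in endpoints}


if __name__ == "__main__":
    # Endpoints taken from the heartbeat_check lines in this thread.
    endpoints = [
        ("172.16.5.155", 6817),
        ("172.16.5.155", 6815),
        ("172.16.5.156", 6805),
    ]
    for endpoint, is_open in probe_all(endpoints).items():
        print(endpoint, "open" if is_open else "unreachable")
```

If a port is open from some nodes but not from others, that narrows down where to run the capture.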

>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>