Re: Repeated messages of "heartbeat_check: no heartbeat from"

Hi Sage,

On 10/26/2011 01:06 AM, Sage Weil wrote:
Hi Wido,

Finally got to looking through these logs :( and I see osd.12 pinging lots
of other OSDs, but not osd.0.  The logs don't go back far enough to see
osd.0 requesting heartbeats, though.

Ah, bummer :(


Is this something you are able to reproduce?  A similar log for the
marker-down that extends back to when the process started will have more
info.

No, not really. But I've seen it happen when I restarted my whole cluster and fired up the OSDs rapidly after each other.

Not sure if I'm able to reproduce, but I sure can give it a try.
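If I do, I'll turn the debugging up in ceph.conf before starting the daemons, so the logs cover the whole startup this time. Something like this in the [osd] section should do (assuming debug 20 is verbose enough for the heartbeat path):

[osd]
      debug ms = 20
      debug osd = 20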

Wido


Thanks!
sage



On Mon, 17 Oct 2011, Wido den Hollander wrote:

Hi,

On 10/14/2011 05:16 PM, Sage Weil wrote:
On Fri, 14 Oct 2011, Wido den Hollander wrote:
Hi,

I'm seeing this behavior again. In the monitor messages this started to
appear:

2011-10-14 16:22:19.853752   log 2011-10-14 16:22:14.887881 mon.0
[2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 3281 : [INF] osd.12
[2a00:f10:113:1:225:90ff:fe33:49b0]:6800/8023 failed (by osd.0
[2a00:f10:113:1:225:90ff:fe33:49fe]:6800/10678)

So, osd.0 is telling us that osd.12 is down, but the process is running
and seems alive.

The logs are saying:

Oct 14 16:20:52 atom0 osd.0[10679]: 7f81deeac700 osd.0 511
heartbeat_check: no heartbeat from osd.12 since 2011-10-14
16:00:07.636627 (cutoff 2011-10-14 16:20:32.025154)
Oct 14 16:20:52 atom0 osd.0[10679]: 7f81d2d93700 osd.0 511
heartbeat_check: no heartbeat from osd.12 since 2011-10-14
16:00:07.636627 (cutoff 2011-10-14 16:20:32.108369)
Oct 14 16:20:53 atom0 osd.0[10679]: 7f81deeac700 osd.0 511
heartbeat_check: no heartbeat from osd.12 since 2011-10-14
16:00:07.636627 (cutoff 2011-10-14 16:20:33.025462)

While osd.12 reports receiving messages from osd.0 (note that the "since" timestamp above never advances past 16:00:07):

Oct 14 16:20:54 atom3 osd.12[8024]: 7fa3621bb700 --
[2a00:f10:113:1:225:90ff:fe33:49b0]:0/8024 <== osd.0
[2a00:f10:113:1:225:90ff:fe33:49fe]:6813/10678 1317 ====
osd_ping(heartbeat e511 as_of 355) v1 ==== 61+0+0 (2126910847 0 0)
0x1e281c0 con 0x1f02c80

I see osd.0 sending pg_info messages:

Oct 14 16:20:58 atom0 osd.0[10679]: 7f81d769d700 --
[2a00:f10:113:1:225:90ff:fe33:49fe]:6812/10678 --> osd.12
[2a00:f10:113:1:225:90ff:fe33:49b0]:6801/8023 -- pg_info(1 pgs e512) v1
-- ?+0 0x3bf1e00

And osd.12 is also sending stuff back:

Oct 14 16:21:06 atom3 osd.12[8024]: 7fa3629bc700 --
[2a00:f10:113:1:225:90ff:fe33:49b0]:6801/8023 --> osd.0
[2a00:f10:113:1:225:90ff:fe33:49fe]:6812/10678 -- pg_info(1 pgs e518) v1
-- ?+0 0x6f3e540

I'm running the latest version, but I did do a complete power cycle of my
cluster this morning. These messages started appearing right after that
power cycle.

Any ideas?
Can you post full osd logs for a pair where there was some marking-down
occurring?  I'll take a look.

Sure! Attached as a gzip.

I'm using remote syslog, so the logs might not be in the format you expect. Let me
know if so!

I used "ceph osd tell 0 injectargs '--debug-ms 20'" to set the debugging
higher.
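
In hindsight it would probably have been better to raise the debugging on both ends of the failing pair at once, to make the two sides easier to correlate, e.g.:

$ ceph osd tell 0 injectargs '--debug-ms 20 --debug-osd 20'
$ ceph osd tell 12 injectargs '--debug-ms 20 --debug-osd 20'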

What I noticed was this:

$ tail -f osd.log |grep "osd\.0"|grep "osd\.12"

Oct 17 17:48:33 monitor mon.dga[560]: 7fa5fc520700 log [INF] : osd.12
[2a00:f10:113:1:225:90ff:fe33:49b0]:6800/8023 failed (by osd.0
[2a00:f10:113:1:225:90ff:fe33:49fe]:6800/10678)
Oct 17 17:48:33 monitor mon.dga[560]: 2011-10-17 17:48:33.930666 mon.0
[2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 2440 : [INF] osd.12
[2a00:f10:113:1:225:90ff:fe33:49b0]:6800/8023 failed (by osd.0
[2a00:f10:113:1:225:90ff:fe33:49fe]:6800/10678)
Oct 17 17:48:36 atom3 osd.12[8024]: 7fa356039700 --
[2a00:f10:113:1:225:90ff:fe33:49b0]:0/8024 >>
[2a00:f10:113:1:225:90ff:fe33:49fe]:6813/10678 pipe(0x6e0e000 sd=20 pgs=133
cs=1 l=0).reader got envelope type=70 src osd.0 front=61 data=0 off 0
Oct 17 17:48:36 atom3 osd.12[8024]: 7fa3621bb700 --
[2a00:f10:113:1:225:90ff:fe33:49b0]:0/8024 <== osd.0
[2a00:f10:113:1:225:90ff:fe33:49fe]:6813/10678 278150 ==== osd_ping(heartbeat
e577 as_of 355) v1 ==== 61+0+0 (2665319624 0 0) 0x52e1540 con 0x1f02c80

$ tail -f osd.log |grep "osd\.0"|grep "osd\.15" (a random other OSD, for comparison)

Oct 17 17:49:29 atom3 osd.15[10295]: 7fd2ee88c700 --
[2a00:f10:113:1:225:90ff:fe33:49b0]:0/10295 <== osd.0
[2a00:f10:113:1:225:90ff:fe33:49fe]:6813/10678 277738 ==== osd_ping(heartbeat
e577 as_of 404) v1 ==== 61+0+0 (1619141758 0 0) 0x33568c0 con 0x56a0c80
Oct 17 17:49:33 atom3 osd.15[10295]: 7fd2ee88c700 --
[2a00:f10:113:1:225:90ff:fe33:49b0]:0/10295 <== osd.0
[2a00:f10:113:1:225:90ff:fe33:49fe]:6813/10678 277742 ==== osd_ping(heartbeat
e577 as_of 404) v1 ==== 61+0+0 (116581228 0 0) 0x6739c40 con 0x56a0c80
Oct 17 17:49:34 atom0 osd.0[10679]: 7f81c91ff700 --
[2a00:f10:113:1:225:90ff:fe33:49fe]:0/10679 >>
[2a00:f10:113:1:225:90ff:fe33:49b0]:6811/10294 pipe(0x4bbda00 sd=87 pgs=19
cs=1 l=0).reader got envelope type=70 src osd.15 front=61 data=0 off 0
Oct 17 17:49:34 atom0 osd.0[10679]: 7f81d6e9c700 --
[2a00:f10:113:1:225:90ff:fe33:49fe]:0/10679 <== osd.15
[2a00:f10:113:1:225:90ff:fe33:49b0]:6811/10294 278117 ==== osd_ping(heartbeat
e577 as_of 404) v1 ==== 61+0+0 (116581228 0 0) 0x2fd5c40 con 0x5a6cb40

If you look at the log you'll see that osd.0 is reporting that osd.12 and
osd.39 are down.

The logs show me that osd.0 is never communicating with osd.12 or osd.39;
it simply reports them as down.
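
A quick way to check both directions, given the syslog format above (the receiving daemon logs a '<==' line for every osd_ping it gets):

$ grep 'osd\.12\[' osd.log | grep osd_ping | grep '<== osd\.0'
$ grep 'osd\.0\[' osd.log | grep osd_ping | grep '<== osd\.12'

The first keeps matching, and the second should show whether osd.0 ever registers a ping coming back from osd.12.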

The last time I saw this it was exactly the same: one OSD kept repeating that
two other OSDs were down.

My guess is that osd.0 is the problem here.
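
If that's the case, bouncing just osd.0 should make the bogus failure reports stop; I might try something like this (assuming the stock init script accepts a daemon spec):

$ /etc/init.d/ceph restart osd.0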


Thanks!
sage

Wido

On Wed, 2011-08-10 at 12:30 +0200, Wido den Hollander wrote:
Hi,

I tried reproducing this today with the new heartbeat code, but couldn't.

The new heartbeat code seems to work much better under load, I haven't
seen any bouncing like I saw before.

Wido

On Fri, 2011-08-05 at 14:11 +0200, Wido den Hollander wrote:
Hi,

I'm seeing something weird which I don't want to report as an issue in
the tracker yet.

I suffered a pretty big crash which caused me to lose 15 of my 40 OSDs.
Some died because my WD Green 2TB disks failed, others because btrfs
didn't survive the hard reboot I had to give them.

When watching 'ceph -w' I noticed this:

2011-08-05 13:32:03.209958    pg v504313: 8008 pgs: 1 inactive, 334
active+clean, 10 peering, 1 crashed+peering, 14 down+peering, 7
active+clean+degraded, 7624 crashed+down+peering, 8 down+degraded+peering, 9
crashed+down+degraded+peering; 4417 GB data, 13121 GB used, 32504 GB /
46575 GB avail; 8255/3391107 degraded (0.243%)
2011-08-05 13:32:03.225712   mds e6033: 1/1/1 up {0=dga=up:replay(laggy or crashed)}
2011-08-05 13:32:03.225747   osd e9934: 40 osds: 25 up, 25 in
2011-08-05 13:32:03.225874   log 2011-08-05 13:32:01.646596 mon0
[2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 29796 : [INF] osd37
[2a00:f10:113:1:225:90ff:fe33:49a4]:6803/17881 failed (by osd7
[2a00:f10:113:1:225:90ff:fe32:cf64]:6806/2746)
2011-08-05 13:32:03.225964   mon e1: 1 mons at
{dga=[2a00:f10:113:1:230:48ff:fed3:b086]:6789/0}
2011-08-05 13:32:03.929650   log 2011-08-05 13:32:03.603681 mon0
[2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 29797 : [INF] osd19
[2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 failed (by osd5
[2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506)

Here you see that osd7 is reporting that osd37 is down and osd5 is
reporting that osd19 is down.

These messages keep being repeated, over and over. My 'debug ms' is at
1, so I took a look at the logs:

Aug  5 13:32:41 atom1 osd.5[2507]: 7f75227fc700 osd5 9963
heartbeat_check: no heartbeat from osd19 since 2011-08-04
17:01:02.504942 (cutoff 2011-08-05 13:32:21.916364)
Aug  5 13:32:42 atom1 osd.5[2507]: 7f7531579700 osd5 9963
heartbeat_check: no heartbeat from osd19 since 2011-08-04
17:01:02.504942 (cutoff 2011-08-05 13:32:22.381632)
Aug  5 13:32:43 atom1 osd.5[2507]: 7f7531579700 osd5 9965
heartbeat_check: no heartbeat from osd19 since 2011-08-04
17:01:02.504942 (cutoff 2011-08-05 13:32:23.474198)
Aug  5 13:32:44 atom1 osd.5[2507]: 7f75227fc700 osd5 9965
heartbeat_check: no heartbeat from osd19 since 2011-08-04
17:01:02.504942 (cutoff 2011-08-05 13:32:24.027161)
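
If I understand the mechanism correctly, the cutoff is simply the current time minus the heartbeat grace, which is 20 seconds by default, so osd5 is saying that the last ping it registered from osd19 is almost a day older than the cutoff. The grace is tunable in the [osd] section (shown with its default, as far as I know):

[osd]
      osd heartbeat grace = 20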

osd5 keeps claiming that osd19 is down, but when I check the logs I see
that osd19 is running and exchanging heartbeat messages with the other
OSDs in the cluster.

What seems odd to me is that there are no log messages from osd7
claiming that osd37 is down, although the failure report keeps being
repeated by 'ceph -w'.

$ ceph osd injectargs 7 '--debug-ms 1'
$ ceph osd injectargs 37 '--debug-ms 1'

I made sure that was injected, but still, no log messages.

Going through the logs a bit further (after setting the logging to 20)
I saw osd.19 reporting:

Aug  5 13:48:23 atom4 osd.19[2741]: 7f634ca8e700 --
[2a00:f10:113:1:225:90ff:fe33:49cc]:6810/2740 >>
[2a00:f10:113:1:225:90ff:fe32:cf64]:6808/2506 pipe(0x7f64e4193ec0 sd=56
pgs=1530 cs=1 l=0).reader got envelope type=70 src osd5 front=61 data=0
off 0

It seems that osd19 is communicating with osd5, but osd5 does not seem
to communicate with osd19?

osd5's log also confirms that it keeps reporting osd19 as failed to the monitor:

Aug  5 14:10:05 atom1 osd.5[2507]: 7f7531579700 --
[2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506 -->
[2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 -- osd_failure(osd19
[2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e10020 v10020) v1 -- ?+0
0x6948000 con 0x1cd7710
Aug  5 14:10:05 atom1 osd.5[2507]: 7f7531579700 --
[2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506 submit_message
osd_failure(osd19 [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e10020
v10020) v1 remote [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0
Aug  5 14:10:05 atom1 osd.5[2507]: 7f75365a1700 --
[2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506 >>
[2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 pipe(0x1cd74a0 sd=13 pgs=178
cs=1 l=1).writer encoding 17093 0x6948000 osd_failure(osd19
[2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e10020 v10020) v1

Now, to prevent my cluster from bouncing around (I tried, at least...),
I placed this in the OSD section:

      osd min down reporters = 3
      osd min down reports = 5

Since only osd5 and osd7 are sending out these reports, neither osd19
nor osd37 gets marked down, but what is going on here?

I haven't restarted the OSDs yet.

I noticed that the OSD heartbeat code is currently being revised, but I
thought it might be helpful to report this anyway.

Wido
