Re: Repeated messages of "heartbeat_check: no heartbeat from"

Wido den Hollander <wido@xxxxxxxxx> · Tue, 28 Feb 2012 16:42:03 +0100

Hi,

On 02/24/2012 06:18 AM, Gregory Farnum wrote:
On Thu, Feb 23, 2012 at 2:45 AM, Wido den Hollander <wido@xxxxxxxxx> wrote:
Hi,

On 02/22/2012 07:08 PM, Gregory Farnum wrote:

Wido,
Sorry we lost track of this last week — we were all distracted by FAST 12!
:)

No problem!

So it looks like they're both on the same map and osd.4 is sending
pings to osd.19, but osd.19 is just ignoring them? Or do you really
have debug_os on and not debug_osd? :)

That was a typo, I have debug_osd set to 20.

I haven't rebooted the OSDs since, and now osd.4 and osd.19 are not
complaining anymore, but a different set of OSDs is now saying
the other one is down.

I'm still running v0.41 btw. I'm not going to touch the cluster until this
one is tracked down, since it keeps coming back.

Suggestions?

Well, like Sage said long ago, this will be easiest to diagnose if
there are logs available for both OSDs that cover the entire time
after one requested heartbeats from the other.

If you do have these and can post them somewhere, I'm sure Sage or I
will find it interesting enough to look through...  ;)
If not, I'm out of ideas, although I'm not super-familiar with the
heartbeat code since Sage rewrote it, so we may be able to come up with
something if we discuss it more.
-Greg

I created an issue for this with logs attached: 
http://tracker.newdream.net/issues/2116

Thanks,

Wido
