On 02/27/2012 06:03 PM, Sage Weil wrote:
On Mon, 27 Feb 2012, Székelyi Szabolcs wrote:
Hello,
whenever I restart osd.0 I see a pair of messages like
2012-02-27 17:26:00.132666 mon.0 <osd_1_ip>:6789/0 106 : [INF] osd.0 <osd_0_ip>:6801/29931 failed (by osd.1 <osd_1_ip>:6806/20125)
2012-02-27 17:26:21.074926 osd.0 <osd_0_ip>:6801/29931 1 : [WRN] map e370 wrongly marked me down or wrong addr
a couple of times. The situation stabilizes in a normal state after about two
minutes.
Should I worry about this? Maybe the first message is about the just killed
OSD, and the second comes from the new incarnation, and this is completely
normal? This is Ceph 0.41.
It's not normal. Wido was seeing something similar, I think. I suspect
the problem is that ceph-osd is just busy during startup, although the
heartbeat code is written so that it's not supposed to miss heartbeats.
I haven't seen the 'wrongly marked me down' messages; I'm just seeing that
'pairs' of OSDs are marking each other down.
Still trying to figure that one out.
Can you reproduce this with 'debug ms = 1'?
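For reference, one way to turn that on is in ceph.conf. This is only a minimal sketch: placing the option under the [osd] section and restarting the OSD afterward are assumptions about this setup, not something stated in the thread.

```ini
; ceph.conf fragment (sketch; section placement is an assumption)
[osd]
    ; log messenger-level traffic, including heartbeat messages
    debug ms = 1
```

With this set, the OSD logs should show the messages exchanged around the time of the "failed" report.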
sage