2012/2/28 Székelyi Szabolcs <szekelyi@xxxxxxx>:
> On 2012. February 27. 09:03:11 Sage Weil wrote:
>> On Mon, 27 Feb 2012, Székelyi Szabolcs wrote:
>> > whenever I restart osd.0 I see a pair of messages like
>> >
>> > 2012-02-27 17:26:00.132666 mon.0 <osd_1_ip>:6789/0 106 : [INF] osd.0
>> > <osd_0_ip>:6801/29931 failed (by osd.1 <osd_1_ip>:6806/20125)
>> > 2012-02-27 17:26:21.074926 osd.0 <osd_0_ip>:6801/29931 1 : [WRN] map e370
>> > wrongly marked me down or wrong addr
>> >
>> > a couple of times. The situation stabilizes in a normal state after
>> > about two minutes.
>> >
>> > Should I worry about this? Maybe the first message is about the just
>> > killed OSD, and the second comes from the new incarnation, and this is
>> > completely normal? This is Ceph 0.41.
>>
>> It's not normal. Wido was seeing something similar, I think. I suspect
>> the problem is that during startup ceph-osd is just busy, but the heartbeat
>> code is such that it's not supposed to miss heartbeats.
>>
>> Can you reproduce this with 'debug ms = 1'?
>
> Yes, I managed to. Output of ceph -w attached (with IP addresses mangled). My
> setup is 3 nodes, node 1 and 2 running OSD, MDS and MON, node 3 running MON
> only. I also have the logs from all nodes in case you need it.

Yes, please. Just the cluster state is not very helpful -- we want to see
why the OSDs are marking each other down, not when. :)
-Greg
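
For anyone following along, a minimal ceph.conf sketch of the 'debug ms = 1'
setting discussed above (the [osd] section placement and the /var/log/ceph
log location are the usual defaults, not details taken from this thread):

    [osd]
        # raise messenger (network) logging so heartbeat traffic is recorded
        debug ms = 1

After restarting the OSDs (or injecting the option at runtime), the extra
messenger detail ends up in the per-daemon log files under /var/log/ceph/ on
each node, which is the sort of log being asked for here.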