2012/2/28 Székelyi Szabolcs <szekelyi@xxxxxxx>:
> On 2012. February 27. 09:03:11 Sage Weil wrote:
>> On Mon, 27 Feb 2012, Székelyi Szabolcs wrote:
>> > whenever I restart osd.0 I see a pair of messages like
>> >
>> > 2012-02-27 17:26:00.132666 mon.0 <osd_1_ip>:6789/0 106 : [INF] osd.0
>> > <osd_0_ip>:6801/29931 failed (by osd.1 <osd_1_ip>:6806/20125)
>> > 2012-02-27 17:26:21.074926 osd.0 <osd_0_ip>:6801/29931 1 : [WRN] map e370
>> > wrongly marked me down or wrong addr
>> >
>> > a couple of times. The situation stabilizes in a normal state after
>> > about two minutes.
>> >
>> > Should I worry about this? Maybe the first message is about the just
>> > killed OSD, and the second comes from the new incarnation, and this is
>> > completely normal? This is Ceph 0.41.
>>
>> It's not normal. Wido was seeing something similar, I think. I suspect
>> the problem is that during startup ceph-osd is just busy, but the heartbeat
>> code is such that it's not supposed to miss heartbeats.
>>
>> Can you reproduce this with 'debug ms = 1'?
>
> Yes, I managed to. Output of ceph -w attached (with IP addresses mangled). My
> setup is 3 nodes, node 1 and 2 running OSD, MDS and MON, node 3 running MON
> only. I also have the logs from all nodes in case you need it.

Yes, please. Just the cluster state is not very helpful -- we want to see
why the OSDs are marking each other down, not when. :)
-Greg
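
For anyone following along, a minimal ceph.conf sketch of the 'debug ms = 1'
setting discussed above (the [osd] section placement and the /var/log/ceph
log location are the usual defaults, not details taken from this thread):

    [osd]
        # raise messenger (network) logging so heartbeat traffic is recorded
        debug ms = 1

After restarting the OSDs (or injecting the option at runtime), the extra
messenger detail ends up in the per-daemon log files under /var/log/ceph/ on
each node, which is the sort of log being asked for here.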