On 2012. February 28. 08:16:34 Gregory Farnum wrote:
> 2012/2/28 Székelyi Szabolcs <szekelyi@xxxxxxx>:
> > On 2012. February 27. 09:03:11 Sage Weil wrote:
> >> On Mon, 27 Feb 2012, Székelyi Szabolcs wrote:
> >> > Whenever I restart osd.0 I see a pair of messages like
> >> >
> >> > 2012-02-27 17:26:00.132666 mon.0 <osd_1_ip>:6789/0 106 : [INF]
> >> > osd.0 <osd_0_ip>:6801/29931 failed (by osd.1 <osd_1_ip>:6806/20125)
> >> > 2012-02-27 17:26:21.074926 osd.0 <osd_0_ip>:6801/29931 1 : [WRN]
> >> > map e370 wrongly marked me down or wrong addr
> >> >
> >> > a couple of times. The situation stabilizes in a normal state
> >> > after about two minutes.
> >> >
> >> > Should I worry about this? Maybe the first message is about the
> >> > just-killed OSD, and the second comes from the new incarnation,
> >> > and this is completely normal? This is Ceph 0.41.
> >>
> >> It's not normal. Wido was seeing something similar, I think. I
> >> suspect the problem is that during startup ceph-osd is just busy,
> >> but the heartbeat code is such that it's not supposed to miss them.
> >>
> >> Can you reproduce this with 'debug ms = 1'?
> >
> > Yes, I managed to. Output of ceph -w is attached (with IP addresses
> > mangled). My setup is 3 nodes: nodes 1 and 2 running an OSD, an MDS,
> > and a MON each, and node 3 running a MON only. I also have the logs
> > from all nodes in case you need them.
>
> Yes, please. Just the cluster state is not very helpful -- we want to
> see why the OSDs are marking each other down, not when. :)

Okay, it was a firewall issue. The port range that was allowed to
reach the OSDs didn't include a number of necessary ports. It started
working after a while because I also had a rule like

    -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT

So osd.1 could not talk to osd.0 after a restart (because of the wrong
port range), only after osd.0 started talking to osd.1 (because of the
-m state rule). Sorry for the noise.

--
cc
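
(For reference, 'debug ms = 1' is a ceph.conf option; a minimal sketch
of where it would go, assuming the usual config layout -- it could
equally live under [global] to cover all daemons:

    [osd]
        debug ms = 1

The daemon then needs a restart, or the option injected at runtime,
before the messenger logging takes effect.)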
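
(And for anyone tracing the same symptom: the fix was opening the whole
port range the Ceph daemons bind to, not just relying on the stateful
rule above. A sketch of the relevant iptables rules -- the 6789:6810
range is an assumption based on the ports visible in the log excerpts
(mon on 6789, OSDs on 6801 and 6806); adjust it to whatever your
daemons actually bind:

    # Masked the problem: replies to connections the restarted osd.0
    # initiated were accepted, so peering eventually recovered.
    -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT

    # The actual fix: accept inbound connections on all ports the
    # monitors and OSDs listen on (range is illustrative).
    -A INPUT -p tcp -m tcp --dport 6789:6810 -j ACCEPT
)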