On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
> On Fri, 22 Feb 2013, Chris Dunlop wrote:
>> G'day,
>>
>> It seems there might be two issues here: the first being the delayed
>> receipt of echo replies causing a seemingly otherwise healthy osd to be
>> marked down, the second being the lack of recovery once the downed osd is
>> recognised as up again.
>>
>> Is it worth my opening tracker reports for this, just so it doesn't get
>> lost?
>
> I just looked at the logs. I can't tell what happened to cause that 10
> second delay.. strangely, messages were passing from 0 -> 1, but nothing
> came back from 1 -> 0 (although 1 was queuing, if not sending, them).
>
> The strange bit is that after this, you get those indefinite hangs. From
> the logs it looks like the OSD rebound to an old port that was previously
> open from osd.0.. probably from way back. Do you have logs going further
> back than what you posted? Also, do you have osdmaps, say, 750 and
> onward? It looks like there is a bug in the connection handling code
> (that is unrelated to the delay above).

Currently uploading logs starting from midnight to Dropbox; will send
links when they're up.

How would I retrieve the interesting osdmaps?

Chris.