Re: Mon losing touch with OSDs

Sage Weil <sage@xxxxxxxxxxx> · Fri, 22 Feb 2013 15:43:22 -0800 (PST)

On Sat, 23 Feb 2013, Chris Dunlop wrote:
> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
> > On Fri, 22 Feb 2013, Chris Dunlop wrote:
> >> G'day,
> >> 
> >> It seems there might be two issues here: the first being the delayed
> >> receipt of echo replies causing a seemingly otherwise healthy osd to be
> >> marked down, the second being the lack of recovery once the downed osd is
> >> recognised as up again.
> >> 
> >> Is it worth my opening tracker reports for this, just so it doesn't get
> >> lost?
> > 
> > I just looked at the logs.  I can't tell what happened to cause that 10 
> > second delay.. strangely, messages were passing from 0 -> 1, but nothing 
> > came back from 1 -> 0 (although 1 was queuing, if not sending, them).
> > 
> > The strange bit is that after this, you get those indefinite hangs.  From 
> > the logs it looks like the OSD rebound to an old port that was previously 
> > open from osd.0.. probably from way back.  Do you have logs going further 
> > back than what you posted?  Also, do you have osdmaps, say, 750 and 
> > onward?  It looks like there is a bug in the connection handling code 
> > (that is unrelated to the delay above).
> 
> Currently uploading logs starting midnight to dropbox, will send
> links when they're up.
> 
> How would I retrieve the interesting osdmaps?

They are in the monitor data directory, in the osdmap_full dir.
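
For picking out the maps from epoch 750 onward, something like the
following rough sketch might help. It assumes each full map is stored as a
file named by its epoch number inside osdmap_full, which may not match
every version's on-disk layout; the function name list_full_osdmaps and
the example path in the comment are hypothetical, so substitute your own
"mon data" path.

```python
import os


def list_full_osdmaps(mon_data_dir, first_epoch=750):
    """Return sorted (epoch, path) pairs for full osdmaps >= first_epoch.

    Assumption: mon_data_dir/osdmap_full holds one file per epoch, and
    each file is named by its epoch number.
    """
    osdmap_dir = os.path.join(mon_data_dir, "osdmap_full")
    found = []
    for name in os.listdir(osdmap_dir):
        # Skip anything that is not a plain epoch number
        # (e.g. bookkeeping files kept alongside the maps).
        if not name.isdigit():
            continue
        epoch = int(name)
        if epoch >= first_epoch:
            found.append((epoch, os.path.join(osdmap_dir, name)))
    return sorted(found)


# Hypothetical usage; the path below is only an example location:
#   for epoch, path in list_full_osdmaps("/var/lib/ceph/mon/ceph-a"):
#       print(epoch, path)
```

The returned paths can then be copied or tarred up alongside the logs.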

Thanks!
sage