Re: Mon losing touch with OSDs

Sage Weil <sage@xxxxxxxxxxx> · Fri, 22 Feb 2013 17:30:04 -0800 (PST)

On Sat, 23 Feb 2013, Chris Dunlop wrote:
> On Fri, Feb 22, 2013 at 04:13:21PM -0800, Sage Weil wrote:
> > On Sat, 23 Feb 2013, Chris Dunlop wrote:
> >> On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
> >>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
> >>>> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
> >>>>> On Fri, 22 Feb 2013, Chris Dunlop wrote:
> >>>>>> G'day,
> >>>>>> 
> >>>>>> It seems there might be two issues here: the first being the delayed
> >>>>>> receipt of echo replies causing a seemingly otherwise healthy osd to be
> >>>>>> marked down, the second being the lack of recovery once the downed osd is
> >>>>>> recognised as up again.
> >>>>>> 
> >>>>>> Is it worth my opening tracker reports for this, just so it doesn't get
> >>>>>> lost?
> >>>>> 
> >>>>> I just looked at the logs.  I can't tell what happened to cause that 10 
> >>>>> second delay.  Strangely, messages were passing from 0 -> 1, but nothing 
> >>>>> came back from 1 -> 0 (although 1 was queuing, if not sending, them).
> >> 
> >> Is there any way of telling where they were delayed, i.e. in 1's output
> >> queue or 0's input queue?
> > 
> > Yeah, if you bump it up to 'debug ms = 20'.  Be aware that that will 
> > generate a lot of logging, though.
> 
> I really don't want to load the system with too much logging, but I'm happy
> modifying code...  Are there specific interesting debug outputs which I can
> modify so they're output under "ms = 1"?

I'm basically interested in everything in writer() and write_message(), 
and reader() and read_message()...
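
To illustrate the kind of change being discussed, here is a minimal,
self-contained sketch of level-gated debug output.  It is not the actual
messenger code: debug_ms, MS_LOG, log_sink and the two write_message_*
functions are hypothetical stand-ins, and the message text is invented.
The only point is that lowering the level on an individual statement
makes it appear at "ms = 1" without raising the overall verbosity.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical stand-in for the configured "debug ms" level.
static int debug_ms = 1;

// Collects emitted log lines so the effect is easy to inspect.
static std::ostringstream log_sink;

// Hypothetical level-gated logging macro: the statement is emitted only
// when the configured level is at least the statement's own level.
#define MS_LOG(level) \
  if ((level) > debug_ms) {} else log_sink

// Set the configured level (assumed non-negative in this sketch).
void set_debug_ms(int level) {
  assert(level >= 0);
  debug_ms = level;
}

// Stand-in for a send path whose debug statement sits at level 20,
// so it stays silent unless the configured level is 20 or higher.
void write_message_verbose(int seq) {
  MS_LOG(20) << "write_message seq " << seq << "\n";
}

// The same statement lowered to level 1, so it is emitted at level 1.
void write_message_lowered(int seq) {
  MS_LOG(1) << "write_message seq " << seq << "\n";
}

// Everything logged so far.
std::string captured() { return log_sink.str(); }
```

With the level set to 1, calling write_message_verbose() logs nothing,
while write_message_lowered() adds one line to the captured output.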

sage