On Fri, 1 Mar 2013, Chris Dunlop wrote:
> On Sat, Feb 23, 2013 at 01:02:53PM +1100, Chris Dunlop wrote:
> > On Fri, Feb 22, 2013 at 05:52:11PM -0800, Sage Weil wrote:
> >> On Sat, 23 Feb 2013, Chris Dunlop wrote:
> >>> On Fri, Feb 22, 2013 at 05:30:04PM -0800, Sage Weil wrote:
> >>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
> >>>>> On Fri, Feb 22, 2013 at 04:13:21PM -0800, Sage Weil wrote:
> >>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
> >>>>>>> On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
> >>>>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
> >>>>>>>>> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
> >>>>>>>>>> I just looked at the logs. I can't tell what happened to cause that 10
> >>>>>>>>>> second delay... Strangely, messages were passing from 0 -> 1, but nothing
> >>>>>>>>>> came back from 1 -> 0 (although 1 was queuing, if not sending, them).
> >>>>>>>
> >>>>>>> Is there any way of telling where they were delayed, i.e. in 1's output
> >>>>>>> queue or 0's input queue?
> >>>>>>
> >>>>>> Yeah, if you bump it up to 'debug ms = 20'. Be aware that that will
> >>>>>> generate a lot of logging, though.
> >>>>>
> >>>>> I really don't want to load the system with too much logging, but I'm happy
> >>>>> modifying code... Are there specific interesting debug outputs which I can
> >>>>> modify so they're output under "ms = 1"?
> >>>>
> >>>> I'm basically interested in everything in writer() and write_message(),
> >>>> and reader() and read_message()...
> >>>
> >>> Like this?
> >>
> >> Yeah. You could do 2 instead of 1 so you can turn it down. I suspect
> >> that this is the lion's share of what debug 20 will spam to the log, but
> >> hopefully the load is manageable!
> >
> > Good idea on the '2'. I'll get that installed and wait for it to happen again.
>
> FYI...
>
> To avoid running out of disk space for the massive logs, I
> started using logrotate on the ceph logs every two hours, which
> does a 'service ceph reload' to re-open the log files.
>
> In the week since doing that I haven't seen any 'slow requests'
> at all (the load has stayed the same as before the change),
> which means the issue with the osds dropping out, then the
> system not recovering properly, also hasn't happened.
>
> That's a bit suspicious, no?

I suspect the logging itself is changing the timing. Let's wait and see
if we get lucky...

sage

> I've now put the log dirs on each machine on their own 2TB
> partition and reverted back to the default daily rotates.
>
> And once more we're waiting... Godot, is that you?
>
> Chris
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
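[For reference, a logrotate stanza along the lines of what Chris describes might look like the following. The log path, rotation schedule, and the 'service ceph reload' postrotate command are assumptions to illustrate the shape; adjust them to your installation and init system:]

```
/var/log/ceph/*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    postrotate
        # Ask the daemons to re-open their log files after rotation.
        service ceph reload >/dev/null 2>&1 || true
    endscript
}
```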