On Fri, 1 Mar 2013, Chris Dunlop wrote:
> On Sat, Feb 23, 2013 at 01:02:53PM +1100, Chris Dunlop wrote:
> > On Fri, Feb 22, 2013 at 05:52:11PM -0800, Sage Weil wrote:
> >> On Sat, 23 Feb 2013, Chris Dunlop wrote:
> >>> On Fri, Feb 22, 2013 at 05:30:04PM -0800, Sage Weil wrote:
> >>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
> >>>>> On Fri, Feb 22, 2013 at 04:13:21PM -0800, Sage Weil wrote:
> >>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
> >>>>>>> On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
> >>>>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
> >>>>>>>>> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
> >>>>>>>>>> I just looked at the logs. I can't tell what happened to cause that 10
> >>>>>>>>>> second delay... Strangely, messages were passing from 0 -> 1, but nothing
> >>>>>>>>>> came back from 1 -> 0 (although 1 was queuing, if not sending, them).
> >>>>>>>
> >>>>>>> Is there any way of telling where they were delayed, i.e. in 1's output
> >>>>>>> queue or 0's input queue?
> >>>>>>
> >>>>>> Yeah, if you bump it up to 'debug ms = 20'. Be aware that that will
> >>>>>> generate a lot of logging, though.
> >>>>>
> >>>>> I really don't want to load the system with too much logging, but I'm happy
> >>>>> modifying code... Are there specific interesting debug outputs which I can
> >>>>> modify so they're output under "ms = 1"?
> >>>>
> >>>> I'm basically interested in everything in writer() and write_message(),
> >>>> and reader() and read_message()...
> >>>
> >>> Like this?
> >>
> >> Yeah. You could do 2 instead of 1 so you can turn it down. I suspect
> >> that this is the lion's share of what debug 20 will spam to the log, but
> >> hopefully the load is manageable!
> >
> > Good idea on the '2'. I'll get that installed and wait for it to happen again.
>
> FYI...
>
> To avoid running out of disk space for the massive logs, I
> started using logrotate on the ceph logs every two hours, which
> does a 'service ceph reload' to re-open the log files.
>
> In the week since doing that I haven't seen any 'slow requests'
> at all (the load has stayed the same as before the change),
> which means the issue with the osds dropping out, then the
> system not recovering properly, also hasn't happened.
>
> That's a bit suspicious, no?

I suspect the logging itself is changing the timing. Let's wait and see
if we get lucky...

sage

> I've now put the log dirs on each machine on their own 2TB
> partition and reverted back to the default daily rotates.
>
> And once more we're waiting... Godot, is that you?
>
> Chris
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
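[For reference, a logrotate stanza along the lines of what Chris describes might look like the following. The log path, rotation schedule, and the 'service ceph reload' postrotate command are assumptions to illustrate the shape; adjust them to your installation and init system:]

```
/var/log/ceph/*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    postrotate
        # Ask the daemons to re-open their log files after rotation.
        service ceph reload >/dev/null 2>&1 || true
    endscript
}
```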