[Putting list back on cc]

On Friday, March 15, 2013 at 4:11 PM, Jim Schutt wrote:
> On 03/15/2013 04:23 PM, Greg Farnum wrote:
> > As I come back and look at these again, I'm not sure what the context
> > for these logs is. Which test did they come from, and which behavior
> > (slow or not slow, etc) did you see? :)
> > -Greg
>
> They come from a test where I had debug mds = 20 and debug ms = 1
> on the MDS while writing files from 198 clients. It turns out that
> for some reason I need debug mds = 20 during writing to reproduce
> the slow stat behavior later.
>
> strace.find.dirs.txt.bz2 contains the log of running
> strace -tt -o strace.find.dirs.txt find /mnt/ceph/stripe-4M -type d -exec ls -lhd {} \;
>
> From that output, I believe that the stat of at least these files is slow:
> zero0.rc11
> zero0.rc30
> zero0.rc46
> zero0.rc8
> zero0.tc103
> zero0.tc105
> zero0.tc106
> I believe that log shows slow stats on more files, but those are the first few.
>
> mds.cs28.slow-stat.partial.bz2 contains the MDS log from just before the
> find command started, until just after the fifth or sixth slow stat from
> the list above.
>
> I haven't yet tried to find other ways of reproducing this, but so far
> it appears that something happens during the writing of the files that
> ends up causing the condition that results in slow stat commands.
>
> I have the full MDS log from the writing of the files, as well, but it's
> big....
>
> Is that what you were after?
>
> Thanks for taking a look!
>
> -- Jim

I just was coming back to these to see what new information was available, but I realized we'd discussed several tests and I wasn't sure what these ones came from. That information is enough, yes.

If in fact you've only seen this with high MDS debug levels, I believe the cause is as I mentioned last time: the MDS is flapping a bit, so some files get marked as "needsrecover", but they aren't getting recovered asynchronously, and the first thing that pokes them into doing a recover is the stat. That's definitely not the behavior we want, so I'll be poking around the code a bit and generating bugs, but given that explanation it's a bit less scary than random slow stats, so it's not such a high priority. :)

Do let me know if you come across it without the MDS and clients having had connection issues!
-Greg

Software Engineer #42 @ http://inktank.com | http://ceph.com
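
For anyone who wants to pick the slow stats out of a `strace -tt` log like the strace.find.dirs.txt described above, here is a rough sketch; it is not something that was run in this thread. It assumes the single-process output format `HH:MM:SS.uuuuuu syscall(args) = ret` and treats the gap to the next traced line as an approximation of how long a call took. The function name `slow_calls`, the 1-second threshold, and the "stat" name filter are all made up for illustration.

```python
import re
from datetime import datetime

# Matches lines such as: 16:11:02.123456 lstat("/path", {...}) = 0
LINE_RE = re.compile(r'^(\d{2}:\d{2}:\d{2}\.\d{6}) (\w+)\((.*)$')


def slow_calls(lines, threshold=1.0, name_filter="stat"):
    """Return (gap_seconds, syscall, rest_of_line) for calls followed by a long gap.

    The gap is the time between the start of one traced call and the start
    of the next, so it also includes any time spent outside the syscall.
    """
    parsed = []
    for line in lines:
        m = LINE_RE.match(line)
        if m:  # skip anything that does not look like a traced call
            ts = datetime.strptime(m.group(1), "%H:%M:%S.%f")
            parsed.append((ts, m.group(2), m.group(3).rstrip()))

    slow = []
    for (ts, name, rest), (next_ts, _, _) in zip(parsed, parsed[1:]):
        gap = (next_ts - ts).total_seconds()
        if gap < 0:
            # -tt prints no date, so a negative gap means the trace crossed midnight
            gap += 86400
        if gap >= threshold and name_filter in name:
            slow.append((gap, name, rest))
    return slow
```

To use it, open the decompressed log and pass the file object, e.g. `slow_calls(open("strace.find.dirs.txt"))`, then print the returned tuples. Because the gaps are only inferred from start timestamps, re-running strace with `-T` added (which records the time spent in each syscall) would give more direct numbers.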