On 03/15/2013 05:17 PM, Greg Farnum wrote:
> [Putting list back on cc]
>
> On Friday, March 15, 2013 at 4:11 PM, Jim Schutt wrote:
>
>> On 03/15/2013 04:23 PM, Greg Farnum wrote:
>>> As I come back and look at these again, I'm not sure what the context
>>> for these logs is. Which test did they come from, and which behavior
>>> (slow or not slow, etc) did you see? :) -Greg
>>
>>
>>
>> They come from a test where I had debug mds = 20 and debug ms = 1
>> on the MDS while writing files from 198 clients. It turns out that
>> for some reason I need debug mds = 20 during writing to reproduce
>> the slow stat behavior later.
>>
>> strace.find.dirs.txt.bz2 contains the log of running
>> strace -tt -o strace.find.dirs.txt find /mnt/ceph/stripe-4M -type d -exec ls -lhd {} \;
>>
>> From that output, I believe that the stat of at least these files is slow:
>> zero0.rc11
>> zero0.rc30
>> zero0.rc46
>> zero0.rc8
>> zero0.tc103
>> zero0.tc105
>> zero0.tc106
>> I believe that log shows slow stats on more files, but those are the first few.
>>
>> mds.cs28.slow-stat.partial.bz2 contains the MDS log from just before the
>> find command started, until just after the fifth or sixth slow stat from
>> the list above.
>>
>> I haven't yet tried to find other ways of reproducing this, but so far
>> it appears that something happens during the writing of the files that
>> ends up causing the condition that results in slow stat commands.
>>
>> I have the full MDS log from the writing of the files, as well, but it's
>> big....
>>
>> Is that what you were after?
>>
>> Thanks for taking a look!
>>
>> -- Jim
>
> I just was coming back to these to see what new information was
> available, but I realized we'd discussed several tests and I wasn't
> sure what these ones came from. That information is enough, yes.
>
> If in fact you believe you've only seen this with high-level MDS
> debugging, I believe the cause is as I mentioned last time: the MDS
> is flapping a bit and so some files get marked as "needsrecover", but
> they aren't getting recovered asynchronously, and the first thing
> that pokes them into doing a recover is the stat.

OK, that makes sense.

> That's definitely not the behavior we want and so I'll be poking
> around the code a bit and generating bugs, but given that explanation
> it's a bit less scary than random slow stats are so it's not such a
> high priority. :) Do let me know if you come across it without the
> MDS and clients having had connection issues!

No problem - thanks!

-- Jim

> -Greg
>
> Software Engineer #42 @ http://inktank.com | http://ceph.com