On Fri, 2011-02-18 at 13:58 -0700, Sage Weil wrote: > On Fri, 18 Feb 2011, Jim Schutt wrote: > > On Fri, 2011-02-18 at 12:07 -0700, Colin McCabe wrote: > > > > Anybody else have ideas? :/ > > > > > > > > > > At the risk of asking a dumb question, does syslog have messages from > > > ntpd about date/time adjustments? > > > > No, just a few messages like this: > > > > Feb 18 08:34:43 an12 ntpd[4939]: kernel time sync enabled 0001 > > Feb 18 09:08:53 an12 ntpd[4939]: kernel time sync enabled 4001 > > Feb 18 10:34:16 an12 ntpd[4939]: kernel time sync enabled 0001 > > Feb 18 11:25:30 an12 ntpd[4939]: kernel time sync enabled 4001 > > Feb 18 11:42:36 an12 ntpd[4939]: kernel time sync enabled 0001 > > Feb 18 12:33:52 an12 ntpd[4939]: kernel time sync enabled 4001 > > Feb 18 13:08:00 an12 ntpd[4939]: kernel time sync enabled 0001 > > > > > > > > Also, what is the load on the system like? dout will write to a > > > logfile using write(2), which of course is a blocking system call. 28 > > > seconds is a long time for a write to block, but things can get weird > > > under heavy I/O. > > > > During my tests my systems have high iowait (40-80%), but CPU > > utilization is relatively low. However, my ceph logs are going > > to a dedicated disk via a dedicated I/O controller; i.e. my > > ceph data disks are FC and my log disk is a local SATA disk. > > Also, at least in your earlier example, lots of other crap is going into > the log in a timely manner. > > One thing that might shed some light on this is getting a series of > snapshots of /proc/$cosd_pid/tasks/*/{status,sched,syscall} during this > period. I worked on that a little yesterday... Right now I am running 8 cosds/server, and such a /proc snapshot takes ~6 seconds to complete, due to the number of threads I guess. I'll need to do some more testing at smaller osd counts to see if I can reproduce and get such a snapshot to complete quickly enough to be useful. Also, I was thinking of trying latencytop, or trying to learn enough about some of the new kernel diagnostics to see if there's anything there that might help. > (We also need to map the pthread ids we see in the logs to the > task pid...) It sounds to me like something is preventing the kernel from > scheduling the thread, unrelated to and pthread-level primitives the code > is interacting with. I was thinking along those lines - I'll soon have some results with CONFIG_CGROUP_SCHED disabled. Since it's relatively new, maybe it's causing unexpected trouble here. But I'm really just guessing. -- Jim > I'm not familiar enough with the scheduling > internals to know what that might be. > > sage > > > > > > -- Jim > > > > > > > > A more general problem is priority inversion through the dout lock. > > > For example, some other piece of code could be doing something like this > > > > > > dout << my_happy_printer_function() << dendl > > > > > > If my_happy_printer_function takes some heavily contended locks, this > > > will turn into > > > 1. lock dout lock > > > 2. lock heavily contended locks > > > 3. write(2) to logfile > > > 4. unlock "" > > > 5. unlock dout lock > > > > > > This is one reason why I want to get rid of the dout lock (although > > > not the only one.) > > > > > > Colin > > > > > > > > > On Thu, Feb 17, 2011 at 11:13 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote: > > > > On Thu, 17 Feb 2011, Jim Schutt wrote: > > > >> Check out the result: > > > >> > > > >> osd.68.log:256027:2011-02-17 15:44:50.141378 7fd42ad57940 osd68 5 tick > > > >> osd.68.log:256028:2011-02-17 15:44:50.141464 7fd42ad57940 osd68 5 tick getting read lock on map_lock > > > >> osd.68.log:256029:2011-02-17 15:44:50.141472 7fd42ad57940 osd68 5 tick got read lock on map_lock > > > >> osd.68.log:256031:2011-02-17 15:44:50.141612 7fd42ad57940 osd68 5 tick sending mon report > > > >> osd.68.log:256032:2011-02-17 15:44:50.141619 7fd42ad57940 osd68 5 tick removing stray pgs > > > >> osd.68.log:256033:2011-02-17 15:44:50.141626 7fd42ad57940 osd68 5 tick sending log to logclient > > > >> osd.68.log:256034:2011-02-17 15:44:50.141633 7fd42ad57940 osd68 5 tick arming timer for next tick <== > > > >> osd.68.log:256277:2011-02-17 15:45:18.481656 7fd42ad57940 osd68 5 tick checking dispatch queue status <== 28 second gap > > > >> osd.68.log:256278:2011-02-17 15:45:18.481669 7fd42ad57940 osd68 5 tick done > > > >> osd.68.log:256279:2011-02-17 15:45:18.481691 7fd42ad57940 osd68 5 tick > > > > > > > > > -- > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html