Hi Mark,

OK, I'll upgrade to the current master and retest...

best,

Jake

On 06/06/17 15:46, Mark Nelson wrote:
> Hi Jake,
>
> I just happened to notice this was on 12.0.3. Would it be possible to
> test this out with current master and see if it is still a problem?
>
> Mark
>
> On 06/06/2017 09:10 AM, Mark Nelson wrote:
>> Hi Jake,
>>
>> Thanks much. I'm guessing at this point this is probably a bug. Would
>> you (or nokiauser) mind creating a bug in the tracker with a short
>> description of what's going on and the collectl sample showing this is
>> not IOs backing up on the disk?
>>
>> If you want to try it, we have a gdb-based wallclock profiler that might
>> be interesting to run while it's in the process of timing out. It tries
>> to grab 2000 samples from the osd process, which typically takes about 10
>> minutes or so. You'll need to either change the number of samples to be
>> lower in the python code (maybe like 50-100), or change the timeout to
>> be something longer.
>>
>> You can find the code here:
>>
>> https://github.com/markhpc/gdbprof
>>
>> and invoke it like:
>>
>> sudo gdb -ex 'set pagination off' -ex 'attach 27962' -ex 'source
>> ./gdbprof.py' -ex 'profile begin' -ex 'quit'
>>
>> where 27962 in this case is the PID of the ceph-osd process. You'll
>> need gdb with the python bindings and the ceph debug symbols for it to
>> work.
>>
>> This might tell us over time if the tp_osd_tp processes are just sitting
>> on pg::locks.
>>
>> Mark
>>
>> On 06/06/2017 05:34 AM, Jake Grimmett wrote:
>>> Hi Mark,
>>>
>>> Thanks again for looking into this problem.
>>>
>>> I ran the cluster overnight, with a script checking for dead OSDs every
>>> second and restarting them.
>>>
>>> 40 OSD failures occurred in 12 hours; some OSDs failed multiple times
>>> (there are 50 OSDs in the EC tier).
>>>
>>> Unfortunately, the output of collectl doesn't appear to show any
>>> increase in disk queue depth or service times before the OSDs die.
>>>
>>> I've put a couple of examples of collectl output for the disks
>>> associated with the OSDs here:
>>>
>>> https://hastebin.com/icuvotemot.scala
>>>
>>> Please let me know if you need more info...
>>>
>>> best regards,
>>>
>>> Jake
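
For reference: the "script checking for dead OSDs every second, and restarting them" that Jake mentions is not included in the thread. A minimal sketch of that kind of watchdog loop might look like the following, assuming OSDs are managed as systemd ceph-osd@<id> units; the OSD-id discovery method, log path, and one-second poll interval are assumptions, not details from the thread.

#!/bin/bash
# Hypothetical OSD watchdog sketch: poll every second and restart any
# ceph-osd systemd unit that is no longer active.
# Assumption: OSD data dirs under /var/lib/ceph/osd/ are named ceph-<id>.
OSD_IDS=$(ls /var/lib/ceph/osd/ | sed 's/^ceph-//')

while true; do
    for id in $OSD_IDS; do
        if ! systemctl is-active --quiet "ceph-osd@${id}"; then
            # Record the failure time so it can be correlated with collectl output.
            echo "$(date -Is) restarting ceph-osd@${id}" >> /var/log/osd-watchdog.log
            systemctl restart "ceph-osd@${id}"
        fi
    done
    sleep 1
done

Restart timestamps logged this way can then be lined up against the collectl disk samples (queue depth, service time) for the corresponding devices, which is the comparison discussed above.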