Hi Martin,

I reviewed this code again last week and realized the locking wasn't quite
right. And then that the pending_ops counter was largely useless. So most
of it has been simplified/rewritten now in master, and this problem will be
gone--at least in its current form.

Please let us know if you see any new issues with the latest master. (The
relevant commit is b47347bd7c377037f7fbc199f0c88b447c9626d1.)

Thanks-
sage


On Thu, 24 Nov 2011, Martin Mailand wrote:

> Hi Sage,
> I hit it again, this time on another osd
>
> ceph version 0.38-181-g2e19550 (commit:2e195500b5d3a8ab8512bcf2a219a6b7ff922c97)
>
> Thread 1 (Thread 2951):
> #0 0x00007f36bbb41b3b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
> #1 0x00000000005f5852 in reraise_fatal (signum=6) at global/signal_handler.cc:59
> #2 0x00000000005f5e4a in handle_fatal_signal (signum=6) at global/signal_handler.cc:106
> #3 <signal handler called>
> #4 0x00007f36ba0c2d05 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #5 0x00007f36ba0c6ab6 in abort () from /lib/x86_64-linux-gnu/libc.so.6
> #6 0x00007f36ba9796dd in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #7 0x00007f36ba977926 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #8 0x00007f36ba977953 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #9 0x00007f36ba977a5e in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #10 0x00000000005f6956 in ceph::__ceph_assert_fail (assertion=<value optimized out>, file=<value optimized out>, line=<value optimized out>, func=<value optimized out>) at common/assert.cc:70
> #11 0x000000000056616a in OSD::dequeue_op (this=0x25b0000, pg=<value optimized out>) at osd/OSD.cc:5518
> #12 0x00000000005d4406 in ThreadPool::worker (this=0x25b0408) at common/WorkQueue.cc:54
> #13 0x00000000005822dd in ThreadPool::WorkThread::entry (this=<value optimized out>) at ./common/WorkQueue.h:120
> #14 0x00007f36bbb38d8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
> #15 0x00007f36ba17504d in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #16 0x0000000000000000 in ?? ()
> (gdb) thread 1
> [Switching to thread 1 (Thread 2951)]#0 0x00007f36bbb41b3b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
> (gdb) frame 11
> #11 0x000000000056616a in OSD::dequeue_op (this=0x25b0000, pg=<value optimized out>) at osd/OSD.cc:5518
> 5518 osd/OSD.cc: No such file or directory.
>         in osd/OSD.cc
> (gdb) p pending_ops
> $1 = 0
>
> -martin
>
>
> On 16.11.2011 22:12, Sage Weil wrote:
> > Hi Martin,
> >
> > I've reread the code twice now and it's really not clear to me how
> > pending_ops could get out of sync with the actual queue size. I've pushed
> > a couple of patches that remove surrounding dead code and add an
> > additional assert sanity check to master. Have you seen this again, or
> > just that once?
> >
> > Opened http://tracker.newdream.net/issues/1727
> >
> > Thanks-
> > sage
> >
> >
> > On Wed, 16 Nov 2011, Martin Mailand wrote:
> >
> > > Hi,
> > > so after a little help from Greg.
> > >
> > > (gdb) print pending_ops
> > > $1 = 0
> > >
> > > -martin
> > >
> > > Sage Weil wrote:
> > > > On Mon, 14 Nov 2011, Gregory Farnum wrote:
> > > > > It's not a big deal; logging is expensive. :) Just a backtrace isn't a
> > > > > lot to go on, but it's better than nothing!
> > > > >
> > > > > On Mon, Nov 14, 2011 at 11:45 AM, Martin Mailand <martin@xxxxxxxxxxxx> wrote:
> > > > > > Hi Gregory,
> > > > > > I do not have more at the moment. As I cannot have the debug log
> > > > > > always on, would a core dump be the best solution?
> > > >
> > > > I'm mainly interested in whether pending_ops is 0 or < 0. A 'thread apply
> > > > all bt' may also be useful.
> > > >
> > > > Thanks!
> > > > sage
> > > >
> > > >
> > > > > > -martin
> > > > > >
> > > > > > Gregory Farnum wrote:
> > > > > > > Do you have any other system state? (More logs, core dumps.)
> > > > > > >
> > > > > > > Make a bug in the tracker either way so it doesn't get lost track of. :)
> > > > > > > -Greg
> > > > > > >
> > > > > > > On Mon, Nov 14, 2011 at 6:04 AM, Martin Mailand <martin@xxxxxxxxxxxx> wrote:
> > > > > > > > Hi,
> > > > > > > > today one of my osds died, the log is:
> > > > > > > >
> > > > > > > > sd/OSD.cc: In function 'void OSD::dequeue_op(PG*)', in thread '7faeb6139700'
> > > > > > > > osd/OSD.cc: 5534: FAILED assert(pending_ops > 0)
> > > > > > > > ceph version 0.38 (commit:b600ec2ac7c0f2e508720f8e8bb87c3db15509b9)
> > > > > > > > 1: (OSD::dequeue_op(PG*)+0x4bb) [0x55a4db]
> > > > > > > > 2: (ThreadPool::worker()+0x6e6) [0x5b7b16]
> > > > > > > > 3: (ThreadPool::WorkThread::entry()+0xd) [0x57398d]
> > > > > > > > 4: (()+0x6d8c) [0x7faec4d12d8c]
> > > > > > > > 5: (clone()+0x6d) [0x7faec355404d]
> > > > > > > > ceph version 0.38 (commit:b600ec2ac7c0f2e508720f8e8bb87c3db15509b9)
> > > > > > > > 1: (OSD::dequeue_op(PG*)+0x4bb) [0x55a4db]
> > > > > > > > 2: (ThreadPool::worker()+0x6e6) [0x5b7b16]
> > > > > > > > 3: (ThreadPool::WorkThread::entry()+0xd) [0x57398d]
> > > > > > > > 4: (()+0x6d8c) [0x7faec4d12d8c]
> > > > > > > > 5: (clone()+0x6d) [0x7faec355404d]
> > > > > > > > *** Caught signal (Aborted) **
> > > > > > > > in thread 7faeb6139700
> > > > > > > > ceph version 0.38 (commit:b600ec2ac7c0f2e508720f8e8bb87c3db15509b9)
> > > > > > > > 1: /usr/bin/ceph-osd() [0x5b8b52]
> > > > > > > > 2: (()+0xfc60) [0x7faec4d1bc60]
> > > > > > > > 3: (gsignal()+0x35) [0x7faec34a1d05]
> > > > > > > > 4: (abort()+0x186) [0x7faec34a5ab6]
> > > > > > > > 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7faec3d586dd]
> > > > > > > > 6: (()+0xb9926) [0x7faec3d56926]
> > > > > > > > 7: (()+0xb9953) [0x7faec3d56953]
> > > > > > > > 8: (()+0xb9a5e) [0x7faec3d56a5e]
> > > > > > > > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x396) [0x5bddb6]
> > > > > > > > 10: (OSD::dequeue_op(PG*)+0x4bb) [0x55a4db]
> > > > > > > > 11: (ThreadPool::worker()+0x6e6) [0x5b7b16]
> > > > > > > > 12: (ThreadPool::WorkThread::entry()+0xd) [0x57398d]
> > > > > > > > 13: (()+0x6d8c) [0x7faec4d12d8c]
> > > > > > > > 14: (clone()+0x6d) [0x7faec355404d]
> > > > > > > >
> > > > > > > > Anything else needed to debug this?
> > > > > > > >
> > > > > > > > -martin
> > > > > > > > --
> > > > > > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html
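
For readers following the thread: the failed assert(pending_ops > 0) is an invariant between a separately maintained counter and the real queue length. The sketch below is only an illustration of one way to avoid that class of bug, by deriving the pending count from the queue itself under a single lock so there is no second value that can drift. It is not Ceph's code and does not claim to reflect what the commit mentioned above does; OpQueue, enqueue, dequeue and pending are hypothetical names, and ops are modelled as plain ints.

```cpp
#include <cstddef>
#include <deque>
#include <mutex>

// Hypothetical stand-in for an op queue; NOT the OSD class from Ceph.
// Assumption: every access to the queue goes through these methods.
class OpQueue {
public:
    // Add an op to the back of the queue.
    void enqueue(int op) {
        std::lock_guard<std::mutex> guard(lock_);
        ops_.push_back(op);
    }

    // Pop the oldest op into `out`. Returns false when the queue is
    // empty instead of asserting on a separate counter.
    bool dequeue(int& out) {
        std::lock_guard<std::mutex> guard(lock_);
        if (ops_.empty()) {
            return false;
        }
        out = ops_.front();
        ops_.pop_front();
        return true;
    }

    // The pending count is read from the queue under the same lock,
    // so it cannot disagree with the queue contents.
    std::size_t pending() const {
        std::lock_guard<std::mutex> guard(lock_);
        return ops_.size();
    }

private:
    mutable std::mutex lock_;  // guards ops_
    std::deque<int> ops_;      // queued ops, oldest first
};
```

With this shape, a worker that finds nothing to dequeue gets a false return rather than an abort, and any debugging output can call pending() without a second piece of state to keep consistent.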