On Wed, Jun 09, 10:29, Sage Weil wrote:
> > I'll let you know if I can trigger it reliably.

I recreated the cephfs using the same setup (7 osds, 3 mons, 3 mds), and
the problem happened again, this time while running "stress" from two
clients over the weekend. This morning all stress processes were stuck in
state D and access to the ceph fs blocked; not even ls -l worked.

I rebooted one machine that was running stress, cosd, cmds and cmon. I had
to power cycle it, since reboot was unable to kill the stress processes.
After the reboot, cosd crashes a few seconds after starting because it hits
assert(recovering_oids.count(soid) == 0) in start_recovery_op().

gdb output:

osd/PG.cc: In function 'void PG::start_recovery_op(const sobject_t&)':
osd/PG.cc:1833: FAILED assert(recovering_oids.count(soid) == 0)
 1: (ReplicatedPG::issue_repop(ReplicatedPG::RepGather*, utime_t, eversion_t, bool, unsigned long, eversion_t)+0x84e) [0x49c88e]
 2: (ReplicatedPG::do_op(MOSDOp*)+0xa9a) [0x4a22ba]
 3: (OSD::dequeue_op(PG*)+0x402) [0x4e4dc2]
 4: (ThreadPool::worker()+0x1fc) [0x5ec64c]
 5: (ThreadPool::WorkThread::entry()+0xd) [0x50480d]
 6: (Thread::_entry_func(void*)+0x7) [0x476f57]
 7: /lib/libpthread.so.0 [0x7fc9556523f7]
 8: (clone()+0x6d) [0x7fc9548a7b4d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion*'

Program received signal SIGABRT, Aborted.
(gdb) bt
#0  0x00007fc954802095 in raise () from /lib/libc.so.6
#1  0x00007fc954803af0 in abort () from /lib/libc.so.6
#2  0x00007fc9550870e4 in __gnu_cxx::__verbose_terminate_handler () from /usr/lib/libstdc++.so.6
#3  0x00007fc955085076 in ?? () from /usr/lib/libstdc++.so.6
#4  0x00007fc9550850a3 in std::terminate () from /usr/lib/libstdc++.so.6
#5  0x00007fc95508518a in __cxa_throw () from /usr/lib/libstdc++.so.6
#6  0x00000000005eb22f in ceph::__ceph_assert_fail (assertion=0x6211d0 "recovering_oids.count(soid) == 0", file=0x620113 "osd/PG.cc", line=1833, func=0x622140 "void PG::start_recovery_op(const sobject_t&)") at common/assert.cc:30
#7  0x0000000000546b68 in PG::start_recovery_op (this=0xa846d0, soid=@0x12ceb98) at osd/PG.cc:1833
#8  0x000000000049c88e in ReplicatedPG::issue_repop (this=0xa846d0, repop=0x1296d70, now=<value optimized out>, old_last_update={version = 249, epoch = 137, __pad = 0}, old_exists=true, old_size=4194304, old_version={version = 249, epoch = 137, __pad = 0}) at osd/ReplicatedPG.cc:2280
#9  0x00000000004a22ba in ReplicatedPG::do_op (this=0xa846d0, op=<value optimized out>) at osd/ReplicatedPG.cc:637
#10 0x00000000004e4dc2 in OSD::dequeue_op (this=0x8a9fc0, pg=0xa846d0) at osd/OSD.cc:4456
#11 0x00000000005ec64c in ThreadPool::worker (this=0x8aa478) at common/WorkQueue.cc:44
#12 0x000000000050480d in ThreadPool::WorkThread::entry (this=<value optimized out>) at ./common/WorkQueue.h:113
#13 0x0000000000476f57 in Thread::_entry_func (arg=0xa63) at ./common/Thread.h:39
#14 0x00007fc9556523f7 in start_thread () from /lib/libpthread.so.0
#15 0x00007fc9548a7b4d in clone () from /lib/libc.so.6
#16 0x0000000000000000 in ?? ()

I also ran the checkpg script as you suggested, but it did not find any
corrupted pgs. The tip of the git branch this cosd was compiled from is
214a42798b4a5cd57d09c6a13b39b17c4f616aa3 (mds: handle dup anchorclient
ACKs gracefully).

Any hints?

Andre
--
The only person who always got his work done by Friday was Robinson Crusoe