On Wed, Jun 09, 10:29, Sage Weil wrote:
> > I'll let you know if I can trigger it reliably.

I recreated the cephfs using the same setup (7 osds, 3 mons, 3 mds), and
the problem happened again, this time while running "stress" from two
clients over the weekend. This morning all stress processes were stuck in
state D and access to the ceph fs blocked; not even ls -l worked.

I rebooted one machine that was running stress, cosd, cmds and cmon. I had
to power cycle it, since reboot was unable to kill the stress processes.
After the reboot, cosd crashes a few seconds after starting because it hits
assert(recovering_oids.count(soid) == 0) in start_recovery_op().

gdb output:

osd/PG.cc: In function 'void PG::start_recovery_op(const sobject_t&)':
osd/PG.cc:1833: FAILED assert(recovering_oids.count(soid) == 0)
 1: (ReplicatedPG::issue_repop(ReplicatedPG::RepGather*, utime_t, eversion_t, bool, unsigned long, eversion_t)+0x84e) [0x49c88e]
 2: (ReplicatedPG::do_op(MOSDOp*)+0xa9a) [0x4a22ba]
 3: (OSD::dequeue_op(PG*)+0x402) [0x4e4dc2]
 4: (ThreadPool::worker()+0x1fc) [0x5ec64c]
 5: (ThreadPool::WorkThread::entry()+0xd) [0x50480d]
 6: (Thread::_entry_func(void*)+0x7) [0x476f57]
 7: /lib/libpthread.so.0 [0x7fc9556523f7]
 8: (clone()+0x6d) [0x7fc9548a7b4d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion*'

Program received signal SIGABRT, Aborted.
(gdb) bt
#0  0x00007fc954802095 in raise () from /lib/libc.so.6
#1  0x00007fc954803af0 in abort () from /lib/libc.so.6
#2  0x00007fc9550870e4 in __gnu_cxx::__verbose_terminate_handler () from /usr/lib/libstdc++.so.6
#3  0x00007fc955085076 in ?? () from /usr/lib/libstdc++.so.6
#4  0x00007fc9550850a3 in std::terminate () from /usr/lib/libstdc++.so.6
#5  0x00007fc95508518a in __cxa_throw () from /usr/lib/libstdc++.so.6
#6  0x00000000005eb22f in ceph::__ceph_assert_fail (assertion=0x6211d0 "recovering_oids.count(soid) == 0", file=0x620113 "osd/PG.cc", line=1833, func=0x622140 "void PG::start_recovery_op(const sobject_t&)") at common/assert.cc:30
#7  0x0000000000546b68 in PG::start_recovery_op (this=0xa846d0, soid=@0x12ceb98) at osd/PG.cc:1833
#8  0x000000000049c88e in ReplicatedPG::issue_repop (this=0xa846d0, repop=0x1296d70, now=<value optimized out>, old_last_update={version = 249, epoch = 137, __pad = 0}, old_exists=true, old_size=4194304, old_version={version = 249, epoch = 137, __pad = 0}) at osd/ReplicatedPG.cc:2280
#9  0x00000000004a22ba in ReplicatedPG::do_op (this=0xa846d0, op=<value optimized out>) at osd/ReplicatedPG.cc:637
#10 0x00000000004e4dc2 in OSD::dequeue_op (this=0x8a9fc0, pg=0xa846d0) at osd/OSD.cc:4456
#11 0x00000000005ec64c in ThreadPool::worker (this=0x8aa478) at common/WorkQueue.cc:44
#12 0x000000000050480d in ThreadPool::WorkThread::entry (this=<value optimized out>) at ./common/WorkQueue.h:113
#13 0x0000000000476f57 in Thread::_entry_func (arg=0xa63) at ./common/Thread.h:39
#14 0x00007fc9556523f7 in start_thread () from /lib/libpthread.so.0
#15 0x00007fc9548a7b4d in clone () from /lib/libc.so.6
#16 0x0000000000000000 in ?? ()

I also ran the checkpg script as you suggested, but it did not find any
corrupted pgs. The tip of the git branch this cosd was compiled from is
214a42798b4a5cd57d09c6a13b39b17c4f616aa3 (mds: handle dup anchorclient
ACKs gracefully).

Any hints?

Andre
--
The only person who always got his work done by Friday was Robinson Crusoe