Re: osd: terminate called after throwing an instance of 'std::bad_alloc'

Andre Noll <maan@xxxxxxxxxxxxxxx> · Fri, 4 Jun 2010 10:45:21 +0200

On Wed, Jun 02, 11:19, Sage Weil wrote:
> Okay, it looks like there is a corrupt PG log.  Can you tar up the 
> $osd_data/current/meta directory, and then 'f 8' and 'p /x info.pgid' from 
> gdb (to figure out which pg it's loading)?

It's in decode_nohead():

	...
	Program received signal SIGABRT, Aborted.
	[Switching to Thread 0x7ff115b566f0 (LWP 5045)]
	0x00007ff1146e9095 in raise () from /lib/libc.so.6
	(gdb) f 8
	#8  0x0000000000540920 in PG::read_log (this=0x7ff1104b6460,
	store=<value optimized out>) at ./include/cstring.h:120
	120         _data = new char[_len + 1];
	(gdb) p /x info.pgid
	$1 = {v = {preferred = {v = 0xffff}, ps = {v = 0x1bf}, pool = {v = 0x0}}}

> There is an open bug for pglog corruption, but I haven't been able to 
> identify where it's actually happening.

How can one determine the pg from the above output? BTW: cosd has
/var/ceph/osd6/current/commit_op_seq open and this file contains the
number 1103797. Does that tell us anything?
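
For the archive, here is a minimal sketch of how the three fields printed by
gdb might be combined into a pg name. It assumes that pgs are written as
<pool>.<ps in hex>, and that preferred = 0xffff is a 16-bit -1 meaning "no
preferred osd" (a "p<N>" suffix is appended otherwise). Neither assumption is
confirmed in this thread, and pg_name() is a made-up helper, not part of ceph.

```python
def pg_name(pool, ps, preferred):
    """Format raw pgid fields as '<pool>.<ps in hex>[p<preferred>]'.

    Hypothetical helper; the naming scheme is an assumption.
    """
    # Assumption: 'preferred' is a signed 16-bit value, so 0xffff means -1.
    if preferred >= 0x8000:
        preferred -= 0x10000
    name = "%d.%x" % (pool, ps)
    # Assumption: a non-negative preferred osd is shown as a 'p<N>' suffix.
    if preferred >= 0:
        name += "p%d" % preferred
    return name


# Values taken from the 'p /x info.pgid' output above.
print(pg_name(pool=0x0, ps=0x1bf, preferred=0xffff))
```

Under these assumptions the values from the gdb session give 0.1bf.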

> Generally speaking, once you identify the bad pg, you can just delete the 
> offending pglog and data directory from the osd, restart, and it will 
> recover.  Provided you haven't corrupted both copies of the same pg on 
> different osds.  Or more often than not, there is more than one corrupted 
> log, and you have to repeat the process a few times.

That's valuable information, thanks. It should probably be documented
somewhere.

> This is probably the sort of corruption that we should log but not crash 
> on, so that the osd can continue to start up (and just skip the offending 
> pg).  I'll open an issue for that in the tracker.
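
The crash site above allocates _len + 1 bytes, so a bogus length in the log
would explain the bad_alloc. Below is a rough sketch of the kind of sanity
check that would let a reader report a bad entry instead of aborting. It is
written in Python for brevity and assumes a 32-bit little-endian length
prefix; decode_string() and CorruptEntry are illustrative names, not ceph
code.

```python
import struct


class CorruptEntry(Exception):
    """Raised when an encoded entry cannot be decoded safely."""


def decode_string(buf, off):
    """Decode a length-prefixed string starting at buf[off].

    Returns (data, new_offset). Assumes a 32-bit little-endian length
    prefix, which is an assumption for this sketch.
    """
    if off + 4 > len(buf):
        raise CorruptEntry("truncated length prefix at offset %d" % off)
    (length,) = struct.unpack_from("<I", buf, off)
    off += 4
    remaining = len(buf) - off
    # Reject the length before allocating anything based on it.
    if length > remaining:
        raise CorruptEntry(
            "length %d exceeds remaining %d bytes" % (length, remaining))
    return buf[off:off + length], off + length
```

A caller could catch CorruptEntry, log which pg was being loaded, and move on
to the next one.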

Thanks
Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe