osd recovery failed

huang jun <hjwsm1989@xxxxxxxxx> · Tue, 2 Aug 2011 21:46:00 +0800

Hi all,
We have hit a strange problem.
We use ceph v0.30 and Linux 2.6.37.
We built a cluster with 20 OSDs, but did not start all of them at once.
At first we started only 10 OSDs,
and then used 10 kernel clients to write 40GB of data in total.
Then we started the other 10 OSDs (almost at the same time), and unusual things happened:
a few OSDs went down (the debug output is shown below), and there are many
crashed PGs in pg dump.

1) Info about "handle_pg_log"
See the attached file "handle_pg_log assert failed".

2) Info about "PG::_activate_committed"
2011-08-02 10:23:02.738366 460ff950 filestore(/data/osd.11) sync_entry
commit took 0.066092
2011-08-02 10:23:02.738497 458fe950 osd11 36 pg[0.277( v 7'1 lc 0'0
(0'0,7'1] n=1 ec=2 les/c 7/7 36/36/36) [13] r=-1 stray m=1]
_activate_committed 9, that was an old interval
2011-08-02 10:23:02.738522 458fe950 osd11 36 pg[1.39b( empty n=0 ec=2
les/c 7/28 36/36/3) [1] r=-1 stray] _activate_committed 9, that was an
old interval
2011-08-02 10:23:02.738547 458fe950 osd11 36 pg[1.276( empty n=0 ec=2
les/c 7/34 36/36/36) [13] r=-1 stray] _activate_committed 9, that was
an old interval
2011-08-02 10:23:02.738564 458fe950 osd11 36 pg[1.276( empty n=0 ec=2
les/c 7/34 36/36/36) [13] r=-1 stray] _finish_recovery -- stale
2011-08-02 10:23:02.738584 458fe950 osd11 36 pg[1.320( empty n=0 ec=2
les/c 7/29 36/36/9) [15] r=-1 stray] _activate_committed 9, that was
an old interval
*** Caught signal (Segmentation fault) **
 in thread 0x458fe950
 1: /bin/cosd [0x6473a8]
 2: /lib/libpthread.so.0 [0x7f2a2bf9ba80]
 3: (PG::_activate_committed(unsigned int)+0x98) [0x572fe8]
 4: (C_Contexts::finish(int)+0xcc) [0x4dc74c]
 5: (Finisher::finisher_thread_entry()+0x1ab) [0x61167b]

Where did we go wrong? I cannot pinpoint it. Maybe the recovery
workload is already very high and we shouldn't add more to it, but ceph
v0.24.3 works smoothly. So why were the PGs marked CRASHED? I have never
seen that before.
Has anybody else met the same problem?

Best Regards!
Attachment: handle_pg_log (binary data)