Hi all,

We have run into a strange problem. We use ceph v0.30 on Linux 2.6.37 and built a cluster with 20 OSDs, but we did not start all of the OSDs at once. At first we started only 10 OSDs and used 10 kernel clients to write 40 GB of data in total. We then started the other 10 OSDs (almost at the same time), and unusual things happened: a few OSDs went down, and pg dump shows many crashed PGs. The debug output is shown here.

1) For the "handle_pg_log" failure, see the attached file "handle_pg_log assert failed".

2) For "PG::_activate_committed":

2011-08-02 10:23:02.738366 460ff950 filestore(/data/osd.11) sync_entry commit took 0.066092
2011-08-02 10:23:02.738497 458fe950 osd11 36 pg[0.277( v 7'1 lc 0'0 (0'0,7'1] n=1 ec=2 les/c 7/7 36/36/36) [13] r=-1 stray m=1] _activate_committed 9, that was an old interval
2011-08-02 10:23:02.738522 458fe950 osd11 36 pg[1.39b( empty n=0 ec=2 les/c 7/28 36/36/3) [1] r=-1 stray] _activate_committed 9, that was an old interval
2011-08-02 10:23:02.738547 458fe950 osd11 36 pg[1.276( empty n=0 ec=2 les/c 7/34 36/36/36) [13] r=-1 stray] _activate_committed 9, that was an old interval
2011-08-02 10:23:02.738564 458fe950 osd11 36 pg[1.276( empty n=0 ec=2 les/c 7/34 36/36/36) [13] r=-1 stray] _finish_recovery -- stale
2011-08-02 10:23:02.738584 458fe950 osd11 36 pg[1.320( empty n=0 ec=2 les/c 7/29 36/36/9) [15] r=-1 stray] _activate_committed 9, that was an old interval
*** Caught signal (Segmentation fault) **
 in thread 0x458fe950
 1: /bin/cosd [0x6473a8]
 2: /lib/libpthread.so.0 [0x7f2a2bf9ba80]
 3: (PG::_activate_committed(unsigned int)+0x98) [0x572fe8]
 4: (C_Contexts::finish(int)+0xcc) [0x4dc74c]
 5: (Finisher::finisher_thread_entry()+0x1ab) [0x61167b]

Where did we go wrong? I cannot pin it down. Maybe the recovery workload was already very high and we should not have added more to it. But ceph v0.24.3 works smoothly, so why were the PGs marked CRASHED? I have never seen that before. Has anybody else met the same problem?

Best Regards!
Attachment:
handle_pg_log
Description: Binary data