hi,all we build cluster with 40 OSDs, and do nothing to it. after we restart the whole cluster use "/etc/init.d/ceph -a restart", recently we got the error like Bug #1486. But it provides more detail debug log. before we stop the cosd process of osd23: 2011-09-19 11:13:33.422941 7f98c3221700 -- 192.168.0.112:6804/6800 <== osd38 192.168.0.120:6801/5899 474 ==== pg_info(1 pgs e52) v1 ==== 316+0+0 (3413838934 0 0) 0x2f73e00 con 0x2c29640 2011-09-19 11:13:33.422947 7f98c3221700 osd23 64 _dispatch 0x2f73e00 pg_info(1 pgs e52) v1 2011-09-19 11:13:33.422969 7f98c3221700 osd23 64 handle_pg_info pg_info(1 pgs e52) v1 from osd38 2011-09-19 11:13:33.422981 7f98c3221700 osd23 64 require_same_or_newer_map 52 (i am 64) 0x2f73e00 2011-09-19 11:13:33.422997 7f98c3221700 osd23 64 pg[2.456( empty n=0 ec=2 les/c 24/51 17/17/17) [38,23] r=1 active] handle_info 2.45 6( empty n=0 ec=2 les/c 24/52 17/17/17) from osd38 2011-09-19 11:13:33.423014 7f98c3221700 osd23 64 pg[2.456( empty n=0 ec=2 les/c 24/52 17/17/17) [38,23] r=1 active] write_info info 309 2011-09-19 11:13:33.423024 7f98c3221700 osd23 64 pg[2.456( empty n=0 ec=2 les/c 24/52 17/17/17) [38,23] r=1 active] write_info bigbl 8 2011-09-19 11:13:33.423030 7f98c3221700 filestore(/data/osd.23) queue_transactions existing osr 0x3060c30/0x33c3b70 2011-09-19 11:13:33.423035 7f98c3221700 filestore(/data/osd.23) queue_transactions (trailing journal) 10200 0x2f986e0 2011-09-19 11:13:33.423039 7f98c3221700 filestore(/data/osd.23) _do_transaction on 0x2f986e0 2011-09-19 11:13:33.423045 7f98c3221700 filestore(/data/osd.23) collection_setattr /data/osd.23/current/2.456_head 'info' len 309 2011-09-19 11:13:33.423071 7f98c3221700 filestore(/data/osd.23) collection_setattr /data/osd.23/current/2.456_head 'info' len 309 = 309 2011-09-19 11:13:33.423079 7f98c3221700 filestore(/data/osd.23) truncate meta/pginfo_2.456/0 size 0 2011-09-19 11:13:33.423085 7f98c3221700 filestore(/data/osd.23) lfn_get cid=meta oid=pginfo_2.456/0 pathname=/data/osd.23/current/me ta/pginfo_2.456_0 lfn=pginfo_2.456_0 is_lfn=0 2011-09-19 11:13:33.423106 7f98c3221700 filestore(/data/osd.23) truncate meta/pginfo_2.456/0 size 0 = 0 2011-09-19 11:13:33.423114 7f98c3221700 filestore(/data/osd.23) write meta/pginfo_2.456/0 0~8 2011-09-19 11:13:33.423120 7f98c3221700 filestore(/data/osd.23) lfn_get cid=meta oid=pginfo_2.456/0 pathname=/data/osd.23/current/me ta/pginfo_2.456_0 lfn=pginfo_2.456_0 is_lfn=0 2011-09-19 11:13:33.423142 7f98c3221700 filestore(/data/osd.23) queue_flusher ep 208 fd 21 0~8 qlen 1 2011-09-19 11:13:33.423150 7f98c5a26700 filestore(/data/osd.23) flusher_entry awoke 2011-09-19 11:13:33.423156 7f98c5a26700 filestore(/data/osd.23) flusher_entry flushing+closing 21 ep 208 2011-09-19 11:13:33.423172 7f98c5a26700 filestore(/data/osd.23) flusher_entry sleeping 2011-09-19 11:13:33.423179 7f98c3221700 filestore(/data/osd.23) write meta/pginfo_2.456/0 0~8 = 8 2011-09-19 11:13:33.423201 7f98c3221700 -- 192.168.0.112:6804/6800 <== osd34 192.168.0.118:6801/5752 304 ==== pg_info(1 pgs e55) v1 ==== 316+0+0 (362667609 0 0) 0x36d0e00 con 0x2b96780 2011-09-19 11:13:33.423207 7f98c3221700 osd23 64 _dispatch 0x36d0e00 pg_info(1 pgs e55) v1 2011-09-19 11:13:33.423212 7f98c3221700 osd23 64 handle_pg_info pg_info(1 pgs e55) v1 from osd34 2011-09-19 11:13:33.423216 7f98c3221700 osd23 64 require_same_or_newer_map 55 (i am 64) 0x36d0e00 2011-09-19 11:13:33.423231 7f98c3221700 osd23 64 pg[1.177( empty n=0 ec=2 les/c 24/54 14/14/14) [34,23] r=1 active] handle_info 1.17 7( empty n=0 ec=2 les/c 24/55 14/14/14) from osd34 2011-09-19 11:13:33.423245 7f98c3221700 osd23 64 pg[1.177( empty n=0 ec=2 les/c 24/55 14/14/14) [34,23] r=1 active] write_info info 309 2011-09-19 11:13:33.423255 7f98c3221700 osd23 64 pg[1.177( empty n=0 ec=2 les/c 24/55 14/14/14) [34,23] r=1 active] write_info bigbl 8 2011-09-19 11:13:33.423261 7f98c3221700 filestore(/data/osd.23) queue_transactions existing osr 0x3443960/0x340db70 2011-09-19 11:13:33.423266 7f98c3221700 filestore(/data/osd.23) queue_transactions (trailing journal) 10201 0x2f98f00 2011-09-19 11:13:33.423270 7f98c3221700 filestore(/data/osd.23) _do_transaction on 0x2f98f00 2011-09-19 11:13:33.423276 7f98c3221700 filestore(/data/osd.23) collection_setattr /data/osd.23/current/1.177_head 'info' len 309 2011-09-19 11:13:33.423296 7f98c3221700 filestore(/data/osd.23) collection_setattr /data/osd.23/current/1.177_head 'info' len 309 = 309 2011-09-19 11:13:33.423303 7f98c3221700 filestore(/data/osd.23) truncate meta/pginfo_1.177/0 size 0 2011-09-19 11:13:33.423309 7f98c3221700 filestore(/data/osd.23) lfn_get cid=meta oid=pginfo_1.177/0 pathname=/data/osd.23/current/me ta/pginfo_1.177_0 lfn=pginfo_1.177_0 is_lfn=0 2011-09-19 11:13:33.423331 7f98c3221700 filestore(/data/osd.23) truncate meta/pginfo_1.177/0 size 0 = 0 2011-09-19 11:13:33.423339 7f98c3221700 filestore(/data/osd.23) write meta/pginfo_1.177/0 0~8 2011-09-19 11:13:33.423344 7f98c3221700 filestore(/data/osd.23) lfn_get cid=meta oid=pginfo_1.177/0 pathname=/data/osd.23/current/me ta/pginfo_1.177_0 lfn=pginfo_1.177_0 is_lfn=0 *** Caught signal (Terminated) ** in thread 0x7f98cdee0720. Shutting down. then start the cosd process,can not read the pginfo_1.177,so Terminated: 2011-09-19 11:13:35.836387 7fc8470a1720 osd23 64 _open_lock_pg 1.177 2011-09-19 11:13:35.836391 7fc8470a1720 osd23 64 _get_pool 1 93 -> 94 2011-09-19 11:13:35.836406 7fc8470a1720 osd23 64 pg[1.177( DNE empty n=0 ec=0 les/c 0/0 0/0/0) [] r=0 mlcod 0'0 inactive] enter Initial 2011-09-19 11:13:35.836412 7fc8470a1720 filestore(/data/osd.23) collection_getattr /data/osd.23/current/1.177_head 'info' 2011-09-19 11:13:35.836435 7fc8470a1720 filestore(/data/osd.23) collection_getattr /data/osd.23/current/1.177_head 'info' = 309 2011-09-19 11:13:35.836443 7fc8470a1720 filestore(/data/osd.23) read meta/pginfo_1.177/0 0~0 2011-09-19 11:13:35.836449 7fc8470a1720 filestore(/data/osd.23) lfn_get cid=meta oid=pginfo_1.177/0 pathname=/data/osd.23/current/meta/pginfo_1.177_0 lfn=pginfo_1.177_0 is_lfn=0 2011-09-19 11:13:35.836464 7fc8470a1720 filestore(/data/osd.23) FileStore::read meta/pginfo_1.177/0 0~0/0 *** Caught signal (Aborted) *** in thread 0x7fc8470a1720 ceph version 0.34 (commit:2f039eeeb745622b866d80feda7afa055e15f6d6) 1: /bsd/bin/cosd() [0x63e102] 2: (()+0xef60) [0x7fc846a80f60] 3: (gsignal()+0x35) [0x7fc84563f165] 4: (abort()+0x180) [0x7fc845641f70] 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fc845ed3dc5] 6: (()+0xcb166) [0x7fc845ed2166] 7: (()+0xcb193) [0x7fc845ed2193] 8: (()+0xcb28e) [0x7fc845ed228e] 9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x132) [0x491f62] 10: (void decode<unsigned int, PG::Interval>(std::map<unsigned int, PG::Interval, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, PG::Interval> > >&, ceph::buffer::list::iterator&)+0x31) [0x5a0b11] 11: (PG::read_state(ObjectStore*)+0x2d3) [0x575213] 12: (OSD::load_pgs()+0x1f5) [0x514f05] 13: (OSD::init()+0x659) [0x51fd39] 14: (main()+0x2401) [0x490321] 15: (__libc_start_main()+0xfd) [0x7fc84562bc4d] 16: /bsd/bin/cosd() [0x48cac9] it's a problem when we restart the osd between the truncate and write, so the meta/pginfo size is zero. why we should truncate before write in PG::write_info? if we got the wrong data in PG::read_state,is it reationable to terminate the cosd process? or can we get the data from peer? thanks! -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html