Hi Sage & Haomai,

We hit an assert that looks the same as the one reported in the tracker (http://tracker.ceph.com/issues/19605). It's hard to reproduce and we only have limited logs for now, so I am posting the useful parts below.

line#1 2017-05-19 22:58:05.142005 7f14f1c1e700 0 -- 10.10.133.1:6823/2019 >> 10.10.133.1:6813/19500 conn(0x55e70327c000 :6823 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept we reset (peer sent cseq 1), sending RESETSESSION

line#2 2017-05-19 22:58:05.142720 7f14f1c1e700 0 -- 10.10.133.1:6823/2019 >> 10.10.133.1:6813/19500 conn(0x55e70327c000 :6823 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=20 cs=1 l=0).process missed message? skipped from seq 0 to 1934764039

line#3 2017-05-19 22:58:05.142834 7f14de2a2700 0 osd.0 pg_epoch: 78440 pg[9.10cs0( v 78440'6350 (78438'4241,78440'6350] local-les=78440 n=8 ec=77949 les/c/f 78440/78440/0 78439/78439/78372) [0,2147483647,6,7,10,1] r=0 lpr=78439 luod=78440'6348 crt=78440'6319 lcod 78440'6319 mlcod 78440'6319 active+undersized+degraded] removing repgather(0x55e708089000 78440'6348 rep_tid=26665 committed?=1 applied?=1 r=0)

line#4 2017-05-19 22:58:05.142849 7f14de2a2700 0 osd.0 pg_epoch: 78440 pg[9.10cs0( v 78440'6350 (78438'4241,78440'6350] local-les=78440 n=8 ec=77949 les/c/f 78440/78440/0 78439/78439/78372) [0,2147483647,6,7,10,1] r=0 lpr=78439 luod=78440'6348 crt=78440'6319 lcod 78440'6319 mlcod 78440'6319 active+undersized+degraded] q front is repgather(0x55e6f92c9980 78440'6343 rep_tid=26656 committed?=0 applied?=0 r=0)

line#5 2017-05-19 22:58:05.149253 7f14de2a2700 -1 /build/ceph-12.0.2/src/osd/PrimaryLogPG.cc: In function 'void PrimaryLogPG::eval_repop(PrimaryLogPG::RepGather*)' thread 7f14de2a2700 time 2017-05-19 22:58:05.142856

line#6 /build/ceph-12.0.2/src/osd/PrimaryLogPG.cc: 8615: FAILED assert(repop_queue.front() == repop)

From the log we can see that something went wrong with the connection (line#1: the remote reconnected with conn_seq == 1, shown as "cseq 1" in the log), and that we sent a session reset (RESETSESSION) to the remote OSD.
The remote OSD then goes into AsyncConnection::was_session_reset() and calls discard_out_queue(). Maybe this causes an ordered op-reply message to be discarded, so that once the session is reconnected the remote sends the reply for a later op first.

I checked OSD::ms_handle_remote_reset(): it does nothing, so the remote will not resend the discarded op reply, right? If we did not assert here, the primary PG's OSD would requeue the client op after the reconnect in Objecter::ms_handle_reset().

So is the problem here that the remote reconnected with conn_seq=1?

I will check replies tomorrow, sorry!
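
To make the suspected sequence concrete, here is a minimal standalone sketch of it. This is not Ceph code: ReplicaOutQueue, PrimaryRepopQueue and simulate() are hypothetical names made up for illustration, and the model assumes (unconfirmed) that a reply queued before the session reset is dropped and never resent. The tids are the rep_tid values from line#3 and line#4.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical stand-in for the replica's outgoing message queue.
struct ReplicaOutQueue {
  std::deque<uint64_t> pending;  // rep_tids of replies not yet sent

  void queue_reply(uint64_t tid) { pending.push_back(tid); }

  // Models the assumed effect of a session reset: queued replies are dropped.
  void discard_out_queue() { pending.clear(); }

  // Hands over everything still queued, in order, and empties the queue.
  std::vector<uint64_t> flush() {
    std::vector<uint64_t> out(pending.begin(), pending.end());
    pending.clear();
    return out;
  }
};

// Hypothetical stand-in for the primary's in-order repop queue.
struct PrimaryRepopQueue {
  std::deque<uint64_t> repop_queue;

  void submit(uint64_t tid) { repop_queue.push_back(tid); }

  // Returns false where the real code would hit
  // assert(repop_queue.front() == repop).
  bool on_reply(uint64_t tid) {
    if (repop_queue.empty() || repop_queue.front() != tid) {
      return false;
    }
    repop_queue.pop_front();
    return true;
  }
};

// Runs two ops through the model. Returns true if every reply matched the
// front of the repop queue, false if the ordering check was violated.
bool simulate(bool reset_between_replies) {
  PrimaryRepopQueue primary;
  ReplicaOutQueue replica;

  primary.submit(26656);  // older op (front of queue in line#4)
  primary.submit(26665);  // newer op (being removed in line#3)

  replica.queue_reply(26656);
  if (reset_between_replies) {
    replica.discard_out_queue();  // reply for 26656 is lost
  }
  replica.queue_reply(26665);

  for (uint64_t tid : replica.flush()) {
    if (!primary.on_reply(tid)) {
      return false;
    }
  }
  return true;
}
```

In this model, simulate(false) completes both ops in order, while simulate(true) delivers only the reply for 26665 while 26656 is still at the front of the queue, which is the same shape as the failed assert in line#6. Whether the real messenger actually drops the reply this way is exactly the open question above.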