tracker issue 19605

Hi Sage & Haomai,

We hit the same assert as the one reported in the tracker
(http://tracker.ceph.com/issues/19605).

It is hard to reproduce, and for now we only have limited logs. The
useful parts are posted below.
line#1 2017-05-19 22:58:05.142005 7f14f1c1e700  0 --
10.10.133.1:6823/2019 >> 10.10.133.1:6813/19500 conn(0x55e70327c000
:6823 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
l=0).handle_connect_msg accept we reset (peer sent cseq 1), sending
RESETSESSION
line#2 2017-05-19 22:58:05.142720 7f14f1c1e700  0 --
10.10.133.1:6823/2019 >> 10.10.133.1:6813/19500 conn(0x55e70327c000
:6823 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=20 cs=1
l=0).process missed message?  skipped from seq 0 to 1934764039
line#3 2017-05-19 22:58:05.142834 7f14de2a2700  0 osd.0 pg_epoch:
78440 pg[9.10cs0( v 78440'6350 (78438'4241,78440'6350] local-les=78440
n=8 ec=77949 les/c/f 78440/78440/0 78439/78439/78372)
[0,2147483647,6,7,10,1] r=0 lpr=78439 luod=78440'6348 crt=78440'6319
lcod 78440'6319 mlcod 78440'6319 active+undersized+degraded]  removing
repgather(0x55e708089000 78440'6348 rep_tid=26665 committed?=1
applied?=1 r=0)
line#4 2017-05-19 22:58:05.142849 7f14de2a2700  0 osd.0 pg_epoch:
78440 pg[9.10cs0( v 78440'6350 (78438'4241,78440'6350] local-les=78440
n=8 ec=77949 les/c/f 78440/78440/0 78439/78439/78372)
[0,2147483647,6,7,10,1] r=0 lpr=78439 luod=78440'6348 crt=78440'6319
lcod 78440'6319 mlcod 78440'6319 active+undersized+degraded]    q
front is repgather(0x55e6f92c9980 78440'6343 rep_tid=26656
committed?=0 applied?=0 r=0)
line#5 2017-05-19 22:58:05.149253 7f14de2a2700 -1
/build/ceph-12.0.2/src/osd/PrimaryLogPG.cc: In function 'void
PrimaryLogPG::eval_repop(PrimaryLogPG::RepGather*)' thread
7f14de2a2700 time 2017-05-19 22:58:05.142856
line#6 /build/ceph-12.0.2/src/osd/PrimaryLogPG.cc: 8615: FAILED
assert(repop_queue.front() == repop)

From the logs we can see that something went wrong with the connection
(line#1: the remote reconnected, but with conn_seq == 1), so we sent a
session reset to the remote OSD. The remote OSD then went into
AsyncConnection::was_session_reset() and called discard_out_queue().
Maybe this caused the ordered op-reply messages to be discarded, after
which the remote sent the replies for the following ops once the
session was reconnected. I checked OSD::ms_handle_remote_reset(); it
does nothing, so the remote will not resend the op replies, right? If
we did not assert here, the primary PG's OSD would requeue the client
op after reconnect in Objecter::ms_handle_reset().
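
To make the suspected sequence concrete, here is a minimal toy model
in Python. It is only a sketch of the hypothesis above, not Ceph code:
the class and function names are hypothetical, the two rep_tid values
are taken from line#3 and line#4, and the ops between them are left
out for brevity. It assumes that the replica drops its queued replies
on session reset and that the primary expects replies in submission
order.

```python
from collections import deque


class ReplicaOutQueue:
    """Hypothetical stand-in for the replica's outgoing message queue."""

    def __init__(self):
        self.pending = deque()

    def queue_reply(self, rep_tid):
        self.pending.append(rep_tid)

    def session_reset(self):
        # Assumption: models the suspected effect of discard_out_queue(),
        # i.e. replies that are queued but not yet sent are dropped.
        self.pending.clear()

    def flush(self):
        sent = list(self.pending)
        self.pending.clear()
        return sent


class PrimaryRepopQueue:
    """Hypothetical stand-in for the primary's in-order repop queue."""

    def __init__(self, rep_tids):
        self.queue = deque(rep_tids)

    def on_reply(self, rep_tid):
        # Same spirit as assert(repop_queue.front() == repop).
        if self.queue[0] != rep_tid:
            raise AssertionError(
                f"front is rep_tid={self.queue[0]}, "
                f"got reply for rep_tid={rep_tid}"
            )
        self.queue.popleft()


def simulate(reset_after_first_reply):
    tids = [26656, 26665]  # rep_tid values from line#4 and line#3
    primary = PrimaryRepopQueue(tids)
    out = ReplicaOutQueue()
    out.queue_reply(tids[0])
    if reset_after_first_reply:
        out.session_reset()  # the queued reply for 26656 is lost here
    out.queue_reply(tids[1])
    try:
        for tid in out.flush():
            primary.on_reply(tid)
    except AssertionError as exc:
        return f"assert: {exc}"
    return "ok"
```

Without the reset, both replies arrive in order and simulate(False)
returns "ok". With the reset, the reply for rep_tid=26656 never
arrives, so the reply for rep_tid=26665 hits the ordering check while
26656 is still at the front, which matches what line#3 and line#4
show.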

So is the problem here that the remote reconnects with conn_seq == 1?

I will check replies tomorrow, sorry!



