---------- Forwarded message ---------
From: Xiangyang Yu <penglaiyxy@xxxxxxxxx>
Date: Tue, Oct 15, 2019 at 11:32 AM
Subject: OSD reconnected across map epochs, inconsistent pg logs created
To: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
Cc: Gregory Farnum <gfarnum@xxxxxxxxxx>

Hi cephers,

I met a rare case and ended up with inconsistent objects. Here is what happened.

Assume pg 1.1a maps to osds [1,5,9], with osd.1 as the primary.

Time 1: osd.1, osd.5 and osd.9 were online and could send messages to each other.

Time 2: osd.5 and osd.9 received a new osdmap that marked osd.1 down. At the same time, osd.1's public network was taken down manually (a physical failure), but osd.1's cluster network was still up.

Time 3: Because of the new osdmap marking osd.1 down, osd.5 and osd.9 shut down their connections to osd.1 (via mark_down()), so on their side no connection to osd.1 existed any more. On osd.1's side, the connections to osd.5/osd.9 saw a failure (they had been closed explicitly by osd.5/osd.9) and went into the STANDBY state; as a consequence, those connections still existed (their cs_seq > 0).

A short while later, a deep-scrub on osd.1 generated two operations that update object version info (scrub_snapshot_metadata()), and osd.1 went to re-establish its connections to osd.5 and osd.9. When osd.1 sent the first operation, op1 (via send_message()), the cluster messenger reconnected to osd.5/osd.9 and placed op1 in out_q. While the connection was moving to STATE_OPEN, a RESETSESSION happened between osd.1 and osd.5/osd.9, which made osd.1 discard the messages in out_q (in was_session_reset()). After the connection was established, osd.1 sent the second operation, op2, to osd.5/osd.9. Note that osd.5 and osd.9 were still processing the new osdmap and had not committed it yet, so op2 was not discarded for an epoch mismatch and was eventually applied on osd.5 and osd.9. In the end, osd.1 recorded two pg log entries (op1 and op2), while osd.5/osd.9 recorded only one (op2). (Toy models of this sequence are sketched at the end of this mail.)

Time 4: When osd.1's public network recovered shortly afterwards, the primary (osd.1) could not find any difference between its pg log and those of osd.5 and osd.9 during peering. When pg 1.1a was deep-scrubbed later, an inconsistency was reported for the object version info that op1 had touched.

This is a rare situation to run into. In some cases I think it could also cause messages to arrive out of order. If I have misdiagnosed it, please tell me. I have talked with Greg about the problem before.

This is my PR: https://github.com/ceph/ceph/pull/30609
(I insist on my pull request because, in my opinion, there is no difference between the two peers of a lossless connection: either endpoint can connect to the other. If osd.5 or osd.9 initiates the connection, then op1 on osd.1 is not discarded and the connection goes through the replace-session flow instead.)

The event is recorded in the tracker: https://tracker.ceph.com/issues/42058

Anyway, I would appreciate any advice on how to resolve the problem properly. Thanks.
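
To make the timeline above easier to follow, here is a minimal, self-contained toy model of the reconnect/reset sequence. None of the types or functions below are the real messenger code; the names (ToyConnection, peer_disconnected(), session_reset(), flush()) are made up for illustration and only loosely mirror the mark_down() / STANDBY / out_q / was_session_reset() behaviour described above.

// Toy model of the race described above. All types and function names here
// are illustrative only; they are NOT the real Ceph messenger code, just a
// sketch of the sequence of events under the assumptions in this mail.
#include <cstdint>
#include <deque>
#include <iostream>
#include <string>
#include <vector>

enum class ConnState { OPEN, STANDBY, CLOSED };

struct ToyConnection {
  ConnState state = ConnState::OPEN;
  uint64_t cs_seq = 1;                 // > 0: an established session existed
  std::deque<std::string> out_q;       // messages queued for sending

  // The peer called mark_down() on us: we only see a failure and go STANDBY,
  // keeping our session state (cs_seq stays > 0).
  void peer_disconnected() { state = ConnState::STANDBY; }

  // send_message(): queue the op; a STANDBY connection triggers a reconnect.
  void send(const std::string& op) { out_q.push_back(op); }

  // The peer answered our connect attempt with RESETSESSION: drop everything
  // we had queued (this mirrors what was_session_reset() does to out_q).
  void session_reset() {
    out_q.clear();
    cs_seq = 0;
    state = ConnState::OPEN;
  }

  // The connection is open again: deliver whatever is still queued.
  std::vector<std::string> flush() {
    std::vector<std::string> delivered(out_q.begin(), out_q.end());
    out_q.clear();
    return delivered;
  }
};

int main() {
  std::vector<std::string> primary_pg_log;   // pg log as recorded on osd.1
  std::vector<std::string> replica_pg_log;   // pg log as recorded on osd.5/9

  ToyConnection conn;          // osd.1's connection towards osd.5 (or osd.9)

  conn.peer_disconnected();    // Time 3: osd.5/osd.9 mark_down() osd.1

  // The scrub on osd.1 produces op1; queuing it kicks off the reconnect.
  conn.send("op1");
  primary_pg_log.push_back("op1");

  // During the reconnect the peer replies RESETSESSION, so op1 is dropped.
  conn.session_reset();

  // op2 is sent after the session is re-established and survives.
  conn.send("op2");
  primary_pg_log.push_back("op2");

  for (const auto& op : conn.flush())
    replica_pg_log.push_back(op);  // the replica applies only what arrived

  std::cout << "primary pg log:";
  for (const auto& e : primary_pg_log) std::cout << " " << e;
  std::cout << "\nreplica pg log:";
  for (const auto& e : replica_pg_log) std::cout << " " << e;
  std::cout << "\n";   // primary: op1 op2, replica: op2 -> divergent logs
}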
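
And here is an equally rough sketch of why the commit timing of the new osdmap on osd.5/osd.9 matters. The real checks live in the OSD/PG code (the require_same_or_newer_map()-style checks); should_discard_op() below is a hypothetical stand-in, only meant to show that a replica can reject an op based solely on maps it has already committed.

// Toy model of the epoch gate, under the assumptions in this mail.  The
// struct and function below are hypothetical; they are not the real checks.
#include <cstdint>
#include <iostream>

struct ReplicaState {
  uint32_t committed_epoch;      // newest osdmap the replica has committed
  uint32_t peer_down_epoch;      // epoch in which the sender was marked down
};

// The replica can only reason about maps it has committed.  If the epoch
// that marks the sender down is not committed yet, an op from that "down"
// sender still looks valid and is applied.
bool should_discard_op(const ReplicaState& r, uint32_t op_epoch) {
  return r.committed_epoch >= r.peer_down_epoch && op_epoch < r.peer_down_epoch;
}

int main() {
  // Say osd.1 was marked down in epoch 101 and op2 was sent against epoch 100.
  ReplicaState before{ /*committed_epoch=*/100, /*peer_down_epoch=*/101 };
  ReplicaState after { /*committed_epoch=*/101, /*peer_down_epoch=*/101 };

  std::cout << "before commit, discard op2? "
            << (should_discard_op(before, 100) ? "yes" : "no") << "\n";  // no
  std::cout << "after commit, discard op2?  "
            << (should_discard_op(after, 100) ? "yes" : "no") << "\n";   // yes
}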
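
Finally, a sketch of what the later deep-scrub ends up comparing: per-object version info across the shards. Again, these are not the real scrub structures; the map below is a hypothetical stand-in for the object version info that gets compared, just to show why the op1-only update on the primary surfaces as an inconsistency.

// Toy model of the deep-scrub comparison, not the real scrub code.
#include <iostream>
#include <map>
#include <string>

// Hypothetical per-shard view: object name -> version, loosely standing in
// for the per-object version info that scrub compares across replicas.
using ShardVersions = std::map<std::string, unsigned>;

int main() {
  // op1 bumped obj_A on the primary only; op2 (obj_B) was applied everywhere.
  ShardVersions primary = {{"obj_A", 2}, {"obj_B", 5}};
  ShardVersions replica = {{"obj_A", 1}, {"obj_B", 5}};

  for (const auto& [name, ver] : primary) {
    auto it = replica.find(name);
    if (it == replica.end() || it->second != ver)
      std::cout << "inconsistent object: " << name
                << " (primary v" << ver << ", replica v"
                << (it == replica.end() ? 0u : it->second) << ")\n";
  }
}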