---------- Forwarded message ---------
From: Xiangyang Yu <penglaiyxy@xxxxxxxxx>
Date: Tue, Oct 15, 2019 at 11:32 AM
Subject: OSD reconnected across map epochs, inconsistent pg logs created
To: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
Cc: Gregory Farnum <gfarnum@xxxxxxxxxx>

Hi cephers,

I met a rare case and ended up with inconsistent objects. Here is what happened.

Assume pg 1.1a maps to osds [1,5,9], with osd.1 as the primary.

Time 1: osd.1, osd.5 and osd.9 were online and could send messages to each other.

Time 2: osd.5 and osd.9 received a new osdmap that marked osd.1 down. At the same time, osd.1's public network was taken down manually (a physical failure), but osd.1's cluster network was still up.

Time 3: Because of the new osdmap marking osd.1 down, osd.5 and osd.9 shut down their connections to osd.1 (via mark_down()), so on their side no connection to osd.1 existed any more. On osd.1's side, the connections to osd.5/osd.9 saw a failure (they had been closed explicitly by osd.5/osd.9) and went into the STANDBY state; as a consequence, those connections still existed (their cs_seq > 0).

A short while later, a deep-scrub on osd.1 generated two operations that update object version info (scrub_snapshot_metadata()), and osd.1 went to re-establish its connections to osd.5 and osd.9. When osd.1 sent the first operation, op1 (via send_message()), the cluster messenger reconnected to osd.5/osd.9 and placed op1 in out_q. While the connection was moving to STATE_OPEN, a RESETSESSION happened between osd.1 and osd.5/osd.9, which made osd.1 discard the messages in out_q (in was_session_reset()). After the connection was established, osd.1 sent the second operation, op2, to osd.5/osd.9. Note that osd.5 and osd.9 were still processing the new osdmap and had not committed it yet, so op2 was not discarded for an epoch mismatch and was eventually applied on osd.5 and osd.9. In the end, osd.1 recorded two pg log entries (op1 and op2), while osd.5/osd.9 recorded only one (op2). (Toy models of this sequence are sketched at the end of this mail.)

Time 4: When osd.1's public network recovered shortly afterwards, the primary (osd.1) could not find any difference between its pg log and those of osd.5 and osd.9 during peering. When pg 1.1a was deep-scrubbed later, an inconsistency was reported for the object version info that op1 had touched.

This is a rare situation to run into. In some cases I think it could also cause messages to arrive out of order. If I have misdiagnosed it, please tell me. I have talked with Greg about the problem before.

This is my PR: https://github.com/ceph/ceph/pull/30609
(I insist on my pull request because, in my opinion, there is no difference between the two peers of a lossless connection: either endpoint can connect to the other. If osd.5 or osd.9 initiates the connection, then op1 on osd.1 is not discarded and the connection goes through the replace-session flow instead.)

The event is recorded in the tracker: https://tracker.ceph.com/issues/42058

Anyway, I would appreciate any advice on how to resolve the problem properly. Thanks.
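
To make the timeline above easier to follow, here is a minimal, self-contained toy model of the reconnect/reset sequence. None of the types or functions below are the real messenger code; the names (ToyConnection, peer_disconnected(), session_reset(), flush()) are made up for illustration and only loosely mirror the mark_down() / STANDBY / out_q / was_session_reset() behaviour described above.

// Toy model of the race described above. All types and function names here
// are illustrative only; they are NOT the real Ceph messenger code, just a
// sketch of the sequence of events under the assumptions in this mail.
#include <cstdint>
#include <deque>
#include <iostream>
#include <string>
#include <vector>

enum class ConnState { OPEN, STANDBY, CLOSED };

struct ToyConnection {
  ConnState state = ConnState::OPEN;
  uint64_t cs_seq = 1;                 // > 0: an established session existed
  std::deque<std::string> out_q;       // messages queued for sending

  // The peer called mark_down() on us: we only see a failure and go STANDBY,
  // keeping our session state (cs_seq stays > 0).
  void peer_disconnected() { state = ConnState::STANDBY; }

  // send_message(): queue the op; a STANDBY connection triggers a reconnect.
  void send(const std::string& op) { out_q.push_back(op); }

  // The peer answered our connect attempt with RESETSESSION: drop everything
  // we had queued (this mirrors what was_session_reset() does to out_q).
  void session_reset() {
    out_q.clear();
    cs_seq = 0;
    state = ConnState::OPEN;
  }

  // The connection is open again: deliver whatever is still queued.
  std::vector<std::string> flush() {
    std::vector<std::string> delivered(out_q.begin(), out_q.end());
    out_q.clear();
    return delivered;
  }
};

int main() {
  std::vector<std::string> primary_pg_log;   // pg log as recorded on osd.1
  std::vector<std::string> replica_pg_log;   // pg log as recorded on osd.5/9

  ToyConnection conn;          // osd.1's connection towards osd.5 (or osd.9)

  conn.peer_disconnected();    // Time 3: osd.5/osd.9 mark_down() osd.1

  // The scrub on osd.1 produces op1; queuing it kicks off the reconnect.
  conn.send("op1");
  primary_pg_log.push_back("op1");

  // During the reconnect the peer replies RESETSESSION, so op1 is dropped.
  conn.session_reset();

  // op2 is sent after the session is re-established and survives.
  conn.send("op2");
  primary_pg_log.push_back("op2");

  for (const auto& op : conn.flush())
    replica_pg_log.push_back(op);  // the replica applies only what arrived

  std::cout << "primary pg log:";
  for (const auto& e : primary_pg_log) std::cout << " " << e;
  std::cout << "\nreplica pg log:";
  for (const auto& e : replica_pg_log) std::cout << " " << e;
  std::cout << "\n";   // primary: op1 op2, replica: op2 -> divergent logs
}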
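
And here is an equally rough sketch of why the commit timing of the new osdmap on osd.5/osd.9 matters. The real checks live in the OSD/PG code (the require_same_or_newer_map()-style checks); should_discard_op() below is a hypothetical stand-in, only meant to show that a replica can reject an op based solely on maps it has already committed.

// Toy model of the epoch gate, under the assumptions in this mail.  The
// struct and function below are hypothetical; they are not the real checks.
#include <cstdint>
#include <iostream>

struct ReplicaState {
  uint32_t committed_epoch;      // newest osdmap the replica has committed
  uint32_t peer_down_epoch;      // epoch in which the sender was marked down
};

// The replica can only reason about maps it has committed.  If the epoch
// that marks the sender down is not committed yet, an op from that "down"
// sender still looks valid and is applied.
bool should_discard_op(const ReplicaState& r, uint32_t op_epoch) {
  return r.committed_epoch >= r.peer_down_epoch && op_epoch < r.peer_down_epoch;
}

int main() {
  // Say osd.1 was marked down in epoch 101 and op2 was sent against epoch 100.
  ReplicaState before{ /*committed_epoch=*/100, /*peer_down_epoch=*/101 };
  ReplicaState after { /*committed_epoch=*/101, /*peer_down_epoch=*/101 };

  std::cout << "before commit, discard op2? "
            << (should_discard_op(before, 100) ? "yes" : "no") << "\n";  // no
  std::cout << "after commit, discard op2?  "
            << (should_discard_op(after, 100) ? "yes" : "no") << "\n";   // yes
}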
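
Finally, a sketch of what the later deep-scrub ends up comparing: per-object version info across the shards. Again, these are not the real scrub structures; the map below is a hypothetical stand-in for the object version info that gets compared, just to show why the op1-only update on the primary surfaces as an inconsistency.

// Toy model of the deep-scrub comparison, not the real scrub code.
#include <iostream>
#include <map>
#include <string>

// Hypothetical per-shard view: object name -> version, loosely standing in
// for the per-object version info that scrub compares across replicas.
using ShardVersions = std::map<std::string, unsigned>;

int main() {
  // op1 bumped obj_A on the primary only; op2 (obj_B) was applied everywhere.
  ShardVersions primary = {{"obj_A", 2}, {"obj_B", 5}};
  ShardVersions replica = {{"obj_A", 1}, {"obj_B", 5}};

  for (const auto& [name, ver] : primary) {
    auto it = replica.find(name);
    if (it == replica.end() || it->second != ver)
      std::cout << "inconsistent object: " << name
                << " (primary v" << ver << ", replica v"
                << (it == replica.end() ? 0u : it->second) << ")\n";
  }
}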