With this patch, the issue isn't encountered in my environment (more than 20 runs of tests). Tested-by: Jerry Lee <leisurelysw24@xxxxxxxxx> Thanks! On Wed, 21 Aug 2019 at 22:56, Jeff Layton <jlayton@xxxxxxxxxx> wrote: > > On Wed, 2019-08-21 at 14:07 +0200, Ilya Dryomov wrote: > > We can't rely on ->peer_features in calc_target() because it may be > > called both when the OSD session is established and open and when it's > > not. ->peer_features is not valid unless the OSD session is open. If > > this happens on a PG split (pg_num increase), that could mean we don't > > resend a request that should have been resent, hanging the client > > indefinitely. > > > > In userspace this was fixed by looking at require_osd_release and > > get_xinfo[osd].features fields of the osdmap. However these fields > > belong to the OSD section of the osdmap, which the kernel doesn't > > decode (only the client section is decoded). > > > > Instead, let's drop this feature check. It effectively checks for > > luminous, so only pre-luminous OSDs would be affected in that on a PG > > split the kernel might resend a request that should not have been > > resent. Duplicates can occur in other scenarios, so both sides should > > already be prepared for them: see dup/replay logic on the OSD side and > > retry_attempt check on the client side. > > > > Cc: stable@xxxxxxxxxxxxxxx > > Fixes: 7de030d6b10a ("libceph: resend on PG splits if OSD has RESEND_ON_SPLIT") > > Reported-by: Jerry Lee <leisurelysw24@xxxxxxxxx> > > Signed-off-by: Ilya Dryomov <idryomov@xxxxxxxxx> > > --- > > net/ceph/osd_client.c | 9 ++++----- > > 1 file changed, 4 insertions(+), 5 deletions(-) > > > > diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c > > index fed6b0334609..4e78d1ddd441 100644 > > --- a/net/ceph/osd_client.c > > +++ b/net/ceph/osd_client.c > > @@ -1514,7 +1514,7 @@ static enum calc_target_result calc_target(struct ceph_osd_client *osdc, > > struct ceph_osds up, acting; > > bool force_resend = false; > > bool unpaused = false; > > - bool legacy_change; > > + bool legacy_change = false; > > bool split = false; > > bool sort_bitwise = ceph_osdmap_flag(osdc, CEPH_OSDMAP_SORTBITWISE); > > bool recovery_deletes = ceph_osdmap_flag(osdc, > > @@ -1602,15 +1602,14 @@ static enum calc_target_result calc_target(struct ceph_osd_client *osdc, > > t->osd = acting.primary; > > } > > > > - if (unpaused || legacy_change || force_resend || > > - (split && con && CEPH_HAVE_FEATURE(con->peer_features, > > - RESEND_ON_SPLIT))) > > + if (unpaused || legacy_change || force_resend || split) > > ct_res = CALC_TARGET_NEED_RESEND; > > else > > ct_res = CALC_TARGET_NO_ACTION; > > > > out: > > - dout("%s t %p -> ct_res %d osd %d\n", __func__, t, ct_res, t->osd); > > + dout("%s t %p -> %d%d%d%d ct_res %d osd%d\n", __func__, t, unpaused, > > + legacy_change, force_resend, split, ct_res, t->osd); > > return ct_res; > > } > > > > Reviewed-by: Jeff Layton <jlayton@xxxxxxxxxx> >