I was looking at this code recently and had similar thoughts; I've included the relevant parts of a conversation on this topic below. In light of that I'm not sure if this patch is appropriate or not (it might be; it certainly needs some cleanup IMO). -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ---------- Forwarded message ---------- From: Sage Weil <sage@xxxxxxxxxxx> Date: Thu, Apr 18, 2013 at 1:41 PM Subject: Re: use of osd_client's r_callback To: Gregory Farnum <greg@xxxxxxxxxxx> Cc: Alex Elder <elder@xxxxxxxxxxx>, Josh Durgin <josh.durgin@xxxxxxxxxxx> It is a bit fragile, but not broken AFAICS. We only se the FLAG_ACK flag on the request (which controls whether we get an ACK in addition to a COMMIT reply from the osd) for the sync write path (O_DIRECT and O_SYNC writes).. and in that case, the callback is expecting to be called on the ack and the commit is just clenaing up that safe/unsafe list for fsync. I think we could clean this code up so that the names and meanings of the callbacks are more consistent (e.g., r_commit_callback vs r_ack_callback)... s On Thu, 18 Apr 2013, Gregory Farnum wrote: > The recent series of patches to deal with deadlocks has me spending a > little bit more time looking at the kernel client, especially after I > didn't understand some of the comments going back and forth about safe > and unsafe lists and callbacks (specifically, compared to their > meanings in userspace). This investigation has left me concerned, and > maybe confused, about when we're reporting writes as done. > > In particular, following on from change to "unsafe" so it no longer > indicates any communication from the OSD (which might be fine?), I > spent some time looking at how the other callback is handled. As best > I can tell, the osdc's "r_callback" is used, from the caller's > perspective (drivers/block/rbd.c and fs/ceph/*), to indicate that a > write is permanent and it can forget the data/unlock the pages/etc ? > this matches what the userspace clients do with a "safe" response. > However, in looking at the handle_reply() function, it sure looks to > me as if r_callback() is invoked on the first reply from an OSD ? > which can in many cases be an "unsafe" reply meaning that the data > hasn't been persisted yet! The only reference I can find to it seeing > if the reply is a safe or unsafe reply happens during duplicate > response checking, on the second (or subsequent) responses. > > Am I missing something here, or is the kernel client broken under OSD > failures? (I note that these issues aren't going to turn up when the > OSDs are running xfs/ext4 because on those FSes the first reply will > always be the "safe" response, which explains why we haven't detected > them if there are no other reasons.) > -Greg > > On Tue, Apr 30, 2013 at 6:17 AM, Yan, Zheng <zheng.z.yan@xxxxxxxxx> wrote: > From: "Yan, Zheng" <zheng.z.yan@xxxxxxxxx> > > complete_request() is never called if the first OSD replay doesn't > have ONDISK flag. > > Signed-off-by: Yan, Zheng <zheng.z.yan@xxxxxxxxx> > --- > include/linux/ceph/osd_client.h | 1 - > net/ceph/osd_client.c | 16 ++++++++-------- > 2 files changed, 8 insertions(+), 9 deletions(-) > > diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h > index 4191cd2..70f79fb 100644 > --- a/include/linux/ceph/osd_client.h > +++ b/include/linux/ceph/osd_client.h > @@ -145,7 +145,6 @@ struct ceph_osd_request { > s32 r_reply_op_result[CEPH_OSD_MAX_OP]; > int r_got_reply; > int r_linger; > - int r_completed; > > struct ceph_osd_client *r_osdc; > struct kref r_kref; > diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c > index 57d8db5..8a5fc37 100644 > --- a/net/ceph/osd_client.c > +++ b/net/ceph/osd_client.c > @@ -1522,6 +1522,8 @@ static void handle_reply(struct ceph_osd_client *osdc, struct ceph_msg *msg, > for (i = 0; i < numops; i++) > req->r_reply_op_result[i] = ceph_decode_32(&p); > > + already_completed = req->r_got_reply; > + > if (!req->r_got_reply) { > > req->r_result = result; > @@ -1552,16 +1554,14 @@ static void handle_reply(struct ceph_osd_client *osdc, struct ceph_msg *msg, > ((flags & CEPH_OSD_FLAG_WRITE) == 0)) > __unregister_request(osdc, req); > > - already_completed = req->r_completed; > - req->r_completed = 1; > mutex_unlock(&osdc->request_mutex); > - if (already_completed) > - goto done; > > - if (req->r_callback) > - req->r_callback(req, msg); > - else > - complete_all(&req->r_completion); > + if (!already_completed) { > + if (req->r_callback) > + req->r_callback(req, msg); > + else > + complete_all(&req->r_completion); > + } > > if (flags & CEPH_OSD_FLAG_ONDISK) > complete_request(req); > -- > 1.8.1.4 > > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html