Re: [PATCH] libceph: fix "safe" completion

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



I was looking at this code recently and had similar thoughts; I've
included the relevant parts of a conversation on this topic below. In
light of that I'm not sure if this patch is appropriate or not (it
might be; it certainly needs some cleanup IMO).
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


---------- Forwarded message ----------
From: Sage Weil <sage@xxxxxxxxxxx>
Date: Thu, Apr 18, 2013 at 1:41 PM
Subject: Re: use of osd_client's r_callback
To: Gregory Farnum <greg@xxxxxxxxxxx>
Cc: Alex Elder <elder@xxxxxxxxxxx>, Josh Durgin <josh.durgin@xxxxxxxxxxx>


It is a bit fragile, but not broken AFAICS.  We only se the FLAG_ACK flag
on the request (which controls whether we get an ACK in addition to a
COMMIT reply from the osd) for the sync write path (O_DIRECT and O_SYNC
writes).. and in that case, the callback is expecting to be called on the
ack and the commit is just clenaing up that safe/unsafe list for fsync.

I think we could clean this code up so that the names and meanings of the
callbacks are more consistent (e.g., r_commit_callback vs
r_ack_callback)...

s

On Thu, 18 Apr 2013, Gregory Farnum wrote:

> The recent series of patches to deal with deadlocks has me spending a
> little bit more time looking at the kernel client, especially after I
> didn't understand some of the comments going back and forth about safe
> and unsafe lists and callbacks (specifically, compared to their
> meanings in userspace). This investigation has left me concerned, and
> maybe confused, about when we're reporting writes as done.
>
> In particular, following on from change to "unsafe" so it no longer
> indicates any communication from the OSD (which might be fine?), I
> spent some time looking at how the other callback is handled. As best
> I can tell, the osdc's "r_callback" is used, from the caller's
> perspective (drivers/block/rbd.c and fs/ceph/*), to indicate that a
> write is permanent and it can forget the data/unlock the pages/etc ?
> this matches what the userspace clients do with a "safe" response.
> However, in looking at the handle_reply() function, it sure looks to
> me as if r_callback() is invoked on the first reply from an OSD ?
> which can in many cases be an "unsafe" reply meaning that the data
> hasn't been persisted yet! The only reference I can find to it seeing
> if the reply is a safe or unsafe reply happens during duplicate
> response checking, on the second (or subsequent) responses.
>
> Am I missing something here, or is the kernel client broken under OSD
> failures? (I note that these issues aren't going to turn up when the
> OSDs are running xfs/ext4 because on those FSes the first reply will
> always be the "safe" response, which explains why we haven't detected
> them if there are no other reasons.)
> -Greg
>
>


On Tue, Apr 30, 2013 at 6:17 AM, Yan, Zheng <zheng.z.yan@xxxxxxxxx> wrote:
> From: "Yan, Zheng" <zheng.z.yan@xxxxxxxxx>
>
> complete_request() is never called if the first OSD replay doesn't
> have ONDISK flag.
>
> Signed-off-by: Yan, Zheng <zheng.z.yan@xxxxxxxxx>
> ---
>  include/linux/ceph/osd_client.h |  1 -
>  net/ceph/osd_client.c           | 16 ++++++++--------
>  2 files changed, 8 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> index 4191cd2..70f79fb 100644
> --- a/include/linux/ceph/osd_client.h
> +++ b/include/linux/ceph/osd_client.h
> @@ -145,7 +145,6 @@ struct ceph_osd_request {
>         s32               r_reply_op_result[CEPH_OSD_MAX_OP];
>         int               r_got_reply;
>         int               r_linger;
> -       int               r_completed;
>
>         struct ceph_osd_client *r_osdc;
>         struct kref       r_kref;
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index 57d8db5..8a5fc37 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -1522,6 +1522,8 @@ static void handle_reply(struct ceph_osd_client *osdc, struct ceph_msg *msg,
>         for (i = 0; i < numops; i++)
>                 req->r_reply_op_result[i] = ceph_decode_32(&p);
>
> +       already_completed = req->r_got_reply;
> +
>         if (!req->r_got_reply) {
>
>                 req->r_result = result;
> @@ -1552,16 +1554,14 @@ static void handle_reply(struct ceph_osd_client *osdc, struct ceph_msg *msg,
>             ((flags & CEPH_OSD_FLAG_WRITE) == 0))
>                 __unregister_request(osdc, req);
>
> -       already_completed = req->r_completed;
> -       req->r_completed = 1;
>         mutex_unlock(&osdc->request_mutex);
> -       if (already_completed)
> -               goto done;
>
> -       if (req->r_callback)
> -               req->r_callback(req, msg);
> -       else
> -               complete_all(&req->r_completion);
> +       if (!already_completed) {
> +               if (req->r_callback)
> +                       req->r_callback(req, msg);
> +               else
> +                       complete_all(&req->r_completion);
> +       }
>
>         if (flags & CEPH_OSD_FLAG_ONDISK)
>                 complete_request(req);
> --
> 1.8.1.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [CEPH Users]     [Ceph Large]     [Information on CEPH]     [Linux BTRFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]
  Powered by Linux