Re: [PATCH 05/32] libceph, rbd: new bio handling code (aka don't clone bios)

On Sat, Mar 17, 2018 at 3:33 AM, Alex Elder <elder@xxxxxxxxxx> wrote:
> On 03/16/2018 07:37 AM, Alex Elder wrote:
>> From: Ilya Dryomov <idryomov@xxxxxxxxx>
>>
>> The reason we clone bios is to be able to give each object request
>> (and consequently each ceph_osd_data/ceph_msg_data item) its own
>> pointer to a (list of) bio(s).  The messenger then initializes its
>> cursor with cloned bio's ->bi_iter, so it knows where to start reading
>> from/writing to.  That's all the cloned bios are used for: to determine
>> each object request's starting position in the provided data buffer.
>>
>> Introduce ceph_bio_iter to do exactly that -- store position within bio
>> list (i.e. pointer to bio) + position within that bio (i.e. bvec_iter).
>>
>> Signed-off-by: Ilya Dryomov <idryomov@xxxxxxxxx>
>
> While reviewing this I had the same reaction the first time, which is
> that I really dislike the way you implemented the iterator macros.
>
> After trying to explain for a bit what I meant, I went ahead and tried
> to implement it myself.  And I do think it's better, and more understandable.
>
> Please take a look at my comments and let me know what you think.
>
>                                         -Alex
>
>> ---
>>  drivers/block/rbd.c             |  67 +++++++++++++++-----------
>>  include/linux/ceph/messenger.h  |  59 +++++++++++++++++++----
>>  include/linux/ceph/osd_client.h |  11 +++--
>>  net/ceph/messenger.c            | 101 ++++++++++++++--------------------------
>>  net/ceph/osd_client.c           |  13 ++++--
>>  5 files changed, 139 insertions(+), 112 deletions(-)
>>
>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>> index 883f17d6deeb..8eaebf609611 100644
>> --- a/drivers/block/rbd.c
>> +++ b/drivers/block/rbd.c
>> @@ -218,7 +218,7 @@ typedef void (*rbd_obj_callback_t)(struct rbd_obj_request *);
>>
>>  enum obj_request_type {
>>       OBJ_REQUEST_NODATA = 1,
>> -     OBJ_REQUEST_BIO,
>> +     OBJ_REQUEST_BIO,        /* pointer into provided bio (list) */
>>       OBJ_REQUEST_PAGES,
>>  };
>>
>> @@ -270,7 +270,7 @@ struct rbd_obj_request {
>>
>>       enum obj_request_type   type;
>>       union {
>> -             struct bio      *bio_list;
>> +             struct ceph_bio_iter    bio_pos;
>>               struct {
>>                       struct page     **pages;
>>                       u32             page_count;
>> @@ -1255,6 +1255,27 @@ static u64 rbd_segment_length(struct rbd_device *rbd_dev,
>>       return length;
>>  }
>>
>> +static void zero_bvec(struct bio_vec *bv)
>> +{
>> +     void *buf;
>> +     unsigned long flags;
>> +
>> +     buf = bvec_kmap_irq(bv, &flags);
>> +     memset(buf, 0, bv->bv_len);
>> +     flush_dcache_page(bv->bv_page);
>> +     bvec_kunmap_irq(buf, &flags);
>> +}
>> +
>> +static void zero_bios(struct ceph_bio_iter *bio_pos, u32 off, u32 bytes)
>
> Why do you add the length (bytes) argument?  Previously it just goes to
> the end of the bio chain (and I think it still does).
>
>> +{
>> +     struct ceph_bio_iter it = *bio_pos;
>> +
>> +     ceph_bio_iter_advance(&it, off);
>> +     ceph_bio_iter_advance_step(&it, bytes, ({
>> +             zero_bvec(&bv);
>> +     }));
>

No, it doesn't go until the end of the chain.  Because we no longer
clone, we operate on the original chain which can be much longer than
a single object request.  The iterator is advanced by exactly @bytes,
i.e. exactly @bytes bytes are zeroed.
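For illustration, here is a minimal user-space sketch of that behaviour
(not the kernel code: the toy_* names are made up, and the real
struct ceph_bio_iter pairs a struct bio * with a struct bvec_iter,
which this model collapses to a plain byte offset):

```c
#include <assert.h>
#include <string.h>

/* Toy model: a chain of fixed-size "bios", each just a byte buffer. */
struct toy_bio {
	unsigned char buf[8];
	struct toy_bio *next;
};

struct toy_iter {
	struct toy_bio *bio;	/* current bio in the chain */
	unsigned int off;	/* position within that bio */
};

/* Advance by n bytes, switching to the next bio when the current one
 * is exhausted -- the "switch to the next bio" logic. */
static void toy_iter_advance(struct toy_iter *it, unsigned int n)
{
	while (n) {
		unsigned int step = sizeof(it->bio->buf) - it->off;

		if (step > n)
			step = n;
		it->off += step;
		n -= step;
		if (it->off == sizeof(it->bio->buf) && it->bio->next) {
			it->bio = it->bio->next;
			it->off = 0;
		}
	}
}

/* Zero exactly @bytes starting at @off -- the chain may well continue
 * past the zeroed region, mirroring zero_bios() on an uncloned chain. */
static void toy_zero(struct toy_iter *pos, unsigned int off,
		     unsigned int bytes)
{
	struct toy_iter it = *pos;	/* work on a copy, like zero_bios() */

	toy_iter_advance(&it, off);
	while (bytes) {
		unsigned int step = sizeof(it.bio->buf) - it.off;

		if (step > bytes)
			step = bytes;
		memset(it.bio->buf + it.off, 0, step);
		toy_iter_advance(&it, step);
		bytes -= step;
	}
}
```

Note that the caller's iterator is copied before advancing, just as
zero_bios() copies *bio_pos, so the starting position is preserved.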

> I'm looking at this again, and I have to say that once again it strikes
> me as just awful.
>
> It is not at all clear in this context where "bv" comes from.  It is a
> local variable defined within ceph_bio_iter_advance_step(), which is
> a macro.  On top of that, zero_bvec() is used within ({ }); does that
> cause something unusual to happen in the macro?  I really don't
> understand it.  If you're going to do something odd like this, please
> at least offer a short explanation of *why*.

The reason I went with the macro-based approach is that this is how the
iov_iter iteration wrappers are implemented.  I wanted to reuse the
ITER_BVEC bits and add a new ITER_BIO_LIST iterator to iov_iter.  In
fact, the initial version of this commit had a sizeable lib/iov_iter.c
hunk.

The goal was to have the same piece of code do the fill-request
iterating in rbd.c and the read/write-message-data iterating in
messenger.c.  As I worked on it though, I ran into a problem with
message checksumming and some iov_iter quirks that made me realize that
getting to a point where iov_iter does all the work is a project of its
own.  Because I didn't have the details worked out, I moved a slimmed
down version initially to rbd.c and later to messenger.h.  The macros
were supposed to be hidden behind the iov_iter interface, and even
though they live in messenger.h for now, they are really just a set of
private helpers that no one should ever run into, so I'm not concerned.
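To make the ({ }) pattern less mysterious, here is a toy version of
such a macro (purely illustrative, not the actual messenger.h helpers):
the STEP argument is an ordinary statement -- ({ }) is just a GNU C
statement expression -- that the macro pastes into a loop body where a
local variable (bv in the real macros) is in scope, in the same way
bio_for_each_segment()-style iterators expose their loop variable.

```c
#include <assert.h>

/*
 * Toy iterator macro: STEP is expanded inside the loop, so it can
 * refer to bv, the per-iteration local the macro declares.  Nothing
 * unusual happens because STEP is wrapped in ({ }); it is simply a
 * block of code evaluated once per element.
 */
#define for_each_val(arr, n, STEP)				\
	do {							\
		unsigned int __i;				\
		for (__i = 0; __i < (n); __i++) {		\
			int bv = (arr)[__i];	/* "hidden" loop var */	\
			STEP;					\
		}						\
	} while (0)
```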

That is why we still have ceph_msg_data_bio_{next,advance}().  Notice
how ceph_msg_data_bio_advance() duplicates the "switch to the next bio"
logic, which for rbd.c is taken care of by ceph_bio_iter_advance().
Ideally these functions should be removed; I just didn't get to it.

>
> I'm a big fan of using macros for a variety of reasons.  But while
> this produces a very simple-looking function, it really makes it hard
> (for me anyway) to easily understand what's going on.
>
> OK, with that out of the way, here's what I see...
>
> You call ceph_bio_iter_advance() to advance the iterator to the
> given byte offset.  This boils down to traversing bios in the list
> and calling bio_advance_iter() for each until we've reached the
> one that contains the offset.  Upon return, the (copied) iterator
> will point to the bio in the chain containing the offset, and its
> iterator field will be positioned at the offset within that bio.
>
> Then you call ceph_bio_iter_advance_step() with that, telling
> it to call zero_bvec() on what turns out to be the loop variable
> representing the current bio_vec used for __bio_for_each_segment()
> buried in that macro.  This is done until the indicated number of
> bytes have been zeroed.

Correct.

>
>> +}
>> +
>>  /*
>>   * bio helpers
>>   */
>> @@ -1719,13 +1740,14 @@ rbd_img_obj_request_read_callback(struct rbd_obj_request *obj_request)
>>       rbd_assert(obj_request->type != OBJ_REQUEST_NODATA);
>>       if (obj_request->result == -ENOENT) {
>>               if (obj_request->type == OBJ_REQUEST_BIO)
>> -                     zero_bio_chain(obj_request->bio_list, 0);
>> +                     zero_bios(&obj_request->bio_pos, 0, length);
>>               else
>>                       zero_pages(obj_request->pages, 0, length);
>>               obj_request->result = 0;
>>       } else if (xferred < length && !obj_request->result) {
>>               if (obj_request->type == OBJ_REQUEST_BIO)
>> -                     zero_bio_chain(obj_request->bio_list, xferred);
>> +                     zero_bios(&obj_request->bio_pos, xferred,
>> +                               length - xferred);
>>               else
>>                       zero_pages(obj_request->pages, xferred, length);
>>       }
>> @@ -2036,11 +2058,8 @@ static void rbd_obj_request_destroy(struct kref *kref)
>>       rbd_assert(obj_request_type_valid(obj_request->type));
>>       switch (obj_request->type) {
>>       case OBJ_REQUEST_NODATA:
>> +     case OBJ_REQUEST_BIO:
>>               break;          /* Nothing to do */
>> -     case OBJ_REQUEST_BIO:
>> -             if (obj_request->bio_list)
>> -                     bio_chain_put(obj_request->bio_list);
>> -             break;
>>       case OBJ_REQUEST_PAGES:
>>               /* img_data requests don't own their page array */
>>               if (obj_request->pages &&
>> @@ -2368,7 +2387,7 @@ static void rbd_img_obj_request_fill(struct rbd_obj_request *obj_request,
>>
>>       if (obj_request->type == OBJ_REQUEST_BIO)
>>               osd_req_op_extent_osd_data_bio(osd_request, num_ops,
>> -                                     obj_request->bio_list, length);
>> +                                     &obj_request->bio_pos, length);
>>       else if (obj_request->type == OBJ_REQUEST_PAGES)
>>               osd_req_op_extent_osd_data_pages(osd_request, num_ops,
>>                                       obj_request->pages, length,
>> @@ -2396,8 +2415,7 @@ static int rbd_img_request_fill(struct rbd_img_request *img_request,
>>       struct rbd_device *rbd_dev = img_request->rbd_dev;
>>       struct rbd_obj_request *obj_request = NULL;
>>       struct rbd_obj_request *next_obj_request;
>> -     struct bio *bio_list = NULL;
>> -     unsigned int bio_offset = 0;
>> +     struct ceph_bio_iter bio_it;
>>       struct page **pages = NULL;
>>       enum obj_operation_type op_type;
>>       u64 img_offset;
>> @@ -2412,9 +2430,9 @@ static int rbd_img_request_fill(struct rbd_img_request *img_request,
>>       op_type = rbd_img_request_op_type(img_request);
>>
>>       if (type == OBJ_REQUEST_BIO) {
>> -             bio_list = data_desc;
>> +             bio_it = *(struct ceph_bio_iter *)data_desc;
>>               rbd_assert(img_offset ==
>> -                        bio_list->bi_iter.bi_sector << SECTOR_SHIFT);
>> +                        bio_it.iter.bi_sector << SECTOR_SHIFT);
>>       } else if (type == OBJ_REQUEST_PAGES) {
>>               pages = data_desc;
>>       }
>> @@ -2440,17 +2458,8 @@ static int rbd_img_request_fill(struct rbd_img_request *img_request,
>>               rbd_img_obj_request_add(img_request, obj_request);
>>
>>               if (type == OBJ_REQUEST_BIO) {
>> -                     unsigned int clone_size;
>> -
>> -                     rbd_assert(length <= (u64)UINT_MAX);
>> -                     clone_size = (unsigned int)length;
>> -                     obj_request->bio_list =
>> -                                     bio_chain_clone_range(&bio_list,
>> -                                                             &bio_offset,
>> -                                                             clone_size,
>> -                                                             GFP_NOIO);
>> -                     if (!obj_request->bio_list)
>> -                             goto out_unwind;
>> +                     obj_request->bio_pos = bio_it;
>> +                     ceph_bio_iter_advance(&bio_it, length);
>>               } else if (type == OBJ_REQUEST_PAGES) {
>>                       unsigned int page_count;
>>
>> @@ -2980,7 +2989,7 @@ static void rbd_img_parent_read(struct rbd_obj_request *obj_request)
>>
>>       if (obj_request->type == OBJ_REQUEST_BIO)
>>               result = rbd_img_request_fill(img_request, OBJ_REQUEST_BIO,
>> -                                             obj_request->bio_list);
>> +                                             &obj_request->bio_pos);
>>       else
>>               result = rbd_img_request_fill(img_request, OBJ_REQUEST_PAGES,
>>                                               obj_request->pages);
>> @@ -4093,9 +4102,13 @@ static void rbd_queue_workfn(struct work_struct *work)
>>       if (op_type == OBJ_OP_DISCARD)
>>               result = rbd_img_request_fill(img_request, OBJ_REQUEST_NODATA,
>>                                             NULL);
>> -     else
>> +     else {
>> +             struct ceph_bio_iter bio_it = { .bio = rq->bio,
>> +                                             .iter = rq->bio->bi_iter };
>> +
>>               result = rbd_img_request_fill(img_request, OBJ_REQUEST_BIO,
>> -                                           rq->bio);
>> +                                           &bio_it);
>> +     }
>>       if (result)
>>               goto err_img_request;
>>
>> diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
>> index ead9d85f1c11..d7b9605fd51d 100644
>> --- a/include/linux/ceph/messenger.h
>> +++ b/include/linux/ceph/messenger.h
>> @@ -93,14 +93,60 @@ static __inline__ bool ceph_msg_data_type_valid(enum ceph_msg_data_type type)
>>       }
>>  }
>>
>> +#ifdef CONFIG_BLOCK
>> +
>> +struct ceph_bio_iter {
>> +     struct bio *bio;
>> +     struct bvec_iter iter;
>> +};
>
> I *really* think this would be clearer as a function.  And in place of
> the STEP argument, supply a function pointer and value to supply to
> that function.  I tried and the result is below...  Forgive me for
> dumping so much code in my review.  I haven't compiled this or anything,
> but it provides something concrete that at least in my view is more
> understandable, and no more complicated (and most likely not any less
> efficient either).
>
> static void ceph_bio_advance(struct ceph_bio_iter *it, unsigned int n)
> {
>         bio_advance_iter(it->bio, &it->iter, n);
>         if (!it->iter.bi_size && it->bio->bi_next) {
>                 dout("ceph_bio_advance next bio\n");
>                 it->bio = it->bio->bi_next;
>                 it->iter = it->bio->bi_iter;
>         }
> }
>
> static void ceph_bio_apply(struct ceph_bio_iter *it, unsigned int n,
>                 void (*func)(struct rbd_obj_request *, struct bio_vec *),

struct rbd_obj_request * would need to become void * when we get to
a fully-featured striping implementation for fs/ceph.  I mentioned it
in my first reply -- "It could certainly take a (struct bio_vec *bv,
void *arg) function instead ...".  Honestly, I'm not sure it would be
that much better, and I still have that vague plan to try to get this
into lib/iov_iter.c at some point.
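For comparison, a user-space sketch of that function-pointer variant,
with the per-segment action passed as a callback plus an opaque
void *arg cookie (all names here are illustrative, not proposed kernel
API; segments are modelled as plain pointer/length pairs):

```c
#include <assert.h>
#include <stddef.h>

struct seg {			/* stand-in for one bio_vec segment */
	unsigned char *p;
	unsigned int len;
};

typedef void (*seg_fn)(struct seg *sg, void *arg);

/*
 * Apply @fn to each segment of @segs until @n bytes are consumed,
 * honouring a partial first segment via @off.  This is the shape a
 * "ceph_bio_apply(it, n, func, arg)" helper would take in place of
 * the STEP macro argument.
 */
static void apply_segments(struct seg *segs, unsigned int nsegs,
			   unsigned int off, unsigned int n,
			   seg_fn fn, void *arg)
{
	unsigned int i;

	for (i = 0; i < nsegs && n; i++) {
		struct seg cur = segs[i];

		if (off >= cur.len) {	/* segment entirely before @off */
			off -= cur.len;
			continue;
		}
		cur.p += off;
		cur.len -= off;
		off = 0;
		if (cur.len > n)
			cur.len = n;
		fn(&cur, arg);		/* indirect call per segment */
		n -= cur.len;
	}
}

/* Example callback: count the bytes visited, via the void * cookie. */
static void count_bytes(struct seg *sg, void *arg)
{
	*(unsigned int *)arg += sg->len;
}
```

The trade-off is roughly an indirect call per segment and the cast
through void *, versus the macro's inlined statement with direct access
to the caller's locals.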

Thanks,

                Ilya