On Sat, Mar 17, 2018 at 3:33 AM, Alex Elder <elder@xxxxxxxxxx> wrote:
> On 03/16/2018 07:37 AM, Alex Elder wrote:
>> From: Ilya Dryomov <idryomov@xxxxxxxxx>
>>
>> The reason we clone bios is to be able to give each object request
>> (and consequently each ceph_osd_data/ceph_msg_data item) its own
>> pointer to a (list of) bio(s).  The messenger then initializes its
>> cursor with the cloned bio's ->bi_iter, so it knows where to start
>> reading from/writing to.  That's all the cloned bios are used for:
>> to determine each object request's starting position in the provided
>> data buffer.
>>
>> Introduce ceph_bio_iter to do exactly that -- store the position
>> within the bio list (i.e. a pointer to a bio) + the position within
>> that bio (i.e. a bvec_iter).
>>
>> Signed-off-by: Ilya Dryomov <idryomov@xxxxxxxxx>
>
> While reviewing this I had the same reaction I had the first time,
> which is that I really dislike the way you implemented the iterator
> macros.
>
> After trying to explain for a bit what I meant, I went ahead and
> tried to implement it myself.  And I do think it's better, and more
> understandable.
>
> Please take a look at my comments and let me know what you think.
>
>                                         -Alex
>
>> ---
>>  drivers/block/rbd.c             |  67 +++++++++++++++-----------
>>  include/linux/ceph/messenger.h  |  59 +++++++++++++++++++----
>>  include/linux/ceph/osd_client.h |  11 +++--
>>  net/ceph/messenger.c            | 101 ++++++++++++++--------------------------
>>  net/ceph/osd_client.c           |  13 ++++--
>>  5 files changed, 139 insertions(+), 112 deletions(-)
>>
>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>> index 883f17d6deeb..8eaebf609611 100644
>> --- a/drivers/block/rbd.c
>> +++ b/drivers/block/rbd.c
>> @@ -218,7 +218,7 @@ typedef void (*rbd_obj_callback_t)(struct rbd_obj_request *);
>>
>>  enum obj_request_type {
>>          OBJ_REQUEST_NODATA = 1,
>> -        OBJ_REQUEST_BIO,
>> +        OBJ_REQUEST_BIO,        /* pointer into provided bio (list) */
>>          OBJ_REQUEST_PAGES,
>>  };
>>
>> @@ -270,7 +270,7 @@ struct rbd_obj_request {
>>
>>          enum obj_request_type type;
>>          union {
>> -                struct bio *bio_list;
>> +                struct ceph_bio_iter bio_pos;
>>                  struct {
>>                          struct page **pages;
>>                          u32 page_count;
>> @@ -1255,6 +1255,27 @@ static u64 rbd_segment_length(struct rbd_device *rbd_dev,
>>          return length;
>>  }
>>
>> +static void zero_bvec(struct bio_vec *bv)
>> +{
>> +        void *buf;
>> +        unsigned long flags;
>> +
>> +        buf = bvec_kmap_irq(bv, &flags);
>> +        memset(buf, 0, bv->bv_len);
>> +        flush_dcache_page(bv->bv_page);
>> +        bvec_kunmap_irq(buf, &flags);
>> +}
>> +
>> +static void zero_bios(struct ceph_bio_iter *bio_pos, u32 off, u32 bytes)
>
> Why do you add the length (bytes) argument?  Previously it just went
> to the end of the bio chain (and I think it still does).
>
>> +{
>> +        struct ceph_bio_iter it = *bio_pos;
>> +
>> +        ceph_bio_iter_advance(&it, off);
>> +        ceph_bio_iter_advance_step(&it, bytes, ({
>> +                zero_bvec(&bv);
>> +        }));

No, it doesn't go until the end of the chain.  Because we no longer
clone, we operate on the original chain, which can be much longer than
a single object request.  The iterator is advanced by exactly @bytes,
i.e. exactly @bytes bytes are zeroed.

> I'm looking at this again, and I have to say that once again it
> strikes me as just awful.
>
> It is not at all clear in this context where "bv" comes from.  It is
> a local variable defined within ceph_bio_iter_advance_step(), which
> is a macro.  On top of that, zero_bvec() is used within ({ }); does
> that cause something unusual to happen in the macro?  I really don't
> understand it.
> If you're going to do something odd like this, please at least offer
> a short explanation of *why*.

The reason I went with the macro-based approach is that this is how
the iov_iter iterator wrapper is implemented.  I wanted to reuse the
ITER_BVEC bits and add a new ITER_BIO_LIST iterator to iov_iter.  In
fact, the initial version of this commit had a sizeable lib/iov_iter.c
hunk.  The goal was to have the same piece of code do the fill-request
iterating in rbd.c and the read/write-message-data iterating in
messenger.c.

As I worked on it though, I ran into a problem with message
checksumming and some iov_iter quirks that made me realize that
getting to the point where iov_iter does the whole job is a project of
its own.  Because I didn't have the details worked out, I moved a
slimmed-down version initially to rbd.c and later to messenger.h.  The
macros were supposed to be hidden behind the iov_iter interface, and
even though they are in messenger.h for now, it's really just a set of
private helpers that no one should ever run into, so I'm not
concerned.

That is why we still have ceph_msg_data_bio_{next,advance}().  Notice
how ceph_msg_data_bio_advance() duplicates the "switch to the next
bio" logic, which for rbd.c is taken care of by
ceph_bio_iter_advance().  Ideally these functions should be removed; I
just didn't get to it.

>
> I'm a big fan of using macros for a variety of reasons.  But while
> this produces a very simple-looking function, it really makes it
> hard (for me anyway) to easily understand what's going on.
>
> OK, with that out of the way, here's what I see...
>
> You call ceph_bio_iter_advance() to advance the iterator to the
> given byte offset.  This boils down to traversing bios in the list
> and calling bio_advance_iter() for each until we've reached the
> one that contains the offset.  Upon return, the (copied) iterator
> will point to the bio in the chain containing the offset, and its
> iterator field will be positioned at the offset within that bio.
>
> Then you call ceph_bio_iter_advance_step() with that, telling
> it to call zero_bvec() on what turns out to be the loop variable
> representing the current bio_vec used for __bio_for_each_segment()
> buried in that macro.  This is done until the indicated number of
> bytes have been zeroed.

Correct.
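To spell the mechanics out for anyone following along: both macros
expand to a common stepping loop, roughly like this (a condensed
sketch, not the verbatim messenger.h text):

/*
 * Rough shape of the common stepping loop: advance @it by @n bytes,
 * evaluating STEP once per contiguous run before moving the iterator
 * (and hopping to the next bio in the chain when the current one is
 * exhausted).
 */
#define __ceph_bio_iter_advance_step(it, n, STEP) do {                  \
        unsigned int __n = (n), __cur_n;                                \
                                                                        \
        while (__n) {                                                   \
                __cur_n = min((it)->iter.bi_size, __n);                 \
                (void)(STEP);                                           \
                bio_advance_iter((it)->bio, &(it)->iter, __cur_n);      \
                if (!(it)->iter.bi_size && (it)->bio->bi_next) {        \
                        (it)->bio = (it)->bio->bi_next;                 \
                        (it)->iter = (it)->bio->bi_iter;                \
                }                                                       \
                __n -= __cur_n;                                         \
        }                                                               \
} while (0)

ceph_bio_iter_advance() passes 0 for STEP, while
ceph_bio_iter_advance_step() passes a statement expression that
declares a local struct bio_vec bv and runs __bio_for_each_segment()
over the current __cur_n bytes, evaluating the caller-supplied
expression for each segment.  That is where the bv used by zero_bios()
comes from, and the ({ }) is just GCC statement expression syntax --
nothing unusual happens inside the macro.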
>
>> +}
>> +
>>  /*
>>   * bio helpers
>>   */
>> @@ -1719,13 +1740,14 @@ rbd_img_obj_request_read_callback(struct rbd_obj_request *obj_request)
>>          rbd_assert(obj_request->type != OBJ_REQUEST_NODATA);
>>          if (obj_request->result == -ENOENT) {
>>                  if (obj_request->type == OBJ_REQUEST_BIO)
>> -                        zero_bio_chain(obj_request->bio_list, 0);
>> +                        zero_bios(&obj_request->bio_pos, 0, length);
>>                  else
>>                          zero_pages(obj_request->pages, 0, length);
>>                  obj_request->result = 0;
>>          } else if (xferred < length && !obj_request->result) {
>>                  if (obj_request->type == OBJ_REQUEST_BIO)
>> -                        zero_bio_chain(obj_request->bio_list, xferred);
>> +                        zero_bios(&obj_request->bio_pos, xferred,
>> +                                  length - xferred);
>>                  else
>>                          zero_pages(obj_request->pages, xferred, length);
>>          }
>> @@ -2036,11 +2058,8 @@ static void rbd_obj_request_destroy(struct kref *kref)
>>          rbd_assert(obj_request_type_valid(obj_request->type));
>>          switch (obj_request->type) {
>>          case OBJ_REQUEST_NODATA:
>> +        case OBJ_REQUEST_BIO:
>>                  break;          /* Nothing to do */
>> -        case OBJ_REQUEST_BIO:
>> -                if (obj_request->bio_list)
>> -                        bio_chain_put(obj_request->bio_list);
>> -                break;
>>          case OBJ_REQUEST_PAGES:
>>                  /* img_data requests don't own their page array */
>>                  if (obj_request->pages &&
>> @@ -2368,7 +2387,7 @@ static void rbd_img_obj_request_fill(struct rbd_obj_request *obj_request,
>>
>>          if (obj_request->type == OBJ_REQUEST_BIO)
>>                  osd_req_op_extent_osd_data_bio(osd_request, num_ops,
>> -                                        obj_request->bio_list, length);
>> +                                        &obj_request->bio_pos, length);
>>          else if (obj_request->type == OBJ_REQUEST_PAGES)
>>                  osd_req_op_extent_osd_data_pages(osd_request, num_ops,
>>                                          obj_request->pages, length,
>> @@ -2396,8 +2415,7 @@ static int rbd_img_request_fill(struct rbd_img_request *img_request,
>>          struct rbd_device *rbd_dev = img_request->rbd_dev;
>>          struct rbd_obj_request *obj_request = NULL;
>>          struct rbd_obj_request *next_obj_request;
>> -        struct bio *bio_list = NULL;
>> -        unsigned int bio_offset = 0;
>> +        struct ceph_bio_iter bio_it;
>>          struct page **pages = NULL;
>>          enum obj_operation_type op_type;
>>          u64 img_offset;
>> @@ -2412,9 +2430,9 @@ static int rbd_img_request_fill(struct rbd_img_request *img_request,
>>          op_type = rbd_img_request_op_type(img_request);
>>
>>          if (type == OBJ_REQUEST_BIO) {
>> -                bio_list = data_desc;
>> +                bio_it = *(struct ceph_bio_iter *)data_desc;
>>                  rbd_assert(img_offset ==
>> -                           bio_list->bi_iter.bi_sector << SECTOR_SHIFT);
>> +                           bio_it.iter.bi_sector << SECTOR_SHIFT);
>>          } else if (type == OBJ_REQUEST_PAGES) {
>>                  pages = data_desc;
>>          }
>> @@ -2440,17 +2458,8 @@ static int rbd_img_request_fill(struct rbd_img_request *img_request,
>>                  rbd_img_obj_request_add(img_request, obj_request);
>>
>>                  if (type == OBJ_REQUEST_BIO) {
>> -                        unsigned int clone_size;
>> -
>> -                        rbd_assert(length <= (u64)UINT_MAX);
>> -                        clone_size = (unsigned int)length;
>> -                        obj_request->bio_list =
>> -                                bio_chain_clone_range(&bio_list,
>> -                                                      &bio_offset,
>> -                                                      clone_size,
>> -                                                      GFP_NOIO);
>> -                        if (!obj_request->bio_list)
>> -                                goto out_unwind;
>> +                        obj_request->bio_pos = bio_it;
>> +                        ceph_bio_iter_advance(&bio_it, length);
>>                  } else if (type == OBJ_REQUEST_PAGES) {
>>                          unsigned int page_count;
>>
>> @@ -2980,7 +2989,7 @@ static void rbd_img_parent_read(struct rbd_obj_request *obj_request)
>>
>>          if (obj_request->type == OBJ_REQUEST_BIO)
>>                  result = rbd_img_request_fill(img_request, OBJ_REQUEST_BIO,
>> -                                              obj_request->bio_list);
>> +                                              &obj_request->bio_pos);
>>          else
>>                  result = rbd_img_request_fill(img_request, OBJ_REQUEST_PAGES,
>>                                                obj_request->pages);
>> @@ -4093,9 +4102,13 @@ static void rbd_queue_workfn(struct work_struct *work)
>>          if (op_type == OBJ_OP_DISCARD)
>>                  result = rbd_img_request_fill(img_request, OBJ_REQUEST_NODATA,
>>                                                NULL);
>> -        else
>> +        else {
>> +                struct ceph_bio_iter bio_it = { .bio = rq->bio,
>> +                                                .iter = rq->bio->bi_iter };
>> +
>>                  result = rbd_img_request_fill(img_request, OBJ_REQUEST_BIO,
>> -                                              rq->bio);
>> +                                              &bio_it);
>> +        }
>>          if (result)
>>                  goto err_img_request;
>>
>> diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
>> index ead9d85f1c11..d7b9605fd51d 100644
>> --- a/include/linux/ceph/messenger.h
>> +++ b/include/linux/ceph/messenger.h
>> @@ -93,14 +93,60 @@ static __inline__ bool ceph_msg_data_type_valid(enum ceph_msg_data_type type)
>>          }
>>  }
>>
>> +#ifdef CONFIG_BLOCK
>> +
>> +struct ceph_bio_iter {
>> +        struct bio *bio;
>> +        struct bvec_iter iter;
>> +};
>
> I *really* think this would be clearer as a function.  And in place
> of the STEP argument, supply a function pointer and value to supply
> to that function.  I tried and the result is below...  Forgive me
> for dumping so much code in my review.  I haven't compiled this or
> anything, but it provides something concrete that at least in my
> view is more understandable, and no more complicated (and most
> likely not any less efficient either).
>
> static void ceph_bio_advance(struct ceph_bio_iter *it, unsigned int n)
> {
>         bio_advance_iter(it->bio, &it->iter, n);
>         if (!it->iter.bi_size && it->bio->bi_next) {
>                 dout("ceph_bio_advance next bio\n");
>                 it->bio = it->bio->bi_next;
>                 it->iter = it->bio->bi_iter;
>         }
> }
>
> static void ceph_bio_apply(struct ceph_bio_iter *it, unsigned int n,
>         void (*func)(struct rbd_obj_request *, struct bio_vec *),

struct rbd_obj_request * would need to become void * when we get to a
fully-featured striping implementation for fs/ceph.  I mentioned it in
my first reply -- "It could certainly take a (struct bio_vec *bv,
void *arg) function instead ...".

Honestly, I'm not sure it would be that much better, and I still have
that vague plan to try to get it into lib/iov_iter.c at some point.
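FWIW, what I had in mind for the (struct bio_vec *bv, void *arg)
variant is along these lines (a rough, untested sketch reusing your
ceph_bio_advance() from above; ceph_bio_iter_apply is a made-up name):

static void ceph_bio_iter_apply(struct ceph_bio_iter *it, unsigned int n,
                                void (*func)(struct bio_vec *bv, void *arg),
                                void *arg)
{
        /* assumes @n doesn't run past the end of the bio chain */
        while (n) {
                struct bio_vec bv = bio_iter_iovec(it->bio, it->iter);

                /* clamp to the requested byte count before applying */
                bv.bv_len = min(bv.bv_len, n);
                func(&bv, arg);
                ceph_bio_advance(it, bv.bv_len);
                n -= bv.bv_len;
        }
}

Thanks,

                Ilya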