Hi Sagi,

On Sun, Feb 14, 2016 at 11:22 PM, Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
> Adding Ming to Cc.
>
> But I don't think simply not cloning the biovecs is the right thing
> to do in the end. This must be something with the bvec iterators.

I agree with Christoph; there may well be an issue somewhere.

> From the log:
> iser: sg[0] dma_addr:0x85FC06000 off:0x0 sz:0x200 dma_len:0x200
> iser: sg[1] dma_addr:0x860334000 off:0x0 sz:0x200 dma_len:0x200 <-- gap
> iser: sg[2] dma_addr:0x860335000 off:0x0 sz:0x200 dma_len:0x200 <-- gap

The gaps above shouldn't appear, because blk_bio_segment_split()
splits off a new bio whenever it detects a gap.

Something like the following can be added to the driver or its prep_fn
to check whether the bvecs of the request are correct (a fuller sketch
is in the P.S. at the end of this mail):

rq_for_each_segment(bvec, sc->request, iter) {
	/* check for a gap between this bvec and the previous one */
}

I don't have an iser setup to test with, but everything looks fine
after I set the virt boundary to 4095 for null_blk with the attached
patch.

> Full quote for Ming:
>
> On Sun, Feb 14, 2016 at 04:02:18PM +0200, Sagi Grimberg wrote:
>>
>> >>I'm bisecting now, there are a couple of patches from Ming in
>> >>the area of the bio splitting code...
>> >>
>> >>CC'ing Ming, Linux-block and Linux-nvme as iser is identical to nvme
>> >>wrt the virtual boundary so I think nvme will break as well.

The bisected commit was merged in v4.3, and I have not seen any such
report from nvme.

>> >
>> >Bisection reveals that this one is the culprit:
>> >
>> >commit 52cc6eead9095e2faf2ec7afc013aa3af1f01ac5
>> >Author: Ming Lei <ming.lei@xxxxxxxxxxxxx>
>> >Date:   Thu Sep 17 09:58:38 2015 -0600
>> >
>> >    block: blk-merge: fast-clone bio when splitting rw bios
>> >
>> >    biovecs has become immutable since v3.13, so it isn't necessary
>> >    to allocate biovecs for the new cloned bios, then we can save
>> >    one extra biovecs allocation/copy, and the allocation is often
>> >    not fixed-length and a bit more expensive.
>> >
>> >    For example, if the 'max_sectors_kb' of null blk's queue is set
>> >    as 16(32 sectors) via sysfs just for making more splits, this patch
>> >    can increase throughput about ~70% in the sequential read test over
>> >    null_blk(direct io, bs: 1M).
>> >
>> >    Cc: Christoph Hellwig <hch@xxxxxxxxxxxxx>
>> >    Cc: Kent Overstreet <kent.overstreet@xxxxxxxxx>
>> >    Cc: Ming Lin <ming.l@xxxxxxxxxxxxxxx>
>> >    Cc: Dongsu Park <dpark@xxxxxxxxxx>
>> >    Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxxxxx>
>> >
>> >    This fixes a performance regression introduced by commit 54efd50bfd,
>> >    and allows us to take full advantage of the fact that we have
>> >    immutable bio_vecs. Hand applied, as it rejected violently with commit
>> >    5014c311baa2.
>> >
>> >    Signed-off-by: Jens Axboe <axboe@xxxxxx>
>> >--
>>
>> Looks like there is a problem with bio_clone_fast()
>>
>> This change makes the problem go away:
>> --
>> diff --git a/block/bio.c b/block/bio.c
>> index dbabd48..5e93733 100644
>> --- a/block/bio.c
>> +++ b/block/bio.c
>> @@ -1791,7 +1791,7 @@ struct bio *bio_split(struct bio *bio, int sectors,
>>  	 * Discards need a mutable bio_vec to accommodate the payload
>>  	 * required by the DSM TRIM and UNMAP commands.
>>  	 */
>> -	if (bio->bi_rw & REQ_DISCARD)
>> +	if (1 || bio->bi_rw & REQ_DISCARD)
>>  		split = bio_clone_bioset(bio, gfp, bs);
>>  	else
>>  		split = bio_clone_fast(bio, gfp, bs);
>> --
>>
>> Any thoughts?

I don't think bio_clone_fast() is wrong. The clone reuses the bvec
table of the original bio; drivers are not allowed to change that
table, and should only access the bvecs through the standard iterator
helpers.
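For example, something like this is the safe way to look at a
fast-cloned bio (just a sketch in kernel context; the function name
and the printing are mine): bio_for_each_segment() advances a copy of
the clone's own bi_iter, so it sees exactly the clone's byte range
even though bi_io_vec points into the parent's table.

/* sketch only; needs <linux/bio.h> */
static void dump_bio_segments(struct bio *bio)
{
	struct bio_vec bvec;
	struct bvec_iter iter;

	/* walks a copy of bio->bi_iter, never the shared bvec table */
	bio_for_each_segment(bvec, bio, iter)
		pr_info("page %p off %u len %u\n",
			bvec.bv_page, bvec.bv_offset, bvec.bv_len);
}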
Also, bio_for_each_segment_all() can't be used to iterate over a
cloned bio, since it walks the bvec table directly instead of
following bi_iter.

Thanks,
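P.S. Here is a fuller version of the gap check suggested above. It is
only a sketch (the function name is made up), assuming the current
block helpers; bvec_gap_to_prev() is the same check the merge code
applies against the queue's virt boundary.

static void check_rq_for_gaps(struct request *rq)
{
	struct req_iterator iter;
	struct bio_vec bvec, prev;
	bool has_prev = false;

	rq_for_each_segment(bvec, rq, iter) {
		/* two bvecs with a hole between them can't share a segment */
		if (has_prev &&
		    bvec_gap_to_prev(rq->q, &prev, bvec.bv_offset))
			pr_warn("gap in rq %p: prev off %u len %u, cur off %u\n",
				rq, prev.bv_offset, prev.bv_len,
				bvec.bv_offset);
		prev = bvec;
		has_prev = true;
	}
}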
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 8ba1e97..5328f3c 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -10,6 +10,8 @@
 #include <linux/hrtimer.h>
 #include <linux/lightnvm.h>
 
+#define MAX_SG 128
+
 struct nullb_cmd {
 	struct list_head list;
 	struct llist_node ll_list;
@@ -19,6 +21,7 @@ struct nullb_cmd {
 	unsigned int tag;
 	struct nullb_queue *nq;
 	struct hrtimer timer;
+	struct scatterlist sg[MAX_SG];
 };
 
 struct nullb_queue {
@@ -351,6 +354,23 @@ static void null_request_fn(struct request_queue *q)
 	}
 }
 
+static int handle_sg(struct nullb_cmd *cmd)
+{
+	int num, i;
+	struct request *rq = cmd->rq;
+
+	num = blk_rq_map_sg(rq->q, rq, cmd->sg);
+	trace_printk("%s dump sg for %p(%d)\n", __func__, rq, rq->tag);
+	for (i = 0; i < num; i++) {
+		struct scatterlist *sg = &cmd->sg[i];
+		trace_printk("\t %4d: %u %u %llu\n",
+				i, sg->offset, sg->length,
+				(unsigned long long)sg->dma_address);
+	}
+
+	return 0;
+}
+
 static int null_queue_rq(struct blk_mq_hw_ctx *hctx,
 			 const struct blk_mq_queue_data *bd)
 {
@@ -365,6 +385,8 @@ static int null_queue_rq(struct blk_mq_hw_ctx *hctx,
 
 	blk_mq_start_request(bd->rq);
 
+	handle_sg(cmd);
+
 	null_handle_cmd(cmd);
 	return BLK_MQ_RQ_QUEUE_OK;
 }
@@ -707,6 +729,8 @@ static int null_add_dev(void)
 
 	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, nullb->q);
 	queue_flag_clear_unlocked(QUEUE_FLAG_ADD_RANDOM, nullb->q);
+	blk_queue_virt_boundary(nullb->q, (1UL << 12) - 1);
+	blk_queue_max_segments(nullb->q, MAX_SG);
 
 	mutex_lock(&lock);
 	list_add_tail(&nullb->list, &nullb_list);
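(To trigger lots of splits with this patch, the same trick as in the
commit log quoted above works: set the queue's 'max_sectors_kb' to a
small value such as 16 via sysfs, run a big direct-io read, then check
the trace_printk() output in the ftrace buffer for adjacent sg entries
crossing the 4k boundary.)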