On Tue, Feb 26, 2019 at 07:43:32PM -0700, Jens Axboe wrote:
> On 2/26/19 7:37 PM, Ming Lei wrote:
> > On Tue, Feb 26, 2019 at 07:28:54PM -0700, Jens Axboe wrote:
> >> On 2/26/19 7:21 PM, Ming Lei wrote:
> >>> On Tue, Feb 26, 2019 at 06:57:16PM -0700, Jens Axboe wrote:
> >>>> On 2/26/19 6:53 PM, Ming Lei wrote:
> >>>>> On Tue, Feb 26, 2019 at 06:47:54PM -0700, Jens Axboe wrote:
> >>>>>> On 2/26/19 6:21 PM, Ming Lei wrote:
> >>>>>>> On Tue, Feb 26, 2019 at 11:56 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
> >>>>>>>>
> >>>>>>>> On 2/25/19 9:34 PM, Jens Axboe wrote:
> >>>>>>>>> On 2/25/19 8:46 PM, Eric Biggers wrote:
> >>>>>>>>>> Hi Jens,
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Feb 21, 2019 at 10:45:27AM -0700, Jens Axboe wrote:
> >>>>>>>>>>> On 2/20/19 3:58 PM, Ming Lei wrote:
> >>>>>>>>>>>> On Mon, Feb 11, 2019 at 12:00:41PM -0700, Jens Axboe wrote:
> >>>>>>>>>>>>> For an ITER_BVEC, we can just iterate the iov and add the pages
> >>>>>>>>>>>>> to the bio directly. This requires that the caller doesn't release
> >>>>>>>>>>>>> the pages on IO completion; we add a BIO_NO_PAGE_REF flag for that.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The current two callers of bio_iov_iter_get_pages() are updated to
> >>>>>>>>>>>>> check if they need to release pages on completion. This makes them
> >>>>>>>>>>>>> work with bvecs that contain kernel mapped pages already.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Reviewed-by: Hannes Reinecke <hare@xxxxxxxx>
> >>>>>>>>>>>>> Reviewed-by: Christoph Hellwig <hch@xxxxxx>
> >>>>>>>>>>>>> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
> >>>>>>>>>>>>> ---
> >>>>>>>>>>>>>  block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
> >>>>>>>>>>>>>  fs/block_dev.c            |  5 ++--
> >>>>>>>>>>>>>  fs/iomap.c                |  5 ++--
> >>>>>>>>>>>>>  include/linux/blk_types.h |  1 +
> >>>>>>>>>>>>>  4 files changed, 56 insertions(+), 14 deletions(-)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> diff --git a/block/bio.c b/block/bio.c
> >>>>>>>>>>>>> index 4db1008309ed..330df572cfb8 100644
> >>>>>>>>>>>>> --- a/block/bio.c
> >>>>>>>>>>>>> +++ b/block/bio.c
> >>>>>>>>>>>>> @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
> >>>>>>>>>>>>>  }
> >>>>>>>>>>>>>  EXPORT_SYMBOL(bio_add_page);
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
> >>>>>>>>>>>>> +{
> >>>>>>>>>>>>> +	const struct bio_vec *bv = iter->bvec;
> >>>>>>>>>>>>> +	unsigned int len;
> >>>>>>>>>>>>> +	size_t size;
> >>>>>>>>>>>>> +
> >>>>>>>>>>>>> +	len = min_t(size_t, bv->bv_len, iter->count);
> >>>>>>>>>>>>> +	size = bio_add_page(bio, bv->bv_page, len,
> >>>>>>>>>>>>> +				bv->bv_offset + iter->iov_offset);
> >>>>>>>>>>>>
> >>>>>>>>>>>> iter->iov_offset needs to be subtracted from 'len'; it looks like
> >>>>>>>>>>>> the following delta change[1] is required, otherwise memory corruption
> >>>>>>>>>>>> can be observed when running xfstests over loop/dio.
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks, I folded this in.
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> Jens Axboe
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> syzkaller started hitting a crash on linux-next starting with this commit, and
> >>>>>>>>>> it still occurs even with your latest version that has Ming's fix folded in.
> >>>>>>>>>> Specifically, commit a566653ab5ab80a from your io_uring branch with commit date
> >>>>>>>>>> Sun Feb 24 08:20:53 2019 -0700.
> >>>>>>>>>>
> >>>>>>>>>> Reproducer:
> >>>>>>>>>>
> >>>>>>>>>> #define _GNU_SOURCE
> >>>>>>>>>> #include <fcntl.h>
> >>>>>>>>>> #include <linux/loop.h>
> >>>>>>>>>> #include <sys/ioctl.h>
> >>>>>>>>>> #include <sys/sendfile.h>
> >>>>>>>>>> #include <sys/syscall.h>
> >>>>>>>>>> #include <unistd.h>
> >>>>>>>>>>
> >>>>>>>>>> int main(void)
> >>>>>>>>>> {
> >>>>>>>>>> 	int memfd, loopfd;
> >>>>>>>>>>
> >>>>>>>>>> 	memfd = syscall(__NR_memfd_create, "foo", 0);
> >>>>>>>>>>
> >>>>>>>>>> 	pwrite(memfd, "\xa8", 1, 4096);
> >>>>>>>>>>
> >>>>>>>>>> 	loopfd = open("/dev/loop0", O_RDWR|O_DIRECT);
> >>>>>>>>>>
> >>>>>>>>>> 	ioctl(loopfd, LOOP_SET_FD, memfd);
> >>>>>>>>>>
> >>>>>>>>>> 	sendfile(loopfd, loopfd, NULL, 1000000);
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Crash:
> >>>>>>>>>>
> >>>>>>>>>> page:ffffea0001a6aab8 count:0 mapcount:0 mapping:0000000000000000 index:0x0
> >>>>>>>>>> flags: 0x100000000000000()
> >>>>>>>>>> raw: 0100000000000000 ffffea0001ad2c50 ffff88807fca49d0 0000000000000000
> >>>>>>>>>> raw: 0000000000000000 0000000000000000 00000000ffffffff
> >>>>>>>>>> page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
> >>>>>>>>>
> >>>>>>>>> I see what this is, I'll cut a fix for this tomorrow.
> >>>>>>>>
> >>>>>>>> Folded in a fix for this, it's in my current io_uring branch and my for-next
> >>>>>>>> branch.
> >>>>>>>
> >>>>>>> Hi Jens,
> >>>>>>>
> >>>>>>> I saw the following change was added:
> >>>>>>>
> >>>>>>> +	if (size == len) {
> >>>>>>> +		/*
> >>>>>>> +		 * For the normal O_DIRECT case, we could skip grabbing this
> >>>>>>> +		 * reference and then not have to put them again when IO
> >>>>>>> +		 * completes. But this breaks some in-kernel users, like
> >>>>>>> +		 * splicing to/from a loop device, where we release the pipe
> >>>>>>> +		 * pages unconditionally. If we can fix that case, we can
> >>>>>>> +		 * get rid of the get here and the need to call
> >>>>>>> +		 * bio_release_pages() at IO completion time.
> >>>>>>> +		 */
> >>>>>>> +		get_page(bv->bv_page);
> >>>>>>>
> >>>>>>> Now the 'bv' may point to more than one page, so the following may be
> >>>>>>> needed:
> >>>>>>>
> >>>>>>> 	int i;
> >>>>>>> 	struct bvec_iter_all iter_all;
> >>>>>>> 	struct bio_vec *tmp;
> >>>>>>>
> >>>>>>> 	mp_bvec_for_each_segment(tmp, bv, i, iter_all)
> >>>>>>> 		get_page(tmp->bv_page);
> >>>>>>
> >>>>>> I guess that would be the safest, even if we don't currently have more
> >>>>>> than one page in there. I'll fix it up.
> >>>>>
> >>>>> It is easy to see a multipage bvec from loop. :-)
> >>>>
> >>>> Speaking of this, I took a quick look at why we've now regressed a lot
> >>>> on IOPS perf with the multipage work. It looks like it's all related to
> >>>> the (much) fatter setup around iteration, which is related to this very
> >>>> topic too.
> >>>>
> >>>> Basically setup of things like bio_for_each_bvec() and indexing through
> >>>> nth_page() is MUCH slower than before.
> >>>
> >>> But bio_for_each_bvec() shouldn't need nth_page(); only
> >>> bio_for_each_segment() needs that. However, bio_for_each_segment()
> >>> isn't called from blk_queue_split() and blk_rq_map_sg().
> >>>
> >>> One issue is that bio_for_each_bvec() still advances by page size
> >>> instead of bvec->bv_len; I guess that is the problem. I will cook a
> >>> patch for your test.
> >>
> >> Probably won't make a difference for my test case...
> >>
> >>>> We need to do something about this, it's like tossing out months of
> >>>> optimizations.
> >>>
> >>> Some follow-up optimizations can be done, such as removing
> >>> biovec_phys_mergeable() from blk_bio_segment_split().
> >>
> >> I think we really need a fast path for <= PAGE_SIZE IOs, to the extent
> >> that it is possible. But iteration startup cost is a problem in a lot of
> >> spots, and a split fast path will only help a bit for that specific
> >> case.
> >>
> >> A 5% regression is HUGE. I know I've mentioned this before, I just want
> >> to really stress how big of a deal that is. It's enough to make me
> >> consider just reverting it again, which sucks, but I don't feel great
> >> shipping something that is known to be that much slower.
> >>
> >> Suggestions?
> >
> > You mentioned nth_page() costs a lot in bio_for_each_bvec(), but that
> > path shouldn't call into nth_page() at all. I will look into it first.
>
> I'll check on the test box tomorrow, I lost connectivity before. I'll
> double check in the morning.
>
> I'd focus on the blk_rq_map_sg() path, since that's the biggest cycle
> consumer.

Hi Jens,

Could you test the following patch, which may improve the 4k randio
test case?

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 066b66430523..c1ad8abbd9d6 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -447,7 +447,7 @@ static int blk_phys_contig_segment(struct request_queue *q, struct bio *bio,
 	return biovec_phys_mergeable(q, &end_bv, &nxt_bv);
 }
 
-static struct scatterlist *blk_next_sg(struct scatterlist **sg,
+static inline struct scatterlist *blk_next_sg(struct scatterlist **sg,
 		struct scatterlist *sglist)
 {
 	if (!*sg)
@@ -483,7 +483,7 @@ static unsigned blk_bvec_map_sg(struct request_queue *q,
 		offset = (total + bvec->bv_offset) % PAGE_SIZE;
 		idx = (total + bvec->bv_offset) / PAGE_SIZE;
-		pg = nth_page(bvec->bv_page, idx);
+		pg = bvec_nth_page(bvec->bv_page, idx);
 
 		sg_set_page(*sg, pg, seg_size, offset);
 
@@ -512,7 +512,12 @@ __blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
 		(*sg)->length += nbytes;
 	} else {
 new_segment:
-		(*nsegs) += blk_bvec_map_sg(q, bvec, sglist, sg);
+		if (bvec->bv_offset + bvec->bv_len <= PAGE_SIZE) {
+			*sg = blk_next_sg(sg, sglist);
+			sg_set_page(*sg, bvec->bv_page, nbytes, bvec->bv_offset);
+			(*nsegs) += 1;
+		} else
+			(*nsegs) += blk_bvec_map_sg(q, bvec, sglist, sg);
 	}
 	*bvprv = *bvec;
 }
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 30a57b68d017..4376f683c08a 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -51,6 +51,11 @@ struct bvec_iter_all {
 	unsigned	done;
 };
 
+static inline struct page *bvec_nth_page(struct page *page, int idx)
+{
+	return idx == 0 ? page : nth_page(page, idx);
+}
+
 /*
  * various member access, note that bio_data should of course not be used
  * on highmem page vectors
@@ -87,8 +92,8 @@ struct bvec_iter_all {
 	PAGE_SIZE - bvec_iter_offset((bvec), (iter)))
 
 #define bvec_iter_page(bvec, iter)				\
-	nth_page(mp_bvec_iter_page((bvec), (iter)),		\
-		 mp_bvec_iter_page_idx((bvec), (iter)))
+	bvec_nth_page(mp_bvec_iter_page((bvec), (iter)),	\
+		      mp_bvec_iter_page_idx((bvec), (iter)))
 
 #define bvec_iter_bvec(bvec, iter)				\
 ((struct bio_vec) {						\
@@ -171,7 +176,7 @@ static inline void mp_bvec_last_segment(const struct bio_vec *bvec,
 	unsigned total = bvec->bv_offset + bvec->bv_len;
 	unsigned last_page = (total - 1) / PAGE_SIZE;
 
-	seg->bv_page = nth_page(bvec->bv_page, last_page);
+	seg->bv_page = bvec_nth_page(bvec->bv_page, last_page);
 
 	/* the whole segment is inside the last page */
 	if (bvec->bv_offset >= last_page * PAGE_SIZE) {

thanks,
Ming
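
A standalone userspace sketch of the fast path bvec_nth_page() adds in
the patch above. On kernel configs where struct pages are not virtually
contiguous (e.g. SPARSEMEM without VMEMMAP), nth_page() has to round-trip
through a pfn, so short-circuiting the common idx == 0 case skips that
work entirely. The struct page, mem_map, and pfn helpers below are toy
stand-ins for illustration, not the real kernel definitions:

#include <assert.h>
#include <stdio.h>

/* Toy model: a struct page that knows its pfn, and a tiny "memory map". */
struct page { unsigned long pfn; };

static struct page mem_map[16];

static unsigned long page_to_pfn(struct page *p) { return p->pfn; }
static struct page *pfn_to_page(unsigned long pfn) { return &mem_map[pfn]; }

/* Modeled general case: converting to a pfn and back is the cost
 * being avoided. */
static struct page *nth_page(struct page *p, int n)
{
	return pfn_to_page(page_to_pfn(p) + n);
}

/* The helper from the patch: return the page directly for the
 * overwhelmingly common single-page case (idx == 0). */
static struct page *bvec_nth_page(struct page *p, int idx)
{
	return idx == 0 ? p : nth_page(p, idx);
}

int main(void)
{
	for (unsigned long i = 0; i < 16; i++)
		mem_map[i].pfn = i;

	struct page *first = &mem_map[3];

	/* idx == 0: returned directly, no pfn round-trip at all */
	assert(bvec_nth_page(first, 0) == first);
	/* idx > 0: falls back to the general nth_page() path */
	assert(bvec_nth_page(first, 2) == &mem_map[5]);
	printf("bvec_nth_page fast path ok\n");
	return 0;
}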
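
The iov_offset point Ming raised earlier in the thread (the delta itself
is only referenced as [1] above) can also be checked with plain
arithmetic. This standalone model uses made-up example values, not
kernel data, to show how computing len without subtracting iov_offset
lets the added segment run past the end of the bvec:

#include <assert.h>
#include <stdio.h>

/* Toy model of the length computation in __bio_iov_bvec_add_pages(). */
int main(void)
{
	unsigned int bv_len = 4096;	/* bytes covered by this bvec      */
	unsigned int bv_offset = 0;	/* bvec's offset into its page     */
	unsigned int iov_offset = 512;	/* bytes already consumed in bvec  */
	unsigned int count = 8192;	/* bytes left in the iterator      */

	/* As posted: ignores that iov_offset bytes are already done. */
	unsigned int buggy_len = bv_len < count ? bv_len : count;

	/* With the subtraction Ming describes: stay inside the bvec. */
	unsigned int remaining = bv_len - iov_offset;
	unsigned int fixed_len = remaining < count ? remaining : count;

	/* The segment starts at bv_offset + iov_offset; with the buggy
	 * length its end exceeds bv_offset + bv_len, touching memory the
	 * bvec doesn't own, which matches the corruption seen in xfstests. */
	assert(bv_offset + iov_offset + buggy_len > bv_offset + bv_len);
	assert(bv_offset + iov_offset + fixed_len <= bv_offset + bv_len);

	printf("buggy len %u overruns by %u bytes; fixed len %u is in bounds\n",
	       buggy_len, buggy_len - fixed_len, fixed_len);
	return 0;
}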