On Thu, 2018-07-19 at 13:56 +0200, Jan Kara wrote: > On Thu 19-07-18 19:04:46, Ming Lei wrote: > > On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote: > > > bio_iov_iter_get_pages() returns only pages for a single non- > > > empty > > > segment of the input iov_iter's iovec. This may be much less than > > > the number > > > of pages __blkdev_direct_IO_simple() is supposed to process. Call > > > bio_iov_iter_get_pages() repeatedly until either the requested > > > number > > > of bytes is reached, or bio.bi_io_vec is exhausted. If this is > > > not done, > > > short writes or reads may occur for direct synchronous IOs with > > > multiple > > > iovec slots (such as generated by writev()). In that case, > > > __generic_file_write_iter() falls back to buffered writes, which > > > has been observed to cause data corruption in certain workloads. > > > > > > Note: if segments aren't page-aligned in the input iovec, this > > > patch may > > > result in multiple adjacent slots of the bi_io_vec array to > > > reference the same > > > page (the byte ranges are guaranteed to be disjunct if the > > > preceding patch is > > > applied). We haven't seen problems with that in our and the > > > customer's > > > tests. It'd be possible to detect this situation and merge > > > bi_io_vec slots > > > that refer to the same page, but I prefer to keep it simple for > > > now. > > > > > > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for > > > simplified bdev direct-io") > > > Signed-off-by: Martin Wilck <mwilck@xxxxxxxx> > > > --- > > > fs/block_dev.c | 8 +++++++- > > > 1 file changed, 7 insertions(+), 1 deletion(-) > > > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c > > > index 0dd87aa..41643c4 100644 > > > --- a/fs/block_dev.c > > > +++ b/fs/block_dev.c > > > @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb > > > *iocb, struct iov_iter *iter, > > > > > > ret = bio_iov_iter_get_pages(&bio, iter); > > > if (unlikely(ret)) > > > - return ret; > > > + goto out; > > > + > > > + while (ret == 0 && > > > + bio.bi_vcnt < bio.bi_max_vecs && > > > iov_iter_count(iter) > 0) > > > + ret = bio_iov_iter_get_pages(&bio, iter); > > > + > > > ret = bio.bi_iter.bi_size; > > > > > > if (iov_iter_rw(iter) == READ) { > > > @@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, > > > struct iov_iter *iter, > > > put_page(bvec->bv_page); > > > } > > > > > > +out: > > > if (vecs != inline_vecs) > > > kfree(vecs); > > > > > > > You might put the 'vecs' leak fix into another patch, and resue the > > current code block for that. > > > > Looks all users of bio_iov_iter_get_pages() need this kind of fix, > > so > > what do you think about the following way? > > No. AFAICT all the other users of bio_iov_iter_get_pages() are > perfectly > fine with it returning less pages and they loop appropriately. We might consider to better fill the bios in __blkdev_direct_IO(), too, it might be a performance gain. Regards Martin -- Dr. Martin Wilck <mwilck@xxxxxxxx>, Tel. +49 (0)911 74053 2107 SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg)