On Thu, 2018-07-19 at 17:21 +0200, Jan Kara wrote: > On Thu 19-07-18 20:20:51, Ming Lei wrote: > > On Thu, Jul 19, 2018 at 01:56:16PM +0200, Jan Kara wrote: > > > On Thu 19-07-18 19:04:46, Ming Lei wrote: > > > > On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote: > > > > > bio_iov_iter_get_pages() returns only pages for a single non- > > > > > empty > > > > > segment of the input iov_iter's iovec. This may be much less > > > > > than the number > > > > > of pages __blkdev_direct_IO_simple() is supposed to process. > > > > > Call > > > > > bio_iov_iter_get_pages() repeatedly until either the > > > > > requested number > > > > > of bytes is reached, or bio.bi_io_vec is exhausted. If this > > > > > is not done, > > > > > short writes or reads may occur for direct synchronous IOs > > > > > with multiple > > > > > iovec slots (such as generated by writev()). In that case, > > > > > __generic_file_write_iter() falls back to buffered writes, > > > > > which > > > > > has been observed to cause data corruption in certain > > > > > workloads. > > > > > > > > > > Note: if segments aren't page-aligned in the input iovec, > > > > > this patch may > > > > > result in multiple adjacent slots of the bi_io_vec array to > > > > > reference the same > > > > > page (the byte ranges are guaranteed to be disjunct if the > > > > > preceding patch is > > > > > applied). We haven't seen problems with that in our and the > > > > > customer's > > > > > tests. It'd be possible to detect this situation and merge > > > > > bi_io_vec slots > > > > > that refer to the same page, but I prefer to keep it simple > > > > > for now. > > > > > > > > > > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO > > > > > for simplified bdev direct-io") > > > > > Signed-off-by: Martin Wilck <mwilck@xxxxxxxx> > > > > > --- > > > > > fs/block_dev.c | 8 +++++++- > > > > > 1 file changed, 7 insertions(+), 1 deletion(-) > > > > > > > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c > > > > > index 0dd87aa..41643c4 100644 > > > > > --- a/fs/block_dev.c > > > > > +++ b/fs/block_dev.c > > > > > @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb > > > > > *iocb, struct iov_iter *iter, > > > > > > > > > > ret = bio_iov_iter_get_pages(&bio, iter); > > > > > if (unlikely(ret)) > > > > > - return ret; > > > > > + goto out; > > > > > + > > > > > + while (ret == 0 && > > > > > + bio.bi_vcnt < bio.bi_max_vecs && > > > > > iov_iter_count(iter) > 0) > > > > > + ret = bio_iov_iter_get_pages(&bio, iter); > > > > > + > > > > > ret = bio.bi_iter.bi_size; > > > > > > > > > > if (iov_iter_rw(iter) == READ) { > > > > > @@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb > > > > > *iocb, struct iov_iter *iter, > > > > > put_page(bvec->bv_page); > > > > > } > > > > > > > > > > +out: > > > > > if (vecs != inline_vecs) > > > > > kfree(vecs); > > > > > > > > > > > > > You might put the 'vecs' leak fix into another patch, and resue > > > > the > > > > current code block for that. > > > > > > > > Looks all users of bio_iov_iter_get_pages() need this kind of > > > > fix, so > > > > what do you think about the following way? > > > > > > No. AFAICT all the other users of bio_iov_iter_get_pages() are > > > perfectly > > > fine with it returning less pages and they loop appropriately. > > > > OK, but this way still may make one bio to hold more data, > > especially > > the comment of bio_iov_iter_get_pages() says that 'Pins as many > > pages from > > *iter', so looks it is the correct way to do. > > Well, but as Al pointed out MM may decide that user has already > pinned too > many pages and refuse to pin more. So pinning full iter worth of > pages may > just be impossible. Currently, there are no checks like this in MM > but > eventually, I'd like to account pinned pages in mlock ulimit (or a > limit of > similar kind) and then what Al speaks about would become very real. > So I'd > prefer to not develop new locations that must be able to pin > arbitrary > amount of pages. Doesn't this wrongfully penalize scatter-gather IO? Consider two 1MiB IO requests, one with a single 1 MiB slot, and one with 256 single-page slots. Calling bio_iov_iter_get_pages() once will result in a 1MiB bio in first case, and a 4k bio in the second. You need to submit 256 individual bios to finish the IO. By using the approach which my patch is takes (repeatedly calling bio_io_iter_get_pages before submitting the bio), you'd get a 1Mib bio in both cases. So Ming's idea to generalize this may not be so bad, after all. I don't see that it would increase the pressure on MM. If we are concerned about MM, we should limit the total number of pages that can be pinned, rather than the number of iovec segments that are processed. Martin -- Dr. Martin Wilck <mwilck@xxxxxxxx>, Tel. +49 (0)911 74053 2107 SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg)