Re: updated orangefs tree at kernel.org

	Something's fishy around pvfs2_inode_read().  Look:

* ->readpage (== read_one_page) is called in normal context; no set_fs()
in sight.

* read_one_page() does kmap() of the page it had been given.  The result is
a kernel pointer.  That pointer is (mis)annotated as a userland one and
passed to pvfs2_inode_read().

* pvfs2_inode_read() shoves it into struct iovec and passes that to
wait_for_direct_io() (type == PVFS_IO_READ).  From there it gets
passed to postcopy_buffers(), where we do
                iov_iter_init(&iter, READ, vec, nr_segs, total_size);
                ret = pvfs_bufmap_copy_to_iovec(bufmap, 
                                                &iter,
                                                buffer_index);
which does copy_page_to_iter().

* after that iov_iter_init() we have iter.type equal to ITER_IOVEC | READ,
so copy_page_to_iter() uses __copy_to_user_inatomic() or falls back to
__copy_to_user().  Giving a kernel pointer to either is going to break on
any architecture where separate MMU contexts are used for kernel and userland.
On x86 we get away with that, but e.g. on sparc64 we won't.
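
To spell the failure mode out, the whole path flattens to something like
this (condensed sketch, not the literal orangefs code - bufmap_page stands
in for whichever shared-buffer page postcopy_buffers() ends up copying
from):

	#include <linux/highmem.h>
	#include <linux/pagemap.h>
	#include <linux/uio.h>

	/*
	 * Condensed sketch of the broken path; the real code goes through
	 * pvfs2_inode_read() -> wait_for_direct_io() -> postcopy_buffers(),
	 * but it boils down to this:
	 */
	static int read_one_page_sketch(struct page *page,
					struct page *bufmap_page)
	{
		char *kaddr = kmap(page);	/* kernel virtual address */
		struct iovec vec = {
			/* bogus annotation - kaddr is not a userland pointer */
			.iov_base = (char __user *)kaddr,
			.iov_len  = PAGE_SIZE,
		};
		struct iov_iter iter;
		size_t n;

		iov_iter_init(&iter, READ, &vec, 1, PAGE_SIZE);
		/*
		 * iter is ITER_IOVEC, so copy_page_to_iter() goes through
		 * __copy_to_user_inatomic()/__copy_to_user() and dereferences
		 * kaddr as a *userland* address - it happens to work on x86,
		 * where kernel and userland share the MMU context, and breaks
		 * on e.g. sparc64, where they don't.
		 */
		n = copy_page_to_iter(bufmap_page, 0, PAGE_SIZE, &iter);
		kunmap(page);
		return n == PAGE_SIZE ? 0 : -EFAULT;
	}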

More to the point, why do you bother with kmap() at all?  Just set an
ITER_BVEC iov_iter over that struct page and let copy_page_to_iter() do the
right thing when it comes to it.  IOW, have an iov_iter passed to
pvfs2_inode_read() / wait_for_direct_io() / postcopy_buffers() and be done
with it - have read_one_page() do
	struct iov_iter iter;
	struct bio_vec bv = {.bv_page = page, .bv_len = PAGE_SIZE, .bv_offset = 0};
	iov_iter_bvec(&iter, ITER_BVEC | READ, &bv, 1, PAGE_SIZE);

	max_block = ((inode->i_size / blocksize) + 1);

	if (page->index < max_block) {
		loff_t blockptr_offset = (((loff_t) page->index) << blockbits);

		bytes_read = pvfs2_inode_read(inode,
					      &iter,
					      blocksize,
					      &blockptr_offset,
					      inode->i_size);
	}
	/* will only zero remaining unread portions of the page data */
	iov_iter_zero(~0U, &iter);

and then proceed as you currently do, except that no kunmap() is needed
at all.

precopy_buffers() would also need to take iov_iter, and the other caller
of wait_for_direct_io() (i.e. do_readv_writev()) will need to pass it
an iov_iter as well.
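
Roughly this, signature-wise (just a sketch - the exact prototypes are up
to you, and pvfs_bufmap_copy_from_iovec() is my guess at the name of the
write-side counterpart):

	/* Sketch: both helpers just consume the caller's iov_iter now. */
	static int precopy_buffers(struct pvfs2_bufmap *bufmap, int buffer_index,
				   struct iov_iter *iter, size_t total_size)
	{
		/* write side: pull total_size bytes out of *iter into the bufmap */
		return pvfs_bufmap_copy_from_iovec(bufmap, iter, buffer_index,
						   total_size);
	}

	static int postcopy_buffers(struct pvfs2_bufmap *bufmap, int buffer_index,
				    struct iov_iter *iter, size_t total_size)
	{
		/* read side: push total_size bytes from the bufmap into *iter */
		return pvfs_bufmap_copy_to_iovec(bufmap, iter, buffer_index,
						 total_size);
	}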

That said, do_readv_writev() also needs some taking care of.  Both callers
are giving it iter->iov (which, BTW, is only valid for ITER_IOVEC ones),
and then play very odd games with splitting iovecs if the total size exceeds
your pvfs_bufmap_size_query().  What for?

Sure, I understand not wanting to submit IO in chunks exceeding your limit.
But why do you need a separate iovec each time?  One thing that iov_iter
does is handling the copying to/from partially used iovecs.  copy_page_to_iter
and copy_page_from_iter are just fine with that.  IOW, why bother with
split_iovecs() at all?
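
Without split_iovecs() the loop turns into something like this (sketch;
assumes wait_for_direct_io() has grown the iov_iter + explicit chunk size
arguments discussed below, and that the enum/helper names match your
driver):

	/*
	 * Sketch of do_readv_writev() chunking by the bufmap limit without
	 * splitting iovecs; copy_page_{to,from}_iter() advances *iter, so
	 * each pass just picks up where the previous one stopped.
	 */
	static ssize_t do_readv_writev(enum PVFS_io_type type, struct file *file,
				       loff_t *offset, struct iov_iter *iter)
	{
		ssize_t total = 0;

		while (iov_iter_count(iter)) {
			size_t chunk = min_t(size_t, iov_iter_count(iter),
					     pvfs_bufmap_size_query());
			ssize_t n = wait_for_direct_io(type, file, offset,
						       iter, chunk);

			if (n < 0)
				return total ? total : n;
			total += n;
			*offset += n;
			if (n < chunk)		/* short IO - stop here */
				break;
		}
		return total;
	}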

wait_for_direct_io() uses 'vec' and 'nr_segs' only for passing them to
{pre,post}copy_buffers(), where they are used only to initialize an iov_iter.
And since copy_page_{to,from}_iter() advances the iterator, there's no
need to reinitialize the damn thing at all.  Just pass the amount to copy
as an explicit argument, rather than using iov_iter_count() in
pvfs_bufmap_copy_to_iovec().
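
Something like this, that is (sketch - desc_array/page_array are assumed
names for the bufmap internals, adjust to taste):

	/*
	 * Sketch: copy 'size' bytes from the mapped bufmap pages into *iter;
	 * copy_page_to_iter() advances the iterator, so nothing gets
	 * reinitialized anywhere.  The bufmap layout below is assumed,
	 * not quoted from the driver.
	 */
	static int pvfs_bufmap_copy_to_iovec(struct pvfs2_bufmap *bufmap,
					     struct iov_iter *iter,
					     int buffer_index, size_t size)
	{
		struct pvfs_bufmap_desc *desc = &bufmap->desc_array[buffer_index];
		unsigned int i;

		for (i = 0; size; i++) {
			size_t n = min_t(size_t, PAGE_SIZE, size);

			if (copy_page_to_iter(desc->page_array[i], 0, n, iter) != n)
				return -EFAULT;
			size -= n;
		}
		return 0;
	}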

Am I missing anything subtle in there?  Looks like do_readv_writev() would
get much simpler from that *and* ->read_iter()/->write_iter() will work
on arbitrary iov_iter after that...
--