Re: [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

Jason Wang <jasowang@xxxxxxxxxx> · Fri, 8 Mar 2019 16:58:44 +0800

On 2019/3/8 上午3:17, Jerome Glisse wrote:
On Thu, Mar 07, 2019 at 12:56:45PM -0500, Michael S. Tsirkin wrote:
On Thu, Mar 07, 2019 at 10:47:22AM -0500, Michael S. Tsirkin wrote:
On Wed, Mar 06, 2019 at 02:18:12AM -0500, Jason Wang wrote:
+static const struct mmu_notifier_ops vhost_mmu_notifier_ops = {
+	.invalidate_range = vhost_invalidate_range,
+};
+
  void vhost_dev_init(struct vhost_dev *dev,
  		    struct vhost_virtqueue **vqs, int nvqs, int iov_limit)
  {
I also wonder here: when page is write protected then
it does not look like .invalidate_range is invoked.

E.g. mm/ksm.c calls

mmu_notifier_invalidate_range_start and
mmu_notifier_invalidate_range_end but not mmu_notifier_invalidate_range.

Similarly, rmap in page_mkclean_one will not call
mmu_notifier_invalidate_range.

If I'm right vhost won't get notified when page is write-protected since you
didn't install start/end notifiers. Note that end notifier can be called
with page locked, so it's not as straight-forward as just adding a call.
Writing into a write-protected page isn't a good idea.

Note that documentation says:
	it is fine to delay the mmu_notifier_invalidate_range
	call to mmu_notifier_invalidate_range_end() outside the page table lock.
implying it's called just later.
OK I missed the fact that _end actually calls
mmu_notifier_invalidate_range internally. So that part is fine but the
fact that you are trying to take page lock under VQ mutex and take same
mutex within notifier probably means it's broken for ksm and rmap at
least since these call invalidate with lock taken.

And generally, Andrea told me offline one can not take mutex under
the notifier callback. I CC'd Andrea for why.
Correct, you _can not_ take mutex or any sleeping lock from within the
invalidate_range callback as those callback happens under the page table
spinlock. You can however do so under the invalidate_range_start call-
back only if it is a blocking allow callback (there is a flag passdown
with the invalidate_range_start callback if you are not allow to block
then return EBUSY and the invalidation will be aborted).

That's a separate issue from set_page_dirty when memory is file backed.
If you can access file back page then i suggest using set_page_dirty
from within a special version of vunmap() so that when you vunmap you
set the page dirty without taking page lock. It is safe to do so
always from within an mmu notifier callback if you had the page map
with write permission which means that the page had write permission
in the userspace pte too and thus it having dirty pte is expected
and calling set_page_dirty on the page is allowed without any lock.
Locking will happen once the userspace pte are tear down through the
page table lock.

Can I simply can set_page_dirty() before vunmap() in the mmu notifier 
callback, or is there any reason that it must be called within vumap()?

Thanks

It's because of all these issues that I preferred just accessing
userspace memory and handling faults. Unfortunately there does not
appear to exist an API that whitelists a specific driver along the lines
of "I checked this code for speculative info leaks, don't add barriers
on data path please".
Maybe it would be better to explore adding such helper then remapping
page into kernel address space ?

Cheers,
Jérôme
_______________________________________________
Virtualization mailing list
Virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linuxfoundation.org/mailman/listinfo/virtualization