Re: [PATCH v3 kvm/queue 14/16] KVM: Handle page fault for private memory

Yan Zhao <yan.y.zhao@xxxxxxxxx> · Fri, 14 Jan 2022 13:53:15 +0800

hi Sean,
Sorry for the late reply. I just saw this mail in my mailbox.

On Wed, Jan 05, 2022 at 08:52:39PM +0000, Sean Christopherson wrote:
> On Wed, Jan 05, 2022, Yan Zhao wrote:
> > Sorry, maybe I didn't express it clearly.
> > 
> > As in the kvm_faultin_pfn_private(), 
> > static bool kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> > 				    struct kvm_page_fault *fault,
> > 				    bool *is_private_pfn, int *r)
> > {
> > 	int order;
> > 	int mem_convert_type;
> > 	struct kvm_memory_slot *slot = fault->slot;
> > 	long pfn = kvm_memfd_get_pfn(slot, fault->gfn, &order);
> > 	...
> > }
> > Currently, kvm_memfd_get_pfn() is called unconditionally.
> > However, if the backend of a private memslot is not memfd, and is device
> > fd for example, a different xxx_get_pfn() is required here.
> 
> Ya, I've complained about this in a different thread[*].  This should really be
> something like kvm_private_fd_get_pfn(), where the underlying ops struct can point
> at any compatible backing store.
> 
> https://lore.kernel.org/all/YcuMUemyBXFYyxCC@xxxxxxxxxx/
>
ok. 

> > Further, though mapped to a private gfn, it might be ok for QEMU to
> > access the device fd in hva-based way (or call it MMU access way, e.g.
> > read/write/mmap), it's desired that it could use the traditional to get
> > pfn without convert the range to a shared one.
> 
> No, this is expressly forbidden.  The backing store for a private gfn must not
> be accessible by userspace.  It's possible a backing store could support both, but
> not concurrently, and any conversion must be done without KVM being involved.
> In other words, resolving a private gfn must either succeed or fail (exit to
> userspace), KVM cannot initiate any conversions.
>
When it comes to a device passthrough via VFIO, there might be more work
related to the device fd as a backend.

First, unlike memfd which can allocate one private fd for a set of PFNs,
and one shared fd for another set of PFNs, for device fd, it needs to open
the same physical device twice, one for shared fd, and one for private fd.

Then, for private device fd, now its ramblock has to use qemu_ram_alloc_from_fd()
instead of current qemu_ram_alloc_from_ptr().
And as in VFIO, this private fd is shared by several ramblocks (each locating from
a different base offset), the base offsets also need to be kept somewhere 
in order to call get_pfn successfully. (this info is kept in
vma through mmap() previously, so without mmap(), a new interface might
be required). 

Also, for shared device fd,  mmap() is required in order to allocate the
ramblock with qemu_ram_alloc_from_ptr(), and more importantly to make
the future gfn_to_hva, and hva_to_pfn possible.
But as the shared and private fds are based on the same physical device,
the vfio driver needs to record which vma ranges are allowed for the actual
mmap_fault, which vma area are not.

With the above changes, it only prevents the host user space from accessing
the device mapped to private GFNs.
For memory backends, host kernel space accessing is prevented via MKTME.
And for device, the device needs to the work to disallow host kernel
space access.
However, unlike memory side, the device side would not cause any MCE. 
Thereby, host user space access to the device also would not cause MCEs, either. 

So, I'm not sure if the above work is worthwhile to the device fd.

> > pfn = __gfn_to_pfn_memslot(slot, fault->gfn, ...)
> > 	|->addr = __gfn_to_hva_many (slot, gfn,...)
> > 	|  pfn = hva_to_pfn (addr,...)
> > 
> > 
> > So, is it possible to recognize such kind of backends in KVM, and to get
> > the pfn in traditional way without converting them to shared?
> > e.g.
> > - specify KVM_MEM_PRIVATE_NONPROTECT to memory regions with such kind
> > of backends, or
> > - detect the fd type and check if get_pfn is provided. if no, go the
> >   traditional way.
> 
> No, because the whole point of this is to make guest private memory inaccessible
> to host userspace.  Or did I misinterpret your questions?
I think the host unmap series is based on the assumption that host user
space access to the memory based to private guest GFNs would cause fatal
MCEs.
So, I hope for backends who will not bring this fatal error can keep
using traditional way to get pfn and be mapped to private GFNs at the
same time.

Thanks
Yan