Re: [patch 3/4] [PATCH] kvm: Fix tprot locking

Martin Schwidefsky <schwidefsky@xxxxxxxxxx> · Thu, 17 Nov 2011 12:32:33 +0100

On Thu, 17 Nov 2011 12:15:52 +0100
Martin Schwidefsky <schwidefsky@xxxxxxxxxx> wrote:

> On Thu, 17 Nov 2011 12:27:41 +0200
> Avi Kivity <avi@xxxxxxxxxx> wrote:
> 
> > On 11/17/2011 12:00 PM, Carsten Otte wrote:
> > > From: Christian Borntraeger <borntraeger@xxxxxxxxxx> 
> > >
> > > There is a potential host deadlock in the tprot intercept handling.
> > > We must not hold the mmap semaphore while resolving the guest
> > > address. If userspace is remapping, then the memory detection in
> > > the guest is broken anyway so we can safely separate the 
> > > address translation from walking the vmas.
> > >
> > > Signed-off-by: Christian Borntraeger <borntraeger@xxxxxxxxxx> 
> > > Signed-off-by: Carsten Otte <cotte@xxxxxxxxxx>
> > > ---
> > >
> > >  arch/s390/kvm/priv.c |   10 ++++++++--
> > >  1 file changed, 8 insertions(+), 2 deletions(-)
> > >
> > > diff -urpN linux-2.6/arch/s390/kvm/priv.c linux-2.6-patched/arch/s390/kvm/priv.c
> > > --- linux-2.6/arch/s390/kvm/priv.c	2011-10-24 09:10:05.000000000 +0200
> > > +++ linux-2.6-patched/arch/s390/kvm/priv.c	2011-11-17 10:03:53.000000000 +0100
> > > @@ -336,6 +336,7 @@ static int handle_tprot(struct kvm_vcpu
> > >  	u64 address1 = disp1 + base1 ? vcpu->arch.guest_gprs[base1] : 0;
> > >  	u64 address2 = disp2 + base2 ? vcpu->arch.guest_gprs[base2] : 0;
> > >  	struct vm_area_struct *vma;
> > > +	unsigned long user_address;
> > >  
> > >  	vcpu->stat.instruction_tprot++;
> > >  
> > > @@ -349,9 +350,14 @@ static int handle_tprot(struct kvm_vcpu
> > >  		return -EOPNOTSUPP;
> > >  
> > >  
> > > +	/* we must resolve the address without holding the mmap semaphore.
> > > +	 * This is ok since the userspace hypervisor is not supposed to change
> > > +	 * the mapping while the guest queries the memory. Otherwise the guest
> > > +	 * might crash or get wrong info anyway. */
> > > +	user_address = (unsigned long) __guestaddr_to_user(vcpu, address1);
> > > +
> > >  	down_read(&current->mm->mmap_sem);
> > > -	vma = find_vma(current->mm,
> > > -			(unsigned long) __guestaddr_to_user(vcpu, address1));
> > > +	vma = find_vma(current->mm, user_address);
> > >  	if (!vma) {
> > >  		up_read(&current->mm->mmap_sem);
> > >  		return kvm_s390_inject_program_int(vcpu, PGM_ADDRESSING);
> > >
> > 
> > Unrelated to the patch, but I'm curious: it looks like __gmap_fault()
> > dereferences the guest page table?  How can it assume that it is mapped?
> 
> The gmap code does not assume that the code is mapped. If the individual
> MB has not been mapped in the guest address space or the target memory
> is not mapped in the process address space __gmap_fault() returns -EFAULT. 
> 
> > I'm probably misreading the code.
> > 
> > A little closer to the patch, x86 handles the same issue by calling
> > get_user_pages_fast().  This should be more scalable than bouncing
> > mmap_sem, something to consider.
> 
> I don't think that the frequency of asynchronous page faults will make
> it necessary to use get_user_pages_fast(). We are talking about the
> case where I/O is necessary to provide the page that the guest accessed.
> 
> The advantage of the way s390 does things is that after __gmap_fault
> translated the guest address to a user space address we can just do a
> standard page fault for the user space process. Only if that requires
> I/O we go the long way. Makes sense?

Hmm, Carsten just made me aware that your question is not about pfault,
it is about the standard case of a guest fault.

For normal guest faults we use a cool trick that the s390 hardware
allows us. We have the paging table for the kvm process and we have the
guest page table for execution in the virtualized context. The trick is
that the guest page table reuses the lowest level of the process page
table. A fault that sets a pte in the process page table will
automatically make that pte visible in the guest page table as well
if the memory region has been mapped in the higher order page tables.
Even the invalidation of a pte will automatically (!!) remove the
referenced page from the guest page table as well, including the TLB
entries on all cpus. The IPTE instruction is your friend :-)
That is why we don't need mm notifiers.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html