Re: [patch 3/4] [PATCH] kvm: Fix tprot locking

Avi Kivity <avi@xxxxxxxxxx> · Sun, 20 Nov 2011 14:05:56 +0200

On 11/17/2011 01:32 PM, Martin Schwidefsky wrote:
> On Thu, 17 Nov 2011 12:15:52 +0100
> Martin Schwidefsky <schwidefsky@xxxxxxxxxx> wrote:
>
> > On Thu, 17 Nov 2011 12:27:41 +0200
> > Avi Kivity <avi@xxxxxxxxxx> wrote:
> > 
> > > On 11/17/2011 12:00 PM, Carsten Otte wrote:
> > > > From: Christian Borntraeger <borntraeger@xxxxxxxxxx> 
> > > >
> > > > There is a potential host deadlock in the tprot intercept handling.
> > > > We must not hold the mmap semaphore while resolving the guest
> > > > address. If userspace is remapping, then the memory detection in
> > > > the guest is broken anyway so we can safely separate the 
> > > > address translation from walking the vmas.
> > > >
> > > > Signed-off-by: Christian Borntraeger <borntraeger@xxxxxxxxxx> 
> > > > Signed-off-by: Carsten Otte <cotte@xxxxxxxxxx>
> > > > ---
> > > >
> > > >  arch/s390/kvm/priv.c |   10 ++++++++--
> > > >  1 file changed, 8 insertions(+), 2 deletions(-)
> > > >
> > > > diff -urpN linux-2.6/arch/s390/kvm/priv.c linux-2.6-patched/arch/s390/kvm/priv.c
> > > > --- linux-2.6/arch/s390/kvm/priv.c	2011-10-24 09:10:05.000000000 +0200
> > > > +++ linux-2.6-patched/arch/s390/kvm/priv.c	2011-11-17 10:03:53.000000000 +0100
> > > > @@ -336,6 +336,7 @@ static int handle_tprot(struct kvm_vcpu
> > > >  	u64 address1 = disp1 + base1 ? vcpu->arch.guest_gprs[base1] : 0;
> > > >  	u64 address2 = disp2 + base2 ? vcpu->arch.guest_gprs[base2] : 0;
> > > >  	struct vm_area_struct *vma;
> > > > +	unsigned long user_address;
> > > >  
> > > >  	vcpu->stat.instruction_tprot++;
> > > >  
> > > > @@ -349,9 +350,14 @@ static int handle_tprot(struct kvm_vcpu
> > > >  		return -EOPNOTSUPP;
> > > >  
> > > >  
> > > > +	/* we must resolve the address without holding the mmap semaphore.
> > > > +	 * This is ok since the userspace hypervisor is not supposed to change
> > > > +	 * the mapping while the guest queries the memory. Otherwise the guest
> > > > +	 * might crash or get wrong info anyway. */
> > > > +	user_address = (unsigned long) __guestaddr_to_user(vcpu, address1);
> > > > +
> > > >  	down_read(&current->mm->mmap_sem);
> > > > -	vma = find_vma(current->mm,
> > > > -			(unsigned long) __guestaddr_to_user(vcpu, address1));
> > > > +	vma = find_vma(current->mm, user_address);
> > > >  	if (!vma) {
> > > >  		up_read(&current->mm->mmap_sem);
> > > >  		return kvm_s390_inject_program_int(vcpu, PGM_ADDRESSING);
> > > >
> > > 
> > > Unrelated to the patch, but I'm curious: it looks like __gmap_fault()
> > > dereferences the guest page table?  How can it assume that it is mapped?
> > 
> > The gmap code does not assume that the page table is mapped. If the
> > individual MB has not been mapped in the guest address space or the
> > target memory is not mapped in the process address space, __gmap_fault()
> > returns -EFAULT.
> > 
> > > I'm probably misreading the code.
> > > 
> > > A little closer to the patch, x86 handles the same issue by calling
> > > get_user_pages_fast().  This should be more scalable than bouncing
> > > mmap_sem, something to consider.
> > 
> > I don't think that the frequency of asynchronous page faults will make
> > it necessary to use get_user_pages_fast(). We are talking about the
> > case where I/O is necessary to provide the page that the guest accessed.
> > 
> > The advantage of the way s390 does things is that after __gmap_fault()
> > has translated the guest address to a user space address we can just do
> > a standard page fault for the user space process. Only if that requires
> > I/O do we go the long way. Makes sense?
>
> Hmm, Carsten just made me aware that your question is not about pfault,
> it is about the standard case of a guest fault.
>
> For normal guest faults we use a cool trick that the s390 hardware
> allows us. We have the page table for the kvm process and we have the
> guest page table for execution in the virtualized context. The trick is
> that the guest page table reuses the lowest level of the process page
> table. A fault that sets a pte in the process page table will
> automatically make that pte visible in the guest page table as well
> if the memory region has been mapped in the higher order page tables.
> Even the invalidation of a pte will automatically (!!) remove the
> referenced page from the guest page table as well, including the TLB
> entries on all cpus. The IPTE instruction is your friend :-)
> That is why we don't need mm notifiers.

Yes, that explains it perfectly.  I congratulate you on having such
friendly hardware...

-- 
error compiling committee.c: too many arguments to function
