On 11/17/2011 01:32 PM, Martin Schwidefsky wrote: > On Thu, 17 Nov 2011 12:15:52 +0100 > Martin Schwidefsky <schwidefsky@xxxxxxxxxx> wrote: > > > On Thu, 17 Nov 2011 12:27:41 +0200 > > Avi Kivity <avi@xxxxxxxxxx> wrote: > > > > > On 11/17/2011 12:00 PM, Carsten Otte wrote: > > > > From: Christian Borntraeger <borntraeger@xxxxxxxxxx> > > > > > > > > There is a potential host deadlock in the tprot intercept handling. > > > > We must not hold the mmap semaphore while resolving the guest > > > > address. If userspace is remapping, then the memory detection in > > > > the guest is broken anyway so we can safely separate the > > > > address translation from walking the vmas. > > > > > > > > Signed-off-by: Christian Borntraeger <borntraeger@xxxxxxxxxx> > > > > Signed-off-by: Carsten Otte <cotte@xxxxxxxxxx> > > > > --- > > > > > > > > arch/s390/kvm/priv.c | 10 ++++++++-- > > > > 1 file changed, 8 insertions(+), 2 deletions(-) > > > > > > > > diff -urpN linux-2.6/arch/s390/kvm/priv.c linux-2.6-patched/arch/s390/kvm/priv.c > > > > --- linux-2.6/arch/s390/kvm/priv.c 2011-10-24 09:10:05.000000000 +0200 > > > > +++ linux-2.6-patched/arch/s390/kvm/priv.c 2011-11-17 10:03:53.000000000 +0100 > > > > @@ -336,6 +336,7 @@ static int handle_tprot(struct kvm_vcpu > > > > u64 address1 = disp1 + base1 ? vcpu->arch.guest_gprs[base1] : 0; > > > > u64 address2 = disp2 + base2 ? vcpu->arch.guest_gprs[base2] : 0; > > > > struct vm_area_struct *vma; > > > > + unsigned long user_address; > > > > > > > > vcpu->stat.instruction_tprot++; > > > > > > > > @@ -349,9 +350,14 @@ static int handle_tprot(struct kvm_vcpu > > > > return -EOPNOTSUPP; > > > > > > > > > > > > + /* we must resolve the address without holding the mmap semaphore. > > > > + * This is ok since the userspace hypervisor is not supposed to change > > > > + * the mapping while the guest queries the memory. Otherwise the guest > > > > + * might crash or get wrong info anyway. */ > > > > + user_address = (unsigned long) __guestaddr_to_user(vcpu, address1); > > > > + > > > > down_read(¤t->mm->mmap_sem); > > > > - vma = find_vma(current->mm, > > > > - (unsigned long) __guestaddr_to_user(vcpu, address1)); > > > > + vma = find_vma(current->mm, user_address); > > > > if (!vma) { > > > > up_read(¤t->mm->mmap_sem); > > > > return kvm_s390_inject_program_int(vcpu, PGM_ADDRESSING); > > > > > > > > > > Unrelated to the patch, but I'm curious: it looks like __gmap_fault() > > > dereferences the guest page table? How can it assume that it is mapped? > > > > The gmap code does not assume that the code is mapped. If the individual > > MB has not been mapped in the guest address space or the target memory > > is not mapped in the process address space __gmap_fault() returns -EFAULT. > > > > > I'm probably misreading the code. > > > > > > A little closer to the patch, x86 handles the same issue by calling > > > get_user_pages_fast(). This should be more scalable than bouncing > > > mmap_sem, something to consider. > > > > I don't think that the frequency of asynchronous page faults will make > > it necessary to use get_user_pages_fast(). We are talking about the > > case where I/O is necessary to provide the page that the guest accessed. > > > > The advantage of the way s390 does things is that after __gmap_fault > > translated the guest address to a user space address we can just do a > > standard page fault for the user space process. Only if that requires > > I/O we go the long way. Makes sense? > > Hmm, Carsten just made me aware that your question is not about pfault, > it is about the standard case of a guest fault. > > For normal guest faults we use a cool trick that the s390 hardware > allows us. We have the paging table for the kvm process and we have the > guest page table for execution in the virtualized context. The trick is > that the guest page table reuses the lowest level of the process page > table. A fault that sets a pte in the process page table will > automatically make that pte visible in the guest page table as well > if the memory region has been mapped in the higher order page tables. > Even the invalidation of a pte will automatically (!!) remove the > referenced page from the guest page table as well, including the TLB > entries on all cpus. The IPTE instruction is your friend :-) > That is why we don't need mm notifiers. Yes, that explains it perfectly. I congratulate you on having such friendly hardware... -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html