On Sat, Nov 24, 2012 at 10:05:37AM +0100, Alexander Graf wrote:
>
>
> On 23.11.2012, at 23:13, Paul Mackerras <paulus@xxxxxxxxx> wrote:
>
> > On Fri, Nov 23, 2012 at 04:47:45PM +0100, Alexander Graf wrote:
> >>
> >> On 22.11.2012, at 10:28, Paul Mackerras wrote:
> >>
> >>> Currently, if the guest does an H_PROTECT hcall requesting that the
> >>> permissions on a HPT entry be changed to allow writing, we make the
> >>> requested change even if the page is marked read-only in the host
> >>> Linux page tables.  This is a problem since it would for instance
> >>> allow a guest to modify a page that KSM has decided can be shared
> >>> between multiple guests.
> >>>
> >>> To fix this, if the new permissions for the page allow writing, we need
> >>> to look up the memslot for the page, work out the host virtual address,
> >>> and look up the Linux page tables to get the PTE for the page.  If that
> >>> PTE is read-only, we reduce the HPTE permissions to read-only.
> >>
> >> How does KSM handle this usually? If you reduce the permissions to
> >> R/O, how do you ever get a R/W page from a deduplicated one?
> >
> > The scenario goes something like this:
> >
> > 1. Guest creates an HPTE with RO permissions.
> > 2. KSM decides the page is identical to another page and changes the
> >    HPTE to point to a shared copy.  Permissions are still RO.
> > 3. Guest decides it wants write access to the page and does an
> >    H_PROTECT hcall to change the permissions on the HPTE to RW.
> >
> > The bug is that we actually make the requested change in step 3.
> > Instead we should leave it at RO, then when the guest tries to write
> > to the page, we take a hypervisor page fault, copy the page and give
> > the guest write access to its own copy of the page.
> >
> > So what this patch does is add code to H_PROTECT so that if the guest
> > is requesting RW access, we check the Linux PTE to see if the
> > underlying guest page is RO, and if so reduce the permissions in the
> > HPTE to RO.
>
> But this will be guest visible, because now H_PROTECT doesn't actually
> mark the page R/W in the HTAB, right?

No - the guest view of the HPTE has R/W permissions.  The guest view
of the HPTE is made up of doubleword 0 from the real HPT plus
rev->guest_rpte for doubleword 1 (where rev is the entry in the revmap
array, kvm->arch.revmap, for the HPTE).  The guest view can be
different from the host/hardware view, which is in the real HPT.  For
instance, the guest view of a HPTE might be valid but the host view
might be invalid because the underlying real page has been paged out -
in that case we use a software bit which we call HPTE_V_ABSENT to
remind ourselves that there is something valid there from the guest's
point of view.  Or the guest view can be R/W but the host view is RO,
as in the case where KSM has merged the page.

> So the flow with this patch is:
>
> - guest page permission fault

This comes through the host (kvmppc_hpte_hv_fault()) which looks at
the guest view of the HPTE, sees that it has RO permissions, and sends
the page fault to the guest.

> - guest does H_PROTECT to mark page r/w
> - H_PROTECT doesn't do anything
> - guest returns from permission handler, triggers write fault

This comes once again to kvmppc_hpte_hv_fault(), which sees that the
guest view of the HPTE has R/W permissions now, and sends the page
fault to kvmppc_book3s_hv_page_fault(), which requests write access to
the page, possibly triggering copy-on-write or whatever, and updates
the real HPTE to have R/W permissions and possibly point to a new page
of memory.

>
> 2 questions here:
>
> How does the host know that the page is actually r/w?

I assume you mean RO?  It looks up the memslot for the guest physical
address (which it gets from rev->guest_rpte), uses that to work out
the host virtual address (i.e. the address in qemu's address space),
looks up the Linux PTE in qemu's Linux page tables, and looks at the
_PAGE_RW bit there.

> How does this work on 970?
> I thought page faults always go straight to the guest there.

They do, which is why PPC970 can't do any of this.  On PPC970 we have
kvm->arch.using_mmu_notifiers == 0, and that makes the code pin every
page of guest memory that is mapped by a guest HPTE (with a Linux
guest, that means every page, because of the linear mapping).  On
POWER7 we have kvm->arch.using_mmu_notifiers == 1, which enables host
paging and deduplication of guest memory.

Paul.
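
For anyone following the thread, the guest-view/host-view split and the
fault routing described above can be sketched as a toy model in plain C.
This is only an illustration under simplifying assumptions, not the
kernel code: the struct, the enum values and the toy_* function names
are all hypothetical, permissions are collapsed to RO/RW, and the Linux
PTE lookup through the memslot is replaced by a single flag.

```c
#include <stdbool.h>

/* Toy permissions; the real HPTE encodes these in its PP bits. */
enum perm { PERM_RO, PERM_RW };

/* Hypothetical, simplified model of one HPT slot. */
struct toy_hpte {
    enum perm guest_perm;  /* guest view (rev->guest_rpte in the thread) */
    enum perm host_perm;   /* host/hardware view (the real HPT) */
    bool linux_pte_rw;     /* stands in for _PAGE_RW in qemu's page tables */
    bool shared;           /* page is currently a KSM-merged copy */
};

enum fault_target { FAULT_NONE, FAULT_TO_GUEST, FAULT_TO_HOST };

/*
 * H_PROTECT in this model: the guest view always records what the guest
 * asked for, but the host view is held at RO when the guest asks for RW
 * and the backing Linux PTE is read-only.
 */
void toy_h_protect(struct toy_hpte *e, enum perm req)
{
    e->guest_perm = req;
    if (req == PERM_RW && !e->linux_pte_rw)
        e->host_perm = PERM_RO;
    else
        e->host_perm = req;
}

/*
 * A guest write in this model.  If the hardware view allows it, nothing
 * happens.  Otherwise the guest view decides who handles the fault,
 * mirroring the role the thread gives to kvmppc_hpte_hv_fault().
 */
enum fault_target toy_write(struct toy_hpte *e)
{
    if (e->host_perm == PERM_RW)
        return FAULT_NONE;
    if (e->guest_perm == PERM_RO)
        return FAULT_TO_GUEST;   /* guest really mapped the page RO */

    /*
     * Guest view is RW but host view is RO: the host resolves the fault
     * (here, a stand-in for copy-on-write breaking the KSM share) and
     * upgrades the real HPTE, as kvmppc_book3s_hv_page_fault() is
     * described as doing.
     */
    e->shared = false;
    e->linux_pte_rw = true;
    e->host_perm = PERM_RW;
    return FAULT_TO_HOST;
}
```

Starting from a KSM-merged entry that is RO in both views, a write goes
to the guest; after toy_h_protect(&e, PERM_RW) the guest view is RW
while the host view stays RO; the next write is handled by the host,
which un-shares the page and makes the host view RW; later writes no
longer fault.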