On 24.11.2012, at 10:32, Paul Mackerras wrote:

> On Sat, Nov 24, 2012 at 10:05:37AM +0100, Alexander Graf wrote:
>>
>>
>> On 23.11.2012, at 23:13, Paul Mackerras <paulus@xxxxxxxxx> wrote:
>>
>>> On Fri, Nov 23, 2012 at 04:47:45PM +0100, Alexander Graf wrote:
>>>>
>>>> On 22.11.2012, at 10:28, Paul Mackerras wrote:
>>>>
>>>>> Currently, if the guest does an H_PROTECT hcall requesting that the
>>>>> permissions on a HPT entry be changed to allow writing, we make the
>>>>> requested change even if the page is marked read-only in the host
>>>>> Linux page tables. This is a problem since it would for instance
>>>>> allow a guest to modify a page that KSM has decided can be shared
>>>>> between multiple guests.
>>>>>
>>>>> To fix this, if the new permissions for the page allow writing, we need
>>>>> to look up the memslot for the page, work out the host virtual address,
>>>>> and look up the Linux page tables to get the PTE for the page. If that
>>>>> PTE is read-only, we reduce the HPTE permissions to read-only.
>>>>
>>>> How does KSM handle this usually? If you reduce the permissions to R/O, how do you ever get a R/W page from a deduplicated one?
>>>
>>> The scenario goes something like this:
>>>
>>> 1. Guest creates an HPTE with RO permissions.
>>> 2. KSM decides the page is identical to another page and changes the
>>>    HPTE to point to a shared copy. Permissions are still RO.
>>> 3. Guest decides it wants write access to the page and does an
>>>    H_PROTECT hcall to change the permissions on the HPTE to RW.
>>>
>>> The bug is that we actually make the requested change in step 3.
>>> Instead we should leave it at RO, then when the guest tries to write
>>> to the page, we take a hypervisor page fault, copy the page and give
>>> the guest write access to its own copy of the page.
>>>
>>> So what this patch does is add code to H_PROTECT so that if the guest
>>> is requesting RW access, we check the Linux PTE to see if the
>>> underlying guest page is RO, and if so reduce the permissions in the
>>> HPTE to RO.
>>
>> But this will be guest visible, because now H_PROTECT doesn't actually mark the page R/W in the HTAB, right?
>
> No - the guest view of the HPTE has R/W permissions. The guest view
> of the HPTE is made up of doubleword 0 from the real HPT plus
> rev->guest_rpte for doubleword 1 (where rev is the entry in the revmap
> array, kvm->arch.revmap, for the HPTE). The guest view can be
> different from the host/hardware view, which is in the real HPT. For
> instance, the guest view of a HPTE might be valid but the host view
> might be invalid because the underlying real page has been paged out -
> in that case we use a software bit which we call HPTE_V_ABSENT to
> remind ourselves that there is something valid there from the guest's
> point of view. Or the guest view can be R/W but the host view is RO,
> as in the case where KSM has merged the page.
>
>> So the flow with this patch is:
>>
>>  - guest page permission fault
>
> This comes through the host (kvmppc_hpte_hv_fault()) which looks at
> the guest view of the HPTE, sees that it has RO permissions, and sends
> the page fault to the guest.
>
>>  - guest does H_PROTECT to mark page r/w
>>  - H_PROTECT doesn't do anything
>>  - guest returns from permission handler, triggers write fault
>
> This comes once again to kvmppc_hpte_hv_fault(), which sees that the
> guest view of the HPTE has R/W permissions now, and sends the page
> fault to kvmppc_book3s_hv_page_fault(), which requests write access to
> the page, possibly triggering copy-on-write or whatever, and updates
> the real HPTE to have R/W permissions and possibly point to a new page
> of memory.
>
>>
>> 2 questions here:
>>
>> How does the host know that the page is actually r/w?
>
> I assume you mean RO?
> It looks up the memslot for the guest physical
> address (which it gets from rev->guest_rpte), uses that to work out
> the host virtual address (i.e. the address in qemu's address space),
> looks up the Linux PTE in qemu's Linux page tables, and looks at the
> _PAGE_RW bit there.
>
>> How does this work on 970? I thought page faults always go straight to the guest there.
>
> They do, which is why PPC970 can't do any of this. On PPC970 we have
> kvm->arch.using_mmu_notifiers == 0, and that makes the code pin every
> page of guest memory that is mapped by a guest HPTE (with a Linux
> guest, that means every page, because of the linear mapping). On
> POWER7 we have kvm->arch.using_mmu_notifiers == 1, which enables
> host paging and deduplication of guest memory.

Thanks a lot for the detailed explanation! Maybe you guys should just release an HV capable p7 system publicly, so we can deprecate 970 support. That would make a few things quite a bit easier ;)

Thanks, applied to kvm-ppc-next.


Alex
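
For anyone reading this thread later, the flow Paul describes (guest view vs. host view of an HPTE, the H_PROTECT check against the Linux PTE, and the later write fault) can be sketched as a small toy model in C. This is only an illustration under stated assumptions, not the kernel code: every type and function name below (toy_hpte, toy_memslot, toy_h_protect, toy_write_fault, ...) is hypothetical, the "page table" is a flat array hung off the memslot, and the handling of a missing memslot is a simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: none of these names exist in the kernel. */
#define TOY_PAGE_SHIFT 12

enum toy_perm { TOY_PERM_RO, TOY_PERM_RW };
enum toy_fault { TOY_FAULT_TO_GUEST, TOY_FAULT_HANDLED_BY_HOST };

/* Stand-in for a Linux PTE; 'writable' models the _PAGE_RW bit. */
struct toy_linux_pte { bool writable; };

/* Stand-in for a memslot: a guest-physical range backed by host memory. */
struct toy_memslot {
    uint64_t base_gfn;
    uint64_t npages;
    uint64_t userspace_addr;      /* host virtual address of the first page */
    struct toy_linux_pte *ptes;   /* assumed: one toy PTE per page in the slot */
};

/* One HPTE with its two views. */
struct toy_hpte {
    uint64_t gfn;
    enum toy_perm guest_perm;     /* guest view (models rev->guest_rpte) */
    enum toy_perm host_perm;      /* host/hardware view (the real HPT) */
};

static struct toy_memslot *toy_find_memslot(struct toy_memslot *slots,
                                            size_t n, uint64_t gfn)
{
    for (size_t i = 0; i < n; i++)
        if (gfn >= slots[i].base_gfn &&
            gfn - slots[i].base_gfn < slots[i].npages)
            return &slots[i];
    return NULL;
}

/* Guest frame number -> host virtual address inside the slot. */
static uint64_t toy_gfn_to_hva(const struct toy_memslot *slot, uint64_t gfn)
{
    return slot->userspace_addr + ((gfn - slot->base_gfn) << TOY_PAGE_SHIFT);
}

/* Host virtual address -> toy Linux PTE. */
static struct toy_linux_pte *toy_lookup_pte(struct toy_memslot *slot,
                                            uint64_t hva)
{
    uint64_t idx = (hva - slot->userspace_addr) >> TOY_PAGE_SHIFT;
    assert(idx < slot->npages);
    return &slot->ptes[idx];
}

/* H_PROTECT: the guest view always takes the requested permission; the
 * host view only becomes RW if the Linux PTE is already writable. */
static void toy_h_protect(struct toy_hpte *hpte, enum toy_perm requested,
                          struct toy_memslot *slots, size_t n)
{
    hpte->guest_perm = requested;
    hpte->host_perm = TOY_PERM_RO;
    if (requested != TOY_PERM_RW)
        return;
    struct toy_memslot *slot = toy_find_memslot(slots, n, hpte->gfn);
    if (!slot)
        return; /* toy simplification: stay RO when no slot is found */
    if (toy_lookup_pte(slot, toy_gfn_to_hva(slot, hpte->gfn))->writable)
        hpte->host_perm = TOY_PERM_RW;
}

/* Write fault: a guest-view RO entry is reflected to the guest; otherwise
 * the host resolves it (copy-on-write is modelled as just setting the
 * writable bit) and upgrades the host view to RW. */
static enum toy_fault toy_write_fault(struct toy_hpte *hpte,
                                      struct toy_memslot *slots, size_t n)
{
    if (hpte->guest_perm == TOY_PERM_RO)
        return TOY_FAULT_TO_GUEST;
    struct toy_memslot *slot = toy_find_memslot(slots, n, hpte->gfn);
    if (!slot)
        return TOY_FAULT_TO_GUEST; /* toy simplification */
    toy_lookup_pte(slot, toy_gfn_to_hva(slot, hpte->gfn))->writable = true;
    hpte->host_perm = TOY_PERM_RW;
    return TOY_FAULT_HANDLED_BY_HOST;
}
```

Walking the KSM scenario through this model: with a read-only toy PTE, a write fault on a guest-RO entry goes to the guest, toy_h_protect() to RW then leaves the host view at RO while the guest view becomes RW, and the next write fault is handled by the host, which upgrades the host view. That mirrors the two-fault sequence discussed above, nothing more.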