Re: [PATCH 4/5] KVM: PPC: Book3S HV: Don't give the guest RW access to RO pages

Alexander Graf <agraf@xxxxxxx> · Mon, 26 Nov 2012 14:09:11 +0100

On 24.11.2012, at 10:32, Paul Mackerras wrote:

> On Sat, Nov 24, 2012 at 10:05:37AM +0100, Alexander Graf wrote:
>>
>> On 23.11.2012, at 23:13, Paul Mackerras <paulus@xxxxxxxxx> wrote:
>> 
>>> On Fri, Nov 23, 2012 at 04:47:45PM +0100, Alexander Graf wrote:
>>>> 
>>>> On 22.11.2012, at 10:28, Paul Mackerras wrote:
>>>> 
>>>>> Currently, if the guest does an H_PROTECT hcall requesting that the
>>>>> permissions on a HPT entry be changed to allow writing, we make the
>>>>> requested change even if the page is marked read-only in the host
>>>>> Linux page tables.  This is a problem since it would for instance
>>>>> allow a guest to modify a page that KSM has decided can be shared
>>>>> between multiple guests.
>>>>> 
>>>>> To fix this, if the new permissions for the page allow writing, we need
>>>>> to look up the memslot for the page, work out the host virtual address,
>>>>> and look up the Linux page tables to get the PTE for the page.  If that
>>>>> PTE is read-only, we reduce the HPTE permissions to read-only.
>>>> 
>>>> How does KSM handle this usually? If you reduce the permissions to R/O, how do you ever get a R/W page from a deduplicated one?
>>> 
>>> The scenario goes something like this:
>>> 
>>> 1. Guest creates an HPTE with RO permissions.
>>> 2. KSM decides the page is identical to another page and changes the
>>>  HPTE to point to a shared copy.  Permissions are still RO.
>>> 3. Guest decides it wants write access to the page and does an
>>>  H_PROTECT hcall to change the permissions on the HPTE to RW.
>>> 
>>> The bug is that we actually make the requested change in step 3.
>>> Instead we should leave it at RO, then when the guest tries to write
>>> to the page, we take a hypervisor page fault, copy the page and give
>>> the guest write access to its own copy of the page.
>>> 
>>> So what this patch does is add code to H_PROTECT so that if the guest
>>> is requesting RW access, we check the Linux PTE to see if the
>>> underlying guest page is RO, and if so reduce the permissions in the
>>> HPTE to RO.
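
For anyone following along, the rule described above boils down to a small clamp. This is only a sketch under simplifying assumptions: the names (sk_perm, sk_hw_perm_for_request) are made up for illustration, a two-value enum stands in for the real PP permission bits, and it is not the code from the patch.

```c
#include <stdbool.h>

/* Hypothetical two-level permission; the real HPTE uses PP bits. */
enum sk_perm { SK_PERM_RO, SK_PERM_RW };

/*
 * Permission to install in the hardware HPTE for an H_PROTECT request.
 * linux_pte_writable is assumed to come from a lookup of the host
 * Linux PTE that backs the guest page.
 */
enum sk_perm sk_hw_perm_for_request(enum sk_perm requested,
                                    bool linux_pte_writable)
{
    if (requested == SK_PERM_RW && !linux_pte_writable)
        return SK_PERM_RO; /* e.g. KSM-shared page: wait for the write fault */
    return requested;
}
```

A request for RW on a page whose host PTE is RO comes back as RO; every other combination passes through unchanged.
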
>> 
>> But this will be guest visible, because now H_PROTECT doesn't actually mark the page R/W in the HTAB, right?
> 
> No - the guest view of the HPTE has R/W permissions.  The guest view
> of the HPTE is made up of doubleword 0 from the real HPT plus
> rev->guest_rpte for doubleword 1 (where rev is the entry in the revmap
> array, kvm->arch.revmap, for the HPTE).  The guest view can be
> different from the host/hardware view, which is in the real HPT.  For
> instance, the guest view of a HPTE might be valid but the host view
> might be invalid because the underlying real page has been paged out -
> in that case we use a software bit which we call HPTE_V_ABSENT to
> remind ourselves that there is something valid there from the guest's
> point of view.  Or the guest view can be R/W but the host view is RO,
> as in the case where KSM has merged the page.
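
A rough model of how the guest view is assembled, as I understand the description above. The struct names, the function name and the bit values are placeholders chosen for this sketch and are not meant to match the kernel's definitions; only the idea (doubleword 0 from the real HPT with ABSENT shown as VALID, doubleword 1 from rev->guest_rpte) is taken from the explanation.

```c
#include <stdint.h>

/* Illustrative bit values, chosen for this sketch only. */
#define SK_V_VALID  0x01ULL
#define SK_V_ABSENT 0x20ULL

struct sk_hpte { uint64_t v, r; };      /* real HPT entry: host/hardware view */
struct sk_rev { uint64_t guest_rpte; }; /* revmap entry: guest's doubleword 1 */

/* Build the guest's view of an entry from the real HPTE and its revmap entry. */
struct sk_hpte sk_guest_view(struct sk_hpte real, struct sk_rev rev)
{
    struct sk_hpte g;

    g.v = real.v;
    if (g.v & SK_V_ABSENT) {
        /* Paged out by the host, but still valid as far as the guest knows. */
        g.v &= ~SK_V_ABSENT;
        g.v |= SK_V_VALID;
    }
    g.r = rev.guest_rpte; /* guest-requested permissions live here */
    return g;
}
```

So the permissions the guest reads back always come from guest_rpte, whatever the hardware entry currently allows.
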
> 
>> So the flow with this patch is:
>> 
>>  - guest page permission fault
> 
> This comes through the host (kvmppc_hpte_hv_fault()) which looks at
> the guest view of the HPTE, sees that it has RO permissions, and sends
> the page fault to the guest.
> 
>>  - guest does H_PROTECT to mark page r/w
>>  - H_PROTECT doesn't do anything
>>  - guest returns from permission handler, triggers write fault
> 
> This comes once again to kvmppc_hpte_hv_fault(), which sees that the
> guest view of the HPTE has R/W permissions now, and sends the page
> fault to kvmppc_book3s_hv_page_fault(), which requests write access to
> the page, possibly triggering copy-on-write or whatever, and updates
> the real HPTE to have R/W permissions and possibly point to a new page
> of memory.
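
To check my understanding of the two fault paths, here is a toy state machine. Everything in it is hypothetical (sk_entry, sk_handle_write_fault, the boolean flags); it only mirrors the routing decision described above and does not attempt to model the real kvmppc_hpte_hv_fault() or kvmppc_book3s_hv_page_fault().

```c
#include <stdbool.h>

enum sk_fault_action {
    SK_REFLECT_TO_GUEST, /* guest's own HPTE forbids the write */
    SK_HOST_PAGE_FAULT   /* guest allows it; host must make the page writable */
};

/* Hypothetical per-HPTE state: one flag per view. */
struct sk_entry {
    bool guest_writable; /* guest view, changed by H_PROTECT */
    bool host_writable;  /* hardware view, changed only by the host */
};

/* Route a write fault on an entry whose hardware view is read-only. */
enum sk_fault_action sk_handle_write_fault(struct sk_entry *e)
{
    if (!e->guest_writable)
        return SK_REFLECT_TO_GUEST;

    /* Stand-in for the host obtaining write access, possibly via COW. */
    e->host_writable = true;
    return SK_HOST_PAGE_FAULT;
}
```

The first fault (guest view RO) is reflected to the guest; after H_PROTECT sets the guest view to RW, the second fault is handled by the host, which upgrades the hardware entry.
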
> 
>> 
>> 2 questions here:
>> 
>> How does the host know that the page is actually r/w?
> 
> I assume you mean RO?  It looks up the memslot for the guest physical
> address (which it gets from rev->guest_rpte), uses that to work out
> the host virtual address (i.e. the address in qemu's address space),
> looks up the Linux PTE in qemu's Linux page tables, and looks at the
> _PAGE_RW bit there.
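
The first two steps of that lookup (memslot, then host virtual address) can be sketched as below. The struct is a simplified stand-in loosely modelled on a KVM memslot, the sk_ names are invented, and 4 KiB pages are assumed purely for illustration. The final step, walking qemu's Linux page tables and testing _PAGE_RW, is not modelled here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SK_PAGE_SHIFT 12 /* assumed 4 KiB pages, for illustration only */

/* Simplified stand-in for a memslot. */
struct sk_memslot {
    uint64_t base_gfn;       /* first guest frame number covered */
    uint64_t npages;         /* number of pages in the slot */
    uint64_t userspace_addr; /* host virtual address of the first page */
};

/* Return the slot containing gfn, or NULL if no slot covers it. */
const struct sk_memslot *sk_find_slot(const struct sk_memslot *slots,
                                      size_t n, uint64_t gfn)
{
    for (size_t i = 0; i < n; i++)
        if (gfn >= slots[i].base_gfn &&
            gfn - slots[i].base_gfn < slots[i].npages)
            return &slots[i];
    return NULL;
}

/* Translate a guest physical address to a host virtual address. */
bool sk_gpa_to_hva(const struct sk_memslot *slots, size_t n,
                   uint64_t gpa, uint64_t *hva)
{
    uint64_t gfn = gpa >> SK_PAGE_SHIFT;
    uint64_t offset = gpa & ((1ULL << SK_PAGE_SHIFT) - 1);
    const struct sk_memslot *s = sk_find_slot(slots, n, gfn);

    if (!s)
        return false; /* no memslot backs this address */
    *hva = s->userspace_addr + ((gfn - s->base_gfn) << SK_PAGE_SHIFT) + offset;
    return true;
}
```
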
> 
>> How does this work on 970? I thought page faults always go straight to the guest there.
> 
> They do, which is why PPC970 can't do any of this.  On PPC970 we have
> kvm->arch.using_mmu_notifiers == 0, and that makes the code pin every
> page of guest memory that is mapped by a guest HPTE (with a Linux
> guest, that means every page, because of the linear mapping).  On
> POWER7 we have kvm->arch.using_mmu_notifiers == 1, which enables
> host paging and deduplication of guest memory.

Thanks a lot for the detailed explanation! Maybe you guys should just release an HV-capable POWER7 system publicly, so we can deprecate 970 support. That would make a few things quite a bit easier ;)

Thanks, applied to kvm-ppc-next.

Alex
