Re: [PATCH v2] ARM/arm64: KVM: test properly for a PTE's uncachedness

Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx> · Tue, 10 Nov 2015 14:15:45 +0100

On 10 November 2015 at 13:22, Christoffer Dall
<christoffer.dall@xxxxxxxxxx> wrote:
> On Tue, Nov 10, 2015 at 10:45:37AM +0100, Ard Biesheuvel wrote:
>> Hi all,
>>
>> I wonder if this is a better way to address the problem. It looks at
>> the nature of the memory rather than the nature of the mapping, which
>> is probably a more reliable indicator of whether cache maintenance is
>> required when performing the unmap.
>>
>>
>> -----------8<----------------
>> The open coded tests for checking whether a PTE maps a page as
>> uncached use a flawed 'pte_val(xxx) & CONST != CONST' pattern,
>> which is not guaranteed to work since the type of a mapping is
>> not a set of mutually exclusive bits
>>
>> For HYP mappings, the type is an index into the MAIR table (i.e, the
>> index itself does not contain any information whatsoever about the
>> type of the mapping), and for stage-2 mappings it is a bit field where
>> normal memory and device types are defined as follows:
>>
>>     #define MT_S2_NORMAL            0xf
>>     #define MT_S2_DEVICE_nGnRE      0x1
>>
>> I.e., masking *and* comparing with the latter matches on the former,
>> and we have been getting lucky merely because the S2 device mappings
>> also have the PTE_UXN bit set, or we would misidentify memory mappings
>> as device mappings.
>>
>> Since the unmap_range() code path (which contains one instance of the
>> flawed test) is used both for HYP mappings and stage-2 mappings, and
>> considering the difference between the two, it is non-trivial to fix
>> this by rewriting the tests in place, as it would involve passing
>> down the type of mapping through all the functions.
>>
>> However, since HYP mappings and stage-2 mappings both deal with host
>> physical addresses, we can simply check whether the mapping is backed
>> by memory that is managed by the host kernel, and only perform the
>> D-cache maintenance if this is the case.
>>
>> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx>
>> ---
>>  arch/arm/kvm/mmu.c | 15 +++++++--------
>>  1 file changed, 7 insertions(+), 8 deletions(-)
>>
>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>> index 6984342da13d..7dace909d5cf 100644
>> --- a/arch/arm/kvm/mmu.c
>> +++ b/arch/arm/kvm/mmu.c
>> @@ -98,6 +98,11 @@ static void kvm_flush_dcache_pud(pud_t pud)
>>       __kvm_flush_dcache_pud(pud);
>>  }
>>
>> +static bool kvm_is_device_pfn(unsigned long pfn)
>> +{
>> +     return !pfn_valid(pfn);
>> +}
>> +
>>  /**
>>   * stage2_dissolve_pmd() - clear and flush huge PMD entry
>>   * @kvm:     pointer to kvm structure.
>> @@ -213,7 +218,7 @@ static void unmap_ptes(struct kvm *kvm, pmd_t *pmd,
>>                       kvm_tlb_flush_vmid_ipa(kvm, addr);
>>
>>                       /* No need to invalidate the cache for device mappings */
>> -                     if ((pte_val(old_pte) & PAGE_S2_DEVICE) != PAGE_S2_DEVICE)
>> +                     if (!kvm_is_device_pfn(__phys_to_pfn(addr)))
>>                               kvm_flush_dcache_pte(old_pte);
>>
>>                       put_page(virt_to_page(pte));
>> @@ -305,8 +310,7 @@ static void stage2_flush_ptes(struct kvm *kvm, pmd_t *pmd,
>>
>>       pte = pte_offset_kernel(pmd, addr);
>>       do {
>> -             if (!pte_none(*pte) &&
>> -                 (pte_val(*pte) & PAGE_S2_DEVICE) != PAGE_S2_DEVICE)
>> +             if (!pte_none(*pte) && !kvm_is_device_pfn(__phys_to_pfn(addr)))
>>                       kvm_flush_dcache_pte(*pte);
>>       } while (pte++, addr += PAGE_SIZE, addr != end);
>>  }
>> @@ -1037,11 +1041,6 @@ static bool kvm_is_write_fault(struct kvm_vcpu *vcpu)
>>       return kvm_vcpu_dabt_iswrite(vcpu);
>>  }
>>
>> -static bool kvm_is_device_pfn(unsigned long pfn)
>> -{
>> -     return !pfn_valid(pfn);
>> -}
>> -
>>  /**
>>   * stage2_wp_ptes - write protect PMD range
>>   * @pmd:     pointer to pmd entry
>> --
>> 1.9.1
>>
>
> So PAGE_HYP_DEVICE is used only to map the vgic-v2 regions and
> PAGE_S2_DEVICE is used to map the vgic VCPU interface and for all memory
> regions where the vma has (vm_flags & VM_PFNMAP).
>
> Will these, and only these, cases be covered by the pfn_valid check?
>

The pfn_valid() check will ensure that cache maintenance is only
performed on regions that are known to the host as memory, are managed
by the host (i.e., there is a struct page associated with them) and
will be accessed by the host via cacheable mappings (they are covered
by the linear mapping, or [on ARM] will be kmap'ed cacheable if they
are highmem). If you look at the commit that introduced these tests
(363ef89f8e9b arm/arm64: KVM: Invalidate data cache on unmap), the
concern it addresses is that the guest may perform uncached accesses
to regions that the host has mapped cacheable, meaning guest writes
may be shadowed by clean cachelines, making them invisble to cache
coherent I/O. So afaict, pfn_valid() is an appropriate check here.

pfn_valid() will not misidentify device regions as memory (unless the
host is really broken) so this should fix Pavel's case. The converse
case (a memory region misidentified as a device) could only happen for
a carve-out (i.e., via the /reserved-memory node) that is mapped
inside the guest via a pass-through (VM_PFNMAP) mapping. That case is
already dodgy, since the guest accesses would be forced
uncached/ordered due to the fact that those mappings are forced
PAGE_S2_DEVICE at stage 2 (as you mention), and would also be
misidentified by the current code (due to the PAGE_S2_DEVICE
attributes)

-- 
Ard.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html