Hi Marc,

On 9/9/20 6:12 PM, Marc Zyngier wrote:
> Hi Alex,
>
> On 2020-09-09 15:20, Alexandru Elisei wrote:
>> Hi Will,
>>
>> On 9/7/20 4:23 PM, Will Deacon wrote:
>>> Convert user_mem_abort() to call kvm_pgtable_stage2_relax_perms() when
>>> handling a stage-2 permission fault and kvm_pgtable_stage2_map() when
>>> handling a stage-2 translation fault, rather than walking the page-table
>>> manually.
>>>
>>> Cc: Marc Zyngier <maz@xxxxxxxxxx>
>>> Cc: Quentin Perret <qperret@xxxxxxxxxx>
>>> Reviewed-by: Gavin Shan <gshan@xxxxxxxxxx>
>>> Signed-off-by: Will Deacon <will@xxxxxxxxxx>
>>> ---
>>>  arch/arm64/kvm/mmu.c | 124 +++++++++++++++----------------------------
>>>  1 file changed, 44 insertions(+), 80 deletions(-)
>>>
>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>> index 0af48f35c8dd..dc923e873dad 100644
>>> --- a/arch/arm64/kvm/mmu.c
>>> +++ b/arch/arm64/kvm/mmu.c
>>> @@ -1496,18 +1496,19 @@ static int user_mem_abort(struct kvm_vcpu *vcpu,
>>> phys_addr_t fault_ipa,
>>>  {
>>>  	int ret;
>>>  	bool write_fault, writable, force_pte = false;
>>> -	bool exec_fault, needs_exec;
>>> +	bool exec_fault;
>>> +	bool device = false;
>>>  	unsigned long mmu_seq;
>>> -	gfn_t gfn = fault_ipa >> PAGE_SHIFT;
>>>  	struct kvm *kvm = vcpu->kvm;
>>>  	struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
>>>  	struct vm_area_struct *vma;
>>>  	short vma_shift;
>>> +	gfn_t gfn;
>>>  	kvm_pfn_t pfn;
>>> -	pgprot_t mem_type = PAGE_S2;
>>>  	bool logging_active = memslot_is_logging(memslot);
>>> -	unsigned long vma_pagesize, flags = 0;
>>> -	struct kvm_s2_mmu *mmu = vcpu->arch.hw_mmu;
>>> +	unsigned long vma_pagesize;
>>> +	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
>>> +	struct kvm_pgtable *pgt;
>>>
>>>  	write_fault = kvm_is_write_fault(vcpu);
>>>  	exec_fault = kvm_vcpu_trap_is_iabt(vcpu);
>>> @@ -1540,22 +1541,24 @@ static int user_mem_abort(struct kvm_vcpu *vcpu,
>>> phys_addr_t fault_ipa,
>>>  		vma_pagesize = PAGE_SIZE;
>>>  	}
>>>
>>> -	/*
>>> -	 * The stage2 has a minimum of 2 level table (For arm64 see
>>> -	 * kvm_arm_setup_stage2()). Hence, we are guaranteed that we can
>>> -	 * use PMD_SIZE huge mappings (even when the PMD is folded into PGD).
>>> -	 * As for PUD huge maps, we must make sure that we have at least
>>> -	 * 3 levels, i.e, PMD is not folded.
>>> -	 */
>>> -	if (vma_pagesize == PMD_SIZE ||
>>> -	    (vma_pagesize == PUD_SIZE && kvm_stage2_has_pmd(kvm)))
>>> -		gfn = (fault_ipa & huge_page_mask(hstate_vma(vma))) >> PAGE_SHIFT;
>>> +	if (vma_pagesize == PMD_SIZE || vma_pagesize == PUD_SIZE)
>>> +		fault_ipa &= huge_page_mask(hstate_vma(vma));
>>
>> This looks correct to me - if !kvm_stage2_has_pmd(), then the PMD is folded into
>> the PUD and PGD, and PMD_SIZE == PUD_SIZE. I also like the fact that we update
>> both gfn **and** fault_ipa; the previous version updated only gfn, which made
>> gfn != (fault_ipa >> PAGE_SHIFT).
>>
>>> +
>>> +	gfn = fault_ipa >> PAGE_SHIFT;
>>>  	mmap_read_unlock(current->mm);
>>>
>>> -	/* We need minimum second+third level pages */
>>> -	ret = kvm_mmu_topup_memory_cache(memcache, kvm_mmu_cache_min_pages(kvm));
>>> -	if (ret)
>>> -		return ret;
>>> +	/*
>>> +	 * Permission faults just need to update the existing leaf entry,
>>> +	 * and so normally don't require allocations from the memcache. The
>>> +	 * only exception to this is when dirty logging is enabled at runtime
>>> +	 * and a write fault needs to collapse a block entry into a table.
>>> +	 */
>>> +	if (fault_status != FSC_PERM || (logging_active && write_fault)) {
>>> +		ret = kvm_mmu_topup_memory_cache(memcache,
>>> +						 kvm_mmu_cache_min_pages(kvm));
>>> +		if (ret)
>>> +			return ret;
>>> +	}
>>
>> I'm not 100% sure about this.
>>
>> I don't think we gain much over the previous code - if we had allocated cache
>> objects which we hadn't used, we would have used them the next time
>> user_mem_abort() is called (kvm_mmu_topup_memory_cache() checks if we have the
>> required number of objects in the cache and returns early).
>>
>> I'm not sure the condition is entirely correct either - if stage 2 already has a
>> mapping for the IPA and we only need to set write permissions, according to the
>> condition above we still try to top up the cache, even though we don't strictly
>> need to.
>
> That's because if you are logging, you may have to split an existing block
> mapping and map a single page instead. This requires (at least) an extra
> level, and that's why you need to top up the cache in this case.
>
>>
>>> [..]
>>>
>>> -
>>> -	if (needs_exec)
>>> -		new_pmd = kvm_s2pmd_mkexec(new_pmd);
>>> +	if (device)
>>> +		prot |= KVM_PGTABLE_PROT_DEVICE;
>>> +	else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
>>> +		prot |= KVM_PGTABLE_PROT_X;
>>>
>>> -	ret = stage2_set_pmd_huge(mmu, memcache, fault_ipa, &new_pmd);
>>> +	if (fault_status == FSC_PERM && !(logging_active && writable)) {
>>
>> I don't understand the second part of the condition (!(logging_active &&
>> writable)). With logging active, when we get a fault because of a missing
>> stage 2 entry, we map the IPA as read-only at stage 2. If I understand this
>> code correctly, when the guest then tries to write to the same IPA,
>> writable == true and we map the IPA again instead of relaxing the
>> permissions. Why is that?
>
> See my reply above: logging means potentially adding a new level, so we
> treat it as a new mapping altogether (break the block mapping, TLBI, install
> the new mapping one level down).
>
> All the other cases are happily handled by just relaxing the permissions.

I think I understand now; I hadn't realized that we never check whether the IPA
is already mapped when we get a write fault on a memslot with dirty page logging
enabled. We unconditionally map a new page marked writable, regardless of whether
the IPA was mapped before, and regardless of whether it was mapped with a block
mapping or a page. The code makes sense with that in mind. (I've appended two
small sketches after my sign-off, mostly as notes to self: one spelling out the
gfn/fault_ipa arithmetic, and one condensing the top-up and map-vs-relax
decision.)

For what it's worth:

Reviewed-by: Alexandru Elisei <alexandru.elisei@xxxxxxx>

Thanks,
Alex
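P.S. Below is a small sketch of the gfn/fault_ipa point above, mostly to convince
myself of the arithmetic. The address is made up and only the old/new expressions
are the ones from the hunk; it's a stand-alone user-space toy, not kernel code:

#include <stdio.h>

int main(void)
{
	const unsigned long page_shift = 12;
	const unsigned long pmd_size = 2UL << 20;	/* 2M block */
	const unsigned long huge_mask = ~(pmd_size - 1);
	unsigned long fault_ipa = 0x40123456UL;		/* made-up faulting IPA */
	unsigned long gfn;

	/* Old code: only gfn was aligned, fault_ipa stayed unaligned. */
	gfn = (fault_ipa & huge_mask) >> page_shift;
	printf("old: gfn = %#lx, fault_ipa >> PAGE_SHIFT = %#lx\n",
	       gfn, fault_ipa >> page_shift);

	/* New code: fault_ipa itself is aligned first, so the two agree. */
	fault_ipa &= huge_mask;
	gfn = fault_ipa >> page_shift;
	printf("new: gfn = %#lx, fault_ipa >> PAGE_SHIFT = %#lx\n",
	       gfn, fault_ipa >> page_shift);

	return 0;
}

This prints gfn = 0x40000 in both cases, but fault_ipa >> PAGE_SHIFT is 0x40123
with the old behaviour and 0x40000 (equal to gfn) with the new one.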
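P.P.S. And this is how I ended up condensing the top-up / map-vs-relax decision,
again as a stand-alone toy model rather than the real user_mem_abort(). The two
conditions are the ones from the patch; the enum, the function name and the
simplification of treating 'writable' as equivalent to 'write_fault' (which holds
for a write fault on a normal, non-readonly memslot) are mine:

#include <stdbool.h>
#include <stdio.h>

enum action { RELAX_PERMS, TOPUP_AND_MAP };

static enum action stage2_fault_model(bool perm_fault, bool logging_active,
				      bool write_fault)
{
	/*
	 * Pure permission fault: just relax the existing leaf entry
	 * (kvm_pgtable_stage2_relax_perms() in the patch).
	 */
	if (perm_fault && !(logging_active && write_fault))
		return RELAX_PERMS;

	/*
	 * Everything else - a translation fault, or a write fault on a
	 * logging memslot that may have to split a block into a table -
	 * tops up the memcache and then calls kvm_pgtable_stage2_map().
	 * With writable treated as write_fault, reaching this point means
	 * !perm_fault || (logging_active && write_fault), which is exactly
	 * the patch's top-up condition, so every map sees a topped-up cache.
	 */
	return TOPUP_AND_MAP;
}

int main(void)
{
	const char *names[] = { "relax perms", "top up + map" };

	printf("perm fault, write, no logging: %s\n",
	       names[stage2_fault_model(true, false, true)]);
	printf("perm fault, write, logging:    %s\n",
	       names[stage2_fault_model(true, true, true)]);
	printf("translation fault:             %s\n",
	       names[stage2_fault_model(false, true, true)]);
	return 0;
}

The middle case is the one Marc explained above: with logging active, a write
that hits an existing mapping is installed as a brand new page mapping (one
level down if a block had to be split) instead of having its permissions
relaxed, which is also why it needs the cache topped up.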