Hi Oliver,

On Thu, Mar 31, 2022 at 2:38 PM Oliver Upton <oupton@xxxxxxxxxx> wrote:
>
> It is possible to take a stage-2 permission fault on a page larger than
> PAGE_SIZE. For example, when running a guest backed by 2M HugeTLB, KVM
> eagerly maps at the largest possible block size. When dirty logging is
> enabled on a memslot, KVM does *not* eagerly split these 2M stage-2
> mappings and instead clears the write bit on the pte.
>
> Since dirty logging is always performed at PAGE_SIZE granularity, KVM
> lazily splits these 2M block mappings down to PAGE_SIZE in the stage-2
> fault handler. This operation must be done under the write lock. Since
> commit f783ef1c0e82 ("KVM: arm64: Add fast path to handle permission
> relaxation during dirty logging"), the stage-2 fault handler
> conditionally takes the read lock on permission faults with dirty
> logging enabled. As a result, it is possible to split a 2M block mapping
> while only holding the read lock.
>
> The problem is demonstrated by running kvm_page_table_test with 2M
> anonymous HugeTLB, which splats like so:
>
>  WARNING: CPU: 5 PID: 15276 at arch/arm64/kvm/hyp/pgtable.c:153 stage2_map_walk_leaf+0x124/0x158
>
>  [...]
>
>  Call trace:
>   stage2_map_walk_leaf+0x124/0x158
>   stage2_map_walker+0x5c/0xf0
>   __kvm_pgtable_walk+0x100/0x1d4
>   __kvm_pgtable_walk+0x140/0x1d4
>   __kvm_pgtable_walk+0x140/0x1d4
>   kvm_pgtable_walk+0xa0/0xf8
>   kvm_pgtable_stage2_map+0x15c/0x198
>   user_mem_abort+0x56c/0x838
>   kvm_handle_guest_abort+0x1fc/0x2a4
>   handle_exit+0xa4/0x120
>   kvm_arch_vcpu_ioctl_run+0x200/0x448
>   kvm_vcpu_ioctl+0x588/0x664
>   __arm64_sys_ioctl+0x9c/0xd4
>   invoke_syscall+0x4c/0x144
>   el0_svc_common+0xc4/0x190
>   do_el0_svc+0x30/0x8c
>   el0_svc+0x28/0xcc
>   el0t_64_sync_handler+0x84/0xe4
>   el0t_64_sync+0x1a4/0x1a8
>
> Fix the issue by only acquiring the read lock if the guest faulted on a
> PAGE_SIZE granule w/ dirty logging enabled. Since it is possible for the
> faulting IPA to get collapsed into a larger block mapping before the read
> lock is acquired, retry the faulting instruction any time the fault
> cannot be fixed by relaxing permissions. In so doing, the fault handler
> will acquire the write lock for the subsequent fault on a
> larger-than-PAGE_SIZE mapping and split the block safely behind the
> write lock.
>
> Fixes: f783ef1c0e82 ("KVM: arm64: Add fast path to handle permission relaxation during dirty logging")
> Cc: Jing Zhang <jingzhangos@xxxxxxxxxx>
> Signed-off-by: Oliver Upton <oupton@xxxxxxxxxx>
> ---
>
> Applies cleanly to kvmarm/fixes at the following commit:
>
> 8872d9b3e35a ("KVM: arm64: Drop unneeded minor version check from PSCI v1.x handler")
>
> Tested the patch by running KVM selftests. Additionally, I did 10
> iterations of the kvm_page_table_test with 2M anon HugeTLB memory.
>
> It is expected that this patch will cause fault serialization in the
> pathological case where all vCPUs are faulting on the same granule of
> memory, as every vCPU will attempt to acquire the write lock. The only
> safe way to cure this contention is to dissolve pages eagerly outside of
> the stage-2 fault handler (like x86).
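
To restate the locking policy in isolation before the diff, here is a
minimal userspace model of the lock-selection rule the patch introduces.
This is a sketch, not the kernel code; PAGE_SIZE, the FSC_PERM value, and
the helper name are illustrative:

  /* Standalone model of the patch's lock-selection policy (not kernel code). */
  #include <stdbool.h>
  #include <stdio.h>

  #define PAGE_SIZE 4096UL
  #define PMD_SIZE  (2UL << 20)  /* 2M block */
  #define FSC_PERM  0x0c         /* permission fault; value illustrative */

  static bool use_read_lock(bool logging_active, unsigned int fault_status,
                            bool write_fault, unsigned long fault_granule)
  {
          /*
           * The read lock is only safe when no block split can be
           * required, i.e. a write permission fault on a PAGE_SIZE
           * granule while dirty logging is active.
           */
          return logging_active && fault_status == FSC_PERM &&
                 write_fault && fault_granule == PAGE_SIZE;
  }

  int main(void)
  {
          /* 2M granule: a split may be needed, serialize on the write lock. */
          printf("2M perm fault -> %s lock\n",
                 use_read_lock(true, FSC_PERM, true, PMD_SIZE) ? "read" : "write");
          /* PAGE_SIZE granule: permission relaxation only, read lock suffices. */
          printf("4K perm fault -> %s lock\n",
                 use_read_lock(true, FSC_PERM, true, PAGE_SIZE) ? "read" : "write");
          return 0;
  }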
>
>  arch/arm64/kvm/mmu.c | 25 ++++++++++++++++++++-----
>  1 file changed, 20 insertions(+), 5 deletions(-)
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 0d19259454d8..9384325bf3df 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1079,7 +1079,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	gfn_t gfn;
>  	kvm_pfn_t pfn;
>  	bool logging_active = memslot_is_logging(memslot);
> -	bool logging_perm_fault = false;
> +	bool use_read_lock = false;
>  	unsigned long fault_level = kvm_vcpu_trap_get_fault_level(vcpu);
>  	unsigned long vma_pagesize, fault_granule;
>  	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> @@ -1114,7 +1114,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	if (logging_active) {
>  		force_pte = true;
>  		vma_shift = PAGE_SHIFT;
> -		logging_perm_fault = (fault_status == FSC_PERM && write_fault);
> +		use_read_lock = (fault_status == FSC_PERM && write_fault &&
> +				 fault_granule == PAGE_SIZE);
>  	} else {
>  		vma_shift = get_vma_page_shift(vma, hva);
>  	}
> @@ -1218,7 +1219,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	 * logging dirty logging, only acquire read lock for permission
>  	 * relaxation.
>  	 */
> -	if (logging_perm_fault)
> +	if (use_read_lock)
>  		read_lock(&kvm->mmu_lock);
>  	else
>  		write_lock(&kvm->mmu_lock);
> @@ -1267,10 +1268,24 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	 */
>  	if (fault_status == FSC_PERM && vma_pagesize == fault_granule) {
>  		ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);

When use_read_lock is set to true, it appears the above condition will
always be true (since fault_granule is PAGE_SIZE and force_pte is true
in this case). So, I don't think the following "else" changes really
make any difference; a small standalone sketch at the end of this mail
spells out the reasoning. (Or am I overlooking something?)

Looking at the code, I doubt that even the original (pre-regression)
code detected the case described in the comment below in the first
place.

Thanks,
Reiji

> -	} else {
> +	} else if (!use_read_lock) {
>  		ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
>  					     __pfn_to_phys(pfn), prot,
>  					     memcache);
> +
> +	/*
> +	 * The read lock is taken if the FSC indicates that the guest faulted on
> +	 * a PAGE_SIZE granule. It is possible that the stage-2 fault raced with
> +	 * a map operation that collapsed the faulted address into a larger
> +	 * block mapping.
> +	 *
> +	 * Since KVM splits mappings down to PAGE_SIZE when dirty logging is
> +	 * enabled, it is necessary to hold the write lock for faults where
> +	 * fault_granule > PAGE_SIZE. Retry the faulting instruction and acquire
> +	 * the write lock on the next exit.
> +	 */
> +	} else {
> +		ret = -EAGAIN;
> +	}
>
>  	/* Mark the page dirty only if the fault is handled successfully */
> @@ -1280,7 +1295,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	}
>
>  out_unlock:
> -	if (logging_perm_fault)
> +	if (use_read_lock)
>  		read_unlock(&kvm->mmu_lock);
>  	else
>  		write_unlock(&kvm->mmu_lock);
> --
> 2.35.1.1094.g7c7d902a7c-goog
>
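
Here is the standalone sketch mentioned above. It encodes only what
use_read_lock == true implies in the patch: force_pte pins vma_pagesize
to PAGE_SIZE, and the lock condition itself pins fault_granule to
PAGE_SIZE, so the relax-perms guard holds unconditionally. Again, this
is userspace C modeling the logic, not kernel code, and the FSC_PERM
value is illustrative:

  /* Standalone sketch of the branch-selection argument above (not kernel code). */
  #include <assert.h>
  #include <stdbool.h>

  #define PAGE_SIZE 4096UL
  #define FSC_PERM  0x0c  /* illustrative */

  int main(void)
  {
          /* Everything implied by use_read_lock == true in the patch: */
          unsigned int fault_status = FSC_PERM;    /* required by the condition */
          unsigned long fault_granule = PAGE_SIZE; /* required by the condition */
          unsigned long vma_pagesize = PAGE_SIZE;  /* force_pte under logging */

          /* The guard on the relax-perms branch in user_mem_abort(): */
          bool relax = (fault_status == FSC_PERM &&
                        vma_pagesize == fault_granule);

          /*
           * Holds unconditionally given the inputs above, so the new
           * "else if"/"else" arms would never run while the read lock
           * is held.
           */
          assert(relax);
          return 0;
  }

If that reasoning holds, the -EAGAIN arm is dead code on the read-lock
path.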