On Wed, Jul 21, 2021, Will Deacon wrote:
> > For the page table liveness, KVM implements mmu_notifier_ops.release, which is
> > invoked at the beginning of exit_mmap(), before the page tables are freed.  In
> > its implementation, KVM takes mmu_lock and zaps all its shadow page tables, a.k.a.
> > the stage2 tables in KVM arm64.  The flow in question, get_user_mapping_size(),
> > also runs under mmu_lock, and so effectively blocks exit_mmap() and thus is
> > guaranteed to run with live userspace tables.
>
> Unless I missed a case, exit_mmap() only runs when mm_struct::mm_users drops
> to zero, right?

Yep.

> The vCPU tasks should hold references to that afaict, so I don't think it
> should be possible for exit_mmap() to run while there are vCPUs running with
> the corresponding page-table.

Ah, right, I was thinking of non-KVM code that operates on the page tables without
holding a reference to mm_users.

> > Looking at the arm64 code, one thing I'm not clear on is whether arm64 correctly
> > handles the case where exit_mmap() wins the race.  The invalidate_range hooks will
> > still be called, so userspace page tables aren't a problem, but
> > kvm_arch_flush_shadow_all() -> kvm_free_stage2_pgd() nullifies mmu->pgt without
> > any additional notifications that I see.  x86 deals with this by ensuring its
> > top-level TDP entry (stage2 equivalent) is valid while the page fault handler is
> > running.
>
> But the fact that x86 handles this race has me worried. What am I missing?

I don't think you're missing anything.  I forgot that KVM_RUN would require an
elevated mm_users.  x86 does handle the impossible race, but that's coincidental.
The extra protections in x86 are to deal with other cases where a vCPU's top-level
SPTE can be invalidated while the vCPU is running.
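
Going back to the mmu_lock ordering described at the top, here's a rough userspace
model of the serialization, purely for illustration: the names, the pthread
scaffolding, and the "table pointer" are stand-ins, not the actual KVM code.  The
point is only that the teardown side (.release -> kvm_free_stage2_pgd() in the real
code) and the walker (get_user_mapping_size()) take the same lock, so the walker
either sees a live table or observes that teardown already happened.

        /* Simplified model, NOT kernel code; all names are illustrative. */
        #include <pthread.h>
        #include <stdio.h>
        #include <stdlib.h>

        static pthread_mutex_t mmu_lock = PTHREAD_MUTEX_INITIALIZER;
        static int *stage2_pgt;         /* stands in for mmu->pgt */

        /* Models the .release path: zap/free only while holding mmu_lock. */
        static void *release_thread(void *arg)
        {
                (void)arg;
                pthread_mutex_lock(&mmu_lock);
                free(stage2_pgt);
                stage2_pgt = NULL;
                pthread_mutex_unlock(&mmu_lock);
                return NULL;
        }

        /* Models the fault-path walker: walk only while holding mmu_lock. */
        static void *walker_thread(void *arg)
        {
                (void)arg;
                pthread_mutex_lock(&mmu_lock);
                if (stage2_pgt)
                        printf("walked live table: %d\n", *stage2_pgt);
                else
                        printf("table already torn down, bail\n");
                pthread_mutex_unlock(&mmu_lock);
                return NULL;
        }

        int main(void)
        {
                pthread_t rel, walk;

                stage2_pgt = malloc(sizeof(*stage2_pgt));
                *stage2_pgt = 42;

                pthread_create(&walk, NULL, walker_thread, NULL);
                pthread_create(&rel, NULL, release_thread, NULL);
                pthread_join(walk, NULL);
                pthread_join(rel, NULL);
                return 0;
        }

Whichever thread grabs the lock second sees a consistent view; there's no window
where the walker dereferences a freed table.  That's the property the real code
relies on, modulo the mm_users point above that makes the exit_mmap() race moot
for running vCPUs in the first place.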