On Mon, Mar 23, 2020 at 07:40:31AM -0700, Sean Christopherson wrote:
> On Sun, Mar 22, 2020 at 07:54:32PM -0700, Mike Kravetz wrote:
> > On 3/22/20 7:03 PM, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
> > >
> > > On 2020/3/22 7:38, Mike Kravetz wrote:
> > >> On 2/21/20 7:33 PM, Longpeng(Mike) wrote:
> > >>> From: Longpeng <longpeng2@xxxxxxxxxx>
> >
> > I have not looked closely at the generated code for lookup_address_in_pgd.
> > It appears that it would dereference p4d, pud and pmd multiple times. Sean
> > seemed to think there was something about the calling context that would
> > make issues like those seen with huge_pte_offset less likely to happen. I
> > do not know if this is accurate or not.
>
> Only for KVM's calls to lookup_address_in_mm(); I can't speak to other
> callers that funnel into lookup_address_in_pgd().
>
> KVM uses a combination of tracking and blocking mmu_notifier calls to ensure
> that PTE changes/invalidations between gup() and lookup_address_in_pgd() cause
> a restart of the faulting instruction, and that pending changes/invalidations
> are blocked until installation of the pfn in KVM's secondary MMU completes.
>
> kvm_mmu_page_fault():
>
> 	mmu_seq = kvm->mmu_notifier_seq;
> 	smp_rmb();
>
> 	pfn = gup(hva);
>
> 	spin_lock(&kvm->mmu_lock);
> 	smp_rmb();
> 	if (kvm->mmu_notifier_seq != mmu_seq)
> 		goto out_unlock;	// Restart guest, i.e. retry the fault
>
> 	lookup_address_in_mm(hva, ...);

It works because the mmu_lock spinlock is taken, via the
invalidate_range_start()/end() callbacks, before and after any change to the
page tables.

So if you are inside the spinlock and mmu_notifier_count == 0, then nobody
can be writing to the page tables. It is effectively a full page table lock,
so page table reads under that lock do not need to worry about data races.

Jason
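
[Editor's note: for readers following along, here is a rough sketch of the
notifier-side half of the protocol described above, in the same pseudocode
style as Sean's snippet. It is simplified from virt/kvm/kvm_main.c of that
era and omits details (flushing, range handling, memory barriers), so treat
it as an approximation rather than the exact current code.]

kvm_mmu_notifier_invalidate_range_start():

	spin_lock(&kvm->mmu_lock);
	kvm->mmu_notifier_count++;		// block faults from installing pfns
	// unmap/flush the affected range from KVM's secondary MMU
	spin_unlock(&kvm->mmu_lock);

kvm_mmu_notifier_invalidate_range_end():

	spin_lock(&kvm->mmu_lock);
	kvm->mmu_notifier_seq++;		// force in-flight faults to retry
	kvm->mmu_notifier_count--;
	spin_unlock(&kvm->mmu_lock);

Because mmu_notifier_count stays non-zero for the whole start/end window and
mmu_notifier_seq is bumped under mmu_lock at the end of it, a fault that
rechecks the sequence count (and, in the real code, the count) under mmu_lock
either observes the invalidation and retries, or can trust that the page
tables have not changed since its gup().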