On Mon, Oct 24, 2022 at 10:01 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Sun, Oct 23, 2022 at 10:42:49PM -0700, John Hubbard wrote:
> > On 10/22/22 04:14, Peter Zijlstra wrote:
> > > --- a/include/linux/pgtable.h
> > > +++ b/include/linux/pgtable.h
> > > @@ -260,15 +260,12 @@ static inline pte_t ptep_get(pte_t *ptep
> > >
> > >  #ifdef CONFIG_GUP_GET_PTE_LOW_HIGH
> > >  /*
> > > - * WARNING: only to be used in the get_user_pages_fast() implementation.
> > > - *
> > > - * With get_user_pages_fast(), we walk down the pagetables without taking any
> > > - * locks. For this we would like to load the pointers atomically, but sometimes
> > > - * that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE). What
> > > - * we do have is the guarantee that a PTE will only either go from not present
> > > - * to present, or present to not present or both -- it will not switch to a
> > > - * completely different present page without a TLB flush in between; something
> > > - * that we are blocking by holding interrupts off.
> > > + * For walking the pagetables without holding any locks. Some architectures
> > > + * (eg x86-32 PAE) cannot load the entries atomically without using expensive
> > > + * instructions. We are guaranteed that a PTE will only either go from not
> > > + * present to present, or present to not present -- it will not switch to a
> > > + * completely different present page without a TLB flush inbetween; which we
> > > + * are blocking by holding interrupts off.
> >
> > This is getting interesting. My latest understanding of this story is
> > that both the "before" and "after" versions of that comment are
> > incorrect! Because, as Jann Horn noticed recently [1], there might not
> > be any IPIs involved in a TLB flush, if x86 is running under a
> > hypervisor, and that breaks the chain of reasoning here.
>
> That mail doesn't really include enough detail. The way x86 HV TLB
> flushing is supposed to work is by making use of
> MMU_GATHER_RCU_TABLE_FREE. Specifically, something like:
>
>     vCPU0                           vCPU1
>
>                                     tlb_gather_mmu(&tlb, mm);
>
>                                     ....
>
>     local_irq_disable();
>     ... starts page-table walk ...
>
>     <schedules out; sets KVM_VCPU_PREEMPTED>
>
>                                     tlb_finish_mmu(&tlb)
>                                       ...
>                                       kvm_flush_tlb_multi()
>                                         if (state & KVM_VCPU_PREEMPTED)
>                                           if (try_cmpxchg(,&state, state | KVM_VCPU_FLUSH_TLB))
>                                             __cpumask_clear_cpu(cpu, flushmask);
>
>                                       tlb_remove_table_sync_one() / call_rcu()
>
>     <schedules back in>
>
>     ... continues page-table walk ...
>     local_irq_enable();
>
> If mmu gather is forced into tlb_remove_table_sync_one() (by memory
> pressure), then you've got your IPI back, otherwise it does call_rcu()
> and RCU itself will need vCPU0 to enable IRQs in order to make progress.
>
> Either way around, the actual freeing of the pages is delayed until the
> page-table walk is finished.
>
> What am I missing?

Unless I'm completely misunderstanding what's going on here, the whole
"remove_table" thing only happens when you "remove a table", meaning you
free an entire *pagetable*. Just zapping PTEs doesn't trigger that logic.
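
FWIW, for anyone reading along without a tree handy, here is roughly
what the two pieces of code in question look like. First, the
CONFIG_GUP_GET_PTE_LOW_HIGH helper whose comment the patch rewrites --
paraphrased from include/linux/pgtable.h, not quoted verbatim:

static inline pte_t ptep_get_lockless(pte_t *ptep)
{
	pte_t pte;

	do {
		/*
		 * A PAE PTE is 64 bit but is loaded as two 32 bit
		 * halves; retry if the low half changed underneath
		 * us.  This is only safe because a PTE cannot be
		 * replaced by a *different* present PTE without a
		 * TLB flush in between, which the walker holds off
		 * by keeping interrupts disabled.
		 */
		pte.pte_low = ptep->pte_low;
		smp_rmb();
		pte.pte_high = ptep->pte_high;
		smp_rmb();
	} while (unlikely(pte.pte_low != ptep->pte_low));

	return pte;
}

And second, the MMU_GATHER_RCU_TABLE_FREE logic Peter's timeline leans
on, heavily condensed from mm/mmu_gather.c (a sketch, not the literal
source):

static void tlb_remove_table_smp_sync(void *arg)
{
	/* The IPI itself is the synchronisation; nothing to do. */
}

static void tlb_remove_table_one(void *table)
{
	/*
	 * Memory-pressure fallback: no batch page could be
	 * allocated, so force an IPI to all CPUs.  A GUP-fast walker
	 * has IRQs disabled and thus holds this off until its
	 * page-table walk is finished.
	 */
	smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
	__tlb_remove_table(table);
}

void tlb_remove_table(struct mmu_gather *tlb, void *table)
{
	struct mmu_table_batch **batch = &tlb->batch;

	if (*batch == NULL) {
		*batch = (struct mmu_table_batch *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
		if (*batch == NULL) {
			tlb_remove_table_one(table);
			return;
		}
		(*batch)->nr = 0;
	}

	(*batch)->tables[(*batch)->nr++] = table;
	if ((*batch)->nr == MAX_TABLE_BATCH)
		tlb_table_flush(tlb);	/* frees the batch via call_rcu() */
}

Note that both the IPI fallback and the call_rcu() path only cover page
*tables* freed through tlb_remove_table(); that is exactly why zapping
individual PTEs, which never goes through this path, is the interesting
case here.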