On Mon, Oct 24, 2022 at 10:01 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Sun, Oct 23, 2022 at 10:42:49PM -0700, John Hubbard wrote:
> > On 10/22/22 04:14, Peter Zijlstra wrote:
> > > --- a/include/linux/pgtable.h
> > > +++ b/include/linux/pgtable.h
> > > @@ -260,15 +260,12 @@ static inline pte_t ptep_get(pte_t *ptep
> > >
> > >  #ifdef CONFIG_GUP_GET_PTE_LOW_HIGH
> > >  /*
> > > - * WARNING: only to be used in the get_user_pages_fast() implementation.
> > > - *
> > > - * With get_user_pages_fast(), we walk down the pagetables without taking any
> > > - * locks. For this we would like to load the pointers atomically, but sometimes
> > > - * that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE). What
> > > - * we do have is the guarantee that a PTE will only either go from not present
> > > - * to present, or present to not present or both -- it will not switch to a
> > > - * completely different present page without a TLB flush in between; something
> > > - * that we are blocking by holding interrupts off.
> > > + * For walking the pagetables without holding any locks. Some architectures
> > > + * (eg x86-32 PAE) cannot load the entries atomically without using expensive
> > > + * instructions. We are guaranteed that a PTE will only either go from not
> > > + * present to present, or present to not present -- it will not switch to a
> > > + * completely different present page without a TLB flush inbetween; which we
> > > + * are blocking by holding interrupts off.
> >
> > This is getting interesting. My latest understanding of this story is
> > that both the "before" and "after" versions of that comment are
> > incorrect! Because, as Jann Horn noticed recently [1], there might not
> > be any IPIs involved in a TLB flush, if x86 is running under a
> > hypervisor, and that breaks the chain of reasoning here.
>
> That mail doesn't really include enough detail. The way x86 HV TLB
> flushing is supposed to work is by making use of
> MMU_GATHER_RCU_TABLE_FREE. Specifically, something like:
>
>     vCPU0                           vCPU1
>
>                                     tlb_gather_mmu(&tlb, mm);
>
>                                     ....
>
>     local_irq_disable();
>     ... starts page-table walk ...
>
>     <schedules out; sets KVM_VCPU_PREEMPTED>
>
>                                     tlb_finish_mmu(&tlb)
>                                       ...
>                                       kvm_flush_tlb_multi()
>                                         if (state & KVM_VCPU_PREEMPTED)
>                                           if (try_cmpxchg(,&state, state | KVM_VCPU_FLUSH_TLB))
>                                             __cpumask_clear_cpu(cpu, flushmask);
>
>                                       tlb_remove_table_sync_one() / call_rcu()
>
>     <schedules back in>
>
>     ... continues page-table walk ...
>     local_irq_enable();
>
> If mmu gather is forced into tlb_remove_table_sync_one() (by memory
> pressure), then you've got your IPI back, otherwise it does call_rcu()
> and RCU itself will need vCPU0 to enable IRQs in order to make progress.
>
> Either way around, the actual freeing of the pages is delayed until the
> page-table walk is finished.
>
> What am I missing?

Unless I'm completely misunderstanding what's going on here, the whole
"remove_table" thing only happens when you "remove a table", meaning you
free an entire *pagetable*. Just zapping PTEs doesn't trigger that logic.
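
FWIW, for anyone reading along without a tree handy, here is roughly
what the two pieces of code in question look like. First, the
CONFIG_GUP_GET_PTE_LOW_HIGH helper whose comment the patch rewrites --
paraphrased from include/linux/pgtable.h, not quoted verbatim:

static inline pte_t ptep_get_lockless(pte_t *ptep)
{
	pte_t pte;

	do {
		/*
		 * A PAE PTE is 64 bit but is loaded as two 32 bit
		 * halves; retry if the low half changed underneath
		 * us.  This is only safe because a PTE cannot be
		 * replaced by a *different* present PTE without a
		 * TLB flush in between, which the walker holds off
		 * by keeping interrupts disabled.
		 */
		pte.pte_low = ptep->pte_low;
		smp_rmb();
		pte.pte_high = ptep->pte_high;
		smp_rmb();
	} while (unlikely(pte.pte_low != ptep->pte_low));

	return pte;
}

And second, the MMU_GATHER_RCU_TABLE_FREE logic Peter's timeline leans
on, heavily condensed from mm/mmu_gather.c (a sketch, not the literal
source):

static void tlb_remove_table_smp_sync(void *arg)
{
	/* The IPI itself is the synchronisation; nothing to do. */
}

static void tlb_remove_table_one(void *table)
{
	/*
	 * Memory-pressure fallback: no batch page could be
	 * allocated, so force an IPI to all CPUs.  A GUP-fast walker
	 * has IRQs disabled and thus holds this off until its
	 * page-table walk is finished.
	 */
	smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
	__tlb_remove_table(table);
}

void tlb_remove_table(struct mmu_gather *tlb, void *table)
{
	struct mmu_table_batch **batch = &tlb->batch;

	if (*batch == NULL) {
		*batch = (struct mmu_table_batch *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
		if (*batch == NULL) {
			tlb_remove_table_one(table);
			return;
		}
		(*batch)->nr = 0;
	}

	(*batch)->tables[(*batch)->nr++] = table;
	if ((*batch)->nr == MAX_TABLE_BATCH)
		tlb_table_flush(tlb);	/* frees the batch via call_rcu() */
}

Note that both the IPI fallback and the call_rcu() path only cover page
*tables* freed through tlb_remove_table(); that is exactly why zapping
individual PTEs, which never goes through this path, is the interesting
case here.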