On 9/23/19 12:40 PM, Leonardo Bras wrote:
> On Mon, 2019-09-23 at 11:14 -0700, John Hubbard wrote:
>> On 9/23/19 10:25 AM, Leonardo Bras wrote:
>> [...]
>> That part is all fine, but there are no run-time memory barriers in the
>> atomic_inc() and atomic_dec() additions, which means that this is not
>> safe, because memory operations on CPU 1 can be reordered. It's safe
>> as shown *if* there are memory barriers to keep the order as shown:
>>
>> CPU 0                          CPU 1
>> ------                         --------------
>>                                atomic_inc(val) (no run-time memory barrier!)
>> pmd_clear(pte)
>> if (val)
>>     run_on_all_cpus(): IPI
>>                                local_irq_disable() (also not a mem barrier)
>>
>>                                READ(pte)
>>                                if (pte)
>>                                    walk page tables
>>
>>                                local_irq_enable() (still not a barrier)
>>                                atomic_dec(val)
>>
>> free(pte)
>>
>> thanks,
>
> This is serialize_against_pte_lookup():
>
> void serialize_against_pte_lookup(struct mm_struct *mm)
> {
>         smp_mb();
>         if (running_lockless_pgtbl_walk(mm))
>                 smp_call_function_many(mm_cpumask(mm), do_nothing,
>                                        NULL, 1);
> }
>
> That would mean:
>
> CPU 0                          CPU 1
> ------                         --------------
>                                atomic_inc(val)
> pmd_clear(pte)
> smp_mb()
> if (val)
>     run_on_all_cpus(): IPI
>                                local_irq_disable()
>
>                                READ(pte)
>                                if (pte)
>                                    walk page tables
>
>                                local_irq_enable() (still not a barrier)
>                                atomic_dec(val)
>
> By https://www.kernel.org/doc/Documentation/memory-barriers.txt :
> 'If you need all the CPUs to see a given store at the same time, use
> smp_mb().'
>
> Is it not enough?

Nope. CPU 1's memory accesses can still be reordered at run time, as I
said above:

CPU 0                          CPU 1
------                         --------------
                               READ(pte) (reordered at run time)
                               atomic_inc(val) (no run-time memory barrier!)
pmd_clear(pte)
if (val)
    run_on_all_cpus(): IPI
                               local_irq_disable() (also not a mem barrier)

                               if (pte)
                                   walk page tables
...

> Do you suggest adding 'smp_mb()' after atomic_{inc,dec}?

Yes (approximately: I'd have to look closer to see which barrier call is
really required). Unless there is something else that is providing the
barrier, which is why I called this a pre-existing question: it seems
like the interrupt interlock in the current gup_fast() might not have
what it needs. In other words, if your code needs a barrier, then the
pre-existing gup_fast() code probably does, too.

thanks,
--
John Hubbard
NVIDIA
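
Below is a minimal sketch, in kernel-style C, of the barrier pairing
suggested above: an smp_mb() after the inc and before the dec on the
walker side, pairing with the smp_mb() in serialize_against_pte_lookup().
The helper names begin/end_lockless_pgtbl_walk() and the
lockless_pgtbl_walkers counter field are assumptions for illustration,
not necessarily what the actual patch series uses:

#include <linux/atomic.h>
#include <linux/mm_types.h>
#include <linux/smp.h>

static void do_nothing(void *unused)
{
}

/*
 * Hypothetical walker-side helpers. The smp_mb() after the inc keeps the
 * page table reads from being reordered before the counter store becomes
 * visible; the smp_mb() before the dec keeps those reads from leaking
 * past the decrement.
 */
static inline void begin_lockless_pgtbl_walk(struct mm_struct *mm)
{
	atomic_inc(&mm->lockless_pgtbl_walkers);  /* assumed counter field */
	smp_mb();  /* pairs with smp_mb() in serialize_against_pte_lookup() */
}

static inline void end_lockless_pgtbl_walk(struct mm_struct *mm)
{
	smp_mb();  /* order the page table reads before the dec */
	atomic_dec(&mm->lockless_pgtbl_walkers);
}

/*
 * Serializing side, as quoted above: the smp_mb() orders the counter
 * read after pmd_clear(), so either the walker sees the cleared pmd or
 * this CPU sees the walker's inc and sends the IPI (the classic
 * store-buffering pattern).
 */
void serialize_against_pte_lookup(struct mm_struct *mm)
{
	smp_mb();
	if (atomic_read(&mm->lockless_pgtbl_walkers))
		smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
}

A lighter smp_mb__after_atomic()/smp_mb__before_atomic() pair might be
enough on some architectures, which matches the "I'd have to look closer
to see which barrier call is really required" caveat above.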