On 9/20/19 1:28 PM, Leonardo Bras wrote:
> On Fri, 2019-09-20 at 13:11 -0700, John Hubbard wrote:
>> On 9/20/19 12:50 PM, Leonardo Bras wrote:
>>> Skip the slow part of serialize_against_pte_lookup() if there is no
>>> running lockless pagetable walk.
>>>
>>> Signed-off-by: Leonardo Bras <leonardo@xxxxxxxxxxxxx>
>>> ---
>>>  arch/powerpc/mm/book3s64/pgtable.c | 3 ++-
>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
>>> index 13239b17a22c..41ca30269fa3 100644
>>> --- a/arch/powerpc/mm/book3s64/pgtable.c
>>> +++ b/arch/powerpc/mm/book3s64/pgtable.c
>>> @@ -95,7 +95,8 @@ static void do_nothing(void *unused)
>>>  void serialize_against_pte_lookup(struct mm_struct *mm)
>>>  {
>>>  	smp_mb();
>>> -	smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
>>> +	if (running_lockless_pgtbl_walk(mm))
>>> +		smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
>>
>> Hi,
>>
>> If you do this, then you are left without any synchronization, so it
>> will have race conditions: a page table walk could begin right after
>> the above check returns "false", and then code such as
>> hash__pmdp_huge_get_and_clear() will continue on right away, under
>> the false assumption that it has let all the current page table walks
>> complete.
>>
>> The current code uses either interrupts or RCU to synchronize, and in
>> either case, you end up scheduling something on each CPU. If you
>> remove that entirely, I don't see anything left. ("Pure" atomic
>> counting is not a synchronization technique all by itself.)
>>
>> thanks,
>
> Hello John,
> Thanks for the fast feedback.
>
> See, before calling serialize_against_pte_lookup(), there is always an
> update or clear on the pmd. So, if a page table walk begins right
> after the check returns "false", there is no problem, since it will
> use the updated pmd.
>
> Think about serialize_against_pte_lookup() on a process with a bunch
> of cpus: after you check the last processor (the wait part), there is
> no guarantee that the first one is not already starting a new lockless
> pagetable walk.
>
> The same mechanism protects both methods.
>
> Does it make sense?

Yes, after working through this with Mark Hairgrove, I think I finally
realize that the new code will allow existing gup_fast() readers to
drain before proceeding. So that technically works (ignoring issues
such as whether it's desirable to use this approach vs., for example,
batching the THP updates, etc.), I agree. (And please ignore my other
response that asked whether the counting was helping at all -- I see
now that it does.)

However, Mark pointed out a pre-existing question, which neither of us
could figure out: are the irq disable/enable calls effective, given
that they are (apparently) not memory barriers? Given this claim from
Documentation/memory-barriers.txt:

    INTERRUPT DISABLING FUNCTIONS
    -----------------------------

    Functions that disable interrupts (ACQUIRE equivalent) and enable
    interrupts (RELEASE equivalent) will act as compiler barriers only.
    So if memory or I/O barriers are required in such a situation, they
    must be provided from some other means.
...and given both the preexisting code and the code you've added:

mm/gup.c:

    atomic_inc(readers);       /* new code */
    local_irq_disable();
    gup_pgd_range();
        ...read page tables
    local_irq_enable();
    atomic_dec(readers);       /* new code */

...if the page table reads are allowed to speculatively happen
*outside* of the irq disable/enable calls (which could happen if there
are no run-time memory barriers in the above code), then nothing works
anymore.

So it seems that full memory barriers (not just compiler barriers) are
required. If the irq disable/enable somehow provides that, then your
new code just goes along for the ride and Just Works. (You don't have
any memory barriers in start_lockless_pgtbl_walk() /
end_lockless_pgtbl_walk(), just the compiler barriers provided by the
atomic inc/dec.)

So it's really a pre-existing question about the correctness of the
gup_fast() irq disabling approach.
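For concreteness, here is a minimal sketch of what the counting helpers
could look like with explicit barriers added. The helper names come
from the patch series under discussion; the per-mm counter field
(lockless_pgtbl_walkers) and the exact barrier placement are
illustrative assumptions on my part, not the actual patch:

    /*
     * Illustrative sketch only, not the actual patch. Assumes the mm
     * context has gained an atomic counter, here called
     * lockless_pgtbl_walkers (a hypothetical field name).
     */
    static inline void start_lockless_pgtbl_walk(struct mm_struct *mm)
    {
            atomic_inc(&mm->context.lockless_pgtbl_walkers);
            /*
             * Full barrier: make the increment visible to the writer
             * side before any page table reads are issued, so the
             * reads cannot be speculated ahead of it.
             */
            smp_mb__after_atomic();
    }

    static inline void end_lockless_pgtbl_walk(struct mm_struct *mm)
    {
            /*
             * Full barrier: all page table reads must complete before
             * the writer side can observe the counter dropping back
             * to zero.
             */
            smp_mb__before_atomic();
            atomic_dec(&mm->context.lockless_pgtbl_walkers);
    }

    static inline bool running_lockless_pgtbl_walk(struct mm_struct *mm)
    {
            return atomic_read(&mm->context.lockless_pgtbl_walkers) != 0;
    }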
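And a sketch of how that reader side would then pair with the smp_mb()
at the top of serialize_against_pte_lookup() on the writer side --
again an illustration, not the real mm/gup.c code:

    /*
     * Illustrative reader side, mirroring the gup_fast() flow above.
     * The explicit barriers live inside the start/end helpers from
     * the previous sketch.
     */
    static int gup_fast_walk_example(struct mm_struct *mm,
                                     unsigned long addr,
                                     unsigned long end)
    {
            unsigned long flags;
            int nr = 0;

            start_lockless_pgtbl_walk(mm);  /* inc, then full barrier */
            local_irq_save(flags);
            /* gup_pgd_range(): ...read page tables... */
            local_irq_restore(flags);
            end_lockless_pgtbl_walk(mm);    /* full barrier, then dec */

            /*
             * This is the classic store-buffering pattern: the writer
             * stores the pmd, does smp_mb(), then loads the counter;
             * the reader stores (increments) the counter, does a full
             * barrier, then loads the pmd. At least one side must see
             * the other's store, so either the writer observes the
             * increment (and IPIs the mm's CPUs, draining the walk)
             * or the reader sees the updated pmd.
             */
            return nr;
    }

+CC linux-mm

thanks,
--
John Hubbard
NVIDIA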