> On May 13, 2019, at 9:37 AM, Will Deacon <will.deacon@xxxxxxx> wrote:
> 
> On Mon, May 13, 2019 at 09:11:38AM +0000, Nadav Amit wrote:
>>> On May 13, 2019, at 1:36 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>>> 
>>> On Thu, May 09, 2019 at 09:21:35PM +0000, Nadav Amit wrote:
>>> 
>>>>>>> And we can fix that by having tlb_finish_mmu() sync up. Never let a
>>>>>>> concurrent tlb_finish_mmu() complete until all concurrent mmu_gathers
>>>>>>> have completed.
>>>>>>> 
>>>>>>> This should not be too hard to make happen.
>>>>>> 
>>>>>> This synchronization sounds much more expensive than what I proposed. But I
>>>>>> agree that cache-lines that move from one CPU to another might become an
>>>>>> issue. But I think that the scheme I suggested would minimize this overhead.
>>>>> 
>>>>> Well, it would have a lot more unconditional atomic ops. My scheme only
>>>>> waits when there is actual concurrency.
>>>> 
>>>> Well, something has to give. I didn’t think that if the same core does the
>>>> atomic op it would be too expensive.
>>> 
>>> They're still at least 20 cycles a pop, uncontended.
>>> 
>>>>> I _think_ something like the below ought to work, but it's not even been
>>>>> near a compiler. The only problem is the unconditional wakeup; we can
>>>>> play games to avoid that if we want to continue with this.
>>>>> 
>>>>> Ideally we'd only do this when there's been actual overlap, but I've not
>>>>> found a sensible way to detect that.
>>>>> 
>>>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>>>> index 4ef4bbe78a1d..b70e35792d29 100644
>>>>> --- a/include/linux/mm_types.h
>>>>> +++ b/include/linux/mm_types.h
>>>>> @@ -590,7 +590,12 @@ static inline void dec_tlb_flush_pending(struct mm_struct *mm)
>>>>> 	 *
>>>>> 	 * Therefore we must rely on tlb_flush_*() to guarantee order.
>>>>> 	 */
>>>>> -	atomic_dec(&mm->tlb_flush_pending);
>>>>> +	if (atomic_dec_and_test(&mm->tlb_flush_pending)) {
>>>>> +		wake_up_var(&mm->tlb_flush_pending);
>>>>> +	} else {
>>>>> +		wait_event_var(&mm->tlb_flush_pending,
>>>>> +			       !atomic_read_acquire(&mm->tlb_flush_pending));
>>>>> +	}
>>>>> }
>>>> 
>>>> It still seems very expensive to me, at least for certain workloads (e.g.,
>>>> Apache with multithreaded MPM).
>>> 
>>> Is that Apache-MPM workload triggering this a lot? Having a known
>>> benchmark for this stuff is good for when someone has time to play with
>>> things.
>> 
>> Setting Apache2 with mpm_worker causes every request to go through an
>> mmap-writev-munmap flow on every thread. I didn’t run this workload after
>> the patches that downgrade the mmap_sem to read before the page-table
>> zapping were introduced. I presume these patches would allow the page-table
>> zapping to be done concurrently, and therefore would hit this flow.
> 
> Hmm, I don't think so: munmap() still has to take the semaphore for write
> initially, so it will be serialised against other munmap() threads even
> after they've downgraded afaict.
> 
> The initial bug report was about concurrent madvise() vs munmap().

I guess you are right (and I’m wrong). A short search suggests that ebizzy
might be affected (a thread by Mel Gorman):

https://lkml.org/lkml/2015/2/2/493
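
(Appendix, mostly to check my own understanding: below is a rough userspace
analogue of the wake/wait scheme quoted above, using a pthread condition
variable in place of wake_up_var()/wait_event_var(). It is untested, and the
names here (pending_gathers, gather_begin(), gather_end()) are made up for
illustration only; this is a sketch of the idea, not the actual kernel change.)

/*
 * Userspace sketch of the scheme above: every "gather" bumps a pending
 * counter on entry; on exit, the last finisher wakes everybody, while
 * earlier finishers wait until all concurrent gathers have completed.
 * Build with: cc -pthread sketch.c
 */
#include <pthread.h>

static pthread_mutex_t gather_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  gathers_done = PTHREAD_COND_INITIALIZER;
static int pending_gathers;

static void gather_begin(void)		/* inc_tlb_flush_pending() analogue */
{
	pthread_mutex_lock(&gather_lock);
	pending_gathers++;
	pthread_mutex_unlock(&gather_lock);
}

static void gather_end(void)		/* dec_tlb_flush_pending() analogue */
{
	pthread_mutex_lock(&gather_lock);
	if (--pending_gathers == 0) {
		/* Last one out: wake anyone still waiting. */
		pthread_cond_broadcast(&gathers_done);
	} else {
		/* Not last: wait until the remaining gathers finish too. */
		while (pending_gathers != 0)
			pthread_cond_wait(&gathers_done, &gather_lock);
	}
	pthread_mutex_unlock(&gather_lock);
}

static void *worker(void *arg)
{
	(void)arg;
	gather_begin();
	/* ... unmap/zap work would go here ... */
	gather_end();
	return NULL;
}

int main(void)
{
	pthread_t t[4];
	for (int i = 0; i < 4; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (int i = 0; i < 4; i++)
		pthread_join(t[i], NULL);
	return 0;
}

The part that seems attractive is that only the non-last finishers block, so
nothing extra happens unless gathers actually overlap, which I think matches
the "only waits when there is actual concurrency" point above; the remaining
cost is the wakeup and the shared counter bouncing between CPUs, as discussed.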