On Tue, Jul 30, 2019 at 12:42:07PM -0700, Andrew Morton wrote: > On Mon, 29 Jul 2019 17:20:52 +0900 Minchan Kim <minchan@xxxxxxxxxx> wrote: > > > > > @@ -1022,7 +1023,16 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, > > > > flush_tlb_batched_pending(mm); > > > > arch_enter_lazy_mmu_mode(); > > > > do { > > > > - pte_t ptent = *pte; > > > > + pte_t ptent; > > > > + > > > > + if (progress >= 32) { > > > > + progress = 0; > > > > + if (need_resched()) > > > > + break; > > > > + } > > > > + progress += 8; > > > > > > Why 8? > > > > Just copied from copy_pte_range. > > copy_pte_range() does > > if (pte_none(*src_pte)) { > progress++; > continue; > } > entry.val = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, > vma, addr, rss); > if (entry.val) > break; > progress += 8; > > which appears to be an attempt to balance the cost of copy_one_pte() > against the cost of not calling copy_one_pte(). > Indeed. > Your code doesn't do this balancing and hence can be simpler. Based on the balancing code of copy_one_pte, it seems we should balance it with cost of mark_page_accessed against the cost of not calling mark_page_accessed. IOW, add up 8 only when mark_page_accessed is called. However, every mark_page_accessed is not heavy since it uses pagevec and caller couldn't know whether the target page will be activated or just have PG_referenced which is cheap. Thus, I agree, do not make it complicated. > > It all seems a bit overdesigned. need_resched() is cheap. It's > possibly a mistake to check need_resched() on *every* loop because some > crazy scheduling load might livelock us. But surely it would be enough > to do something like > > if (progress++ && need_resched()) { > <reschedule> > progress = 0; > } > > and leave it at that? Seems like this? >From bb1d7aaf520e98a6f9d988c25121602c28e12e67 Mon Sep 17 00:00:00 2001 From: Minchan Kim <minchan@xxxxxxxxxx> Date: Mon, 29 Jul 2019 15:28:48 +0900 Subject: [PATCH] mm: release the spinlock on zap_pte_range In our testing(carmera recording), Miguel and Wei found unmap_page_range takes above 6ms with preemption disabled easily. When I see that, the reason is it holds page table spinlock during entire 512 page operation in a PMD. 6.2ms is never trivial for user experince if RT task couldn't run in the time because it could make frame drop or glitch audio problem. I had a time to benchmark it via adding some trace_printk hooks between pte_offset_map_lock and pte_unmap_unlock in zap_pte_range. The testing device is 2018 premium mobile device. I can get 2ms delay rather easily to release 2M(ie, 512 pages) when the task runs on little core even though it doesn't have any IPI and LRU lock contention. It's already too heavy. If I remove activate_page, 35-40% overhead of zap_pte_range is gone so most of overhead(about 0.7ms) comes from activate_page via mark_page_accessed. Thus, if there are LRU contention, that 0.7ms could accumulate up to several ms. Thus, this patch adds preemption point for once every 32 times in the loop. Reported-by: Miguel de Dios <migueldedios@xxxxxxxxxx> Reported-by: Wei Wang <wvw@xxxxxxxxxx> Cc: Michal Hocko <mhocko@xxxxxxxxxx> Cc: Johannes Weiner <hannes@xxxxxxxxxxx> Cc: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> Signed-off-by: Minchan Kim <minchan@xxxxxxxxxx> --- mm/memory.c | 18 +++++++++++++++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 2e796372927fd..8bfcef09da674 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1007,6 +1007,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, struct zap_details *details) { struct mm_struct *mm = tlb->mm; + int progress = 0; int force_flush = 0; int rss[NR_MM_COUNTERS]; spinlock_t *ptl; @@ -1022,7 +1023,15 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, flush_tlb_batched_pending(mm); arch_enter_lazy_mmu_mode(); do { - pte_t ptent = *pte; + pte_t ptent; + + if (progress++ >= 32) { + progress = 0; + if (need_resched()) + break; + } + + ptent = *pte; if (pte_none(ptent)) continue; @@ -1123,8 +1132,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, if (force_flush) { force_flush = 0; tlb_flush_mmu(tlb); - if (addr != end) - goto again; + } + + if (addr != end) { + progress = 0; + goto again; } return addr; -- 2.22.0.709.g102302147b-goog