Re: [PATCH 2/2 v4]mm: batch activate_page() to reduce lock contention

Shaohua Li <shaohua.li@xxxxxxxxx> · Tue, 15 Mar 2011 09:53:34 +0800



On Mon, 2011-03-14 at 22:45 +0800, Minchan Kim wrote:
> On Thu, Mar 10, 2011 at 01:30:19PM +0800, Shaohua Li wrote:
> > The zone->lru_lock is heavily contended in workloads where activate_page()
> > is called frequently. We can batch activate_page() calls to reduce the lock
> > contention. The batched pages are added to the zone list when the pool
> > is full or when page reclaim drains them.
> > 
> > For example, on a 4-socket, 64-CPU system, create a sparse file and 64 processes
> > that each map the file shared. Each process reads the whole file and
> > then exits. Process exit calls unmap_vmas(), which causes a lot of
> > activate_page() calls. In this workload, we saw about a 58% reduction in total
> > time with the patch below. Other workloads with a lot of activate_page() calls
> > also benefit.
> > 
> > Andrew Morton suggested activate_page() and putback_lru_pages() should
> > follow the same path to activate pages, but this is hard to implement (see commit
> > 7a608572a282a). On the other hand, do we really need putback_lru_pages() to
> > follow the same path? I ran several FIO/FFSB benchmarks (about 20 scripts for
> > each benchmark) on 3 machines here, from 2 sockets to 4 sockets. My tests don't
> > show anything significant with/without the patch below (there are slight
> > differences, but mostly noise of the kind we saw even before this patch). The
> > patch below basically returns to my first post.
> > 
> > I tested some microbenchmarks:
> > case-anon-cow-rand-mt           0.58%
> > case-anon-cow-rand             -3.30%
> > case-anon-cow-seq-mt           -0.51%
> > case-anon-cow-seq              -5.68%
> > case-anon-r-rand-mt             0.23%
> > case-anon-r-rand                0.81%
> > case-anon-r-seq-mt             -0.71%
> > case-anon-r-seq                -1.99%
> > case-anon-rx-rand-mt            2.11%
> > case-anon-rx-seq-mt             3.46%
> > case-anon-w-rand-mt            -0.03%
> > case-anon-w-rand               -0.50%
> > case-anon-w-seq-mt             -1.08%
> > case-anon-w-seq                -0.12%
> > case-anon-wx-rand-mt           -5.02%
> > case-anon-wx-seq-mt            -1.43%
> > case-fork                       1.65%
> > case-fork-sleep                -0.07%
> > case-fork-withmem               1.39%
> > case-hugetlb                   -0.59%
> > case-lru-file-mmap-read-mt     -0.54%
> > case-lru-file-mmap-read         0.61%
> > case-lru-file-mmap-read-rand   -2.24%
> > case-lru-file-readonce         -0.64%
> > case-lru-file-readtwice       -11.69%
> > case-lru-memcg                 -1.35%
> > case-mmap-pread-rand-mt         1.88%
> > case-mmap-pread-rand          -15.26%
> > case-mmap-pread-seq-mt          0.89%
> > case-mmap-pread-seq           -69.72%
> > case-mmap-xread-rand-mt         0.71%
> > case-mmap-xread-seq-mt          0.38%
> > 
> > The most significant are:
> > case-lru-file-readtwice       -11.69%
> > case-mmap-pread-rand          -15.26%
> > case-mmap-pread-seq           -69.72%
> > 
> > which use activate_page() a lot. The others are basically noise, because
> > each run differs slightly.
> > 
> > In UP case, 'size mm/swap.o'
> > before the two patches:
> >    text    data     bss     dec     hex filename
> >    6466     896       4    7366    1cc6 mm/swap.o
> > after the two patches:
> >    text    data     bss     dec     hex filename
> >    6343     896       4    7243    1c4b mm/swap.o
> > 
> > Signed-off-by: Shaohua Li <shaohua.li@xxxxxxxxx>
> > 
> > ---
> >  mm/swap.c |   45 ++++++++++++++++++++++++++++++++++++++++-----
> >  1 file changed, 40 insertions(+), 5 deletions(-)
> > 
> > Index: linux/mm/swap.c
> > ===================================================================
> > --- linux.orig/mm/swap.c	2011-03-09 12:56:09.000000000 +0800
> > +++ linux/mm/swap.c	2011-03-09 12:56:46.000000000 +0800
> > @@ -272,14 +272,10 @@ static void update_page_reclaim_stat(str
> >  		memcg_reclaim_stat->recent_rotated[file]++;
> >  }
> >  
> > -/*
> > - * FIXME: speed this up?
> > - */
> > -void activate_page(struct page *page)
> > +static void __activate_page(struct page *page, void *arg)
> >  {
> >  	struct zone *zone = page_zone(page);
> >  
> > -	spin_lock_irq(&zone->lru_lock);
> >  	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
> >  		int file = page_is_file_cache(page);
> >  		int lru = page_lru_base_type(page);
> > @@ -292,8 +288,45 @@ void activate_page(struct page *page)
> >  
> >  		update_page_reclaim_stat(zone, page, file, 1);
> >  	}
> > +}
> > +
> > +#ifdef CONFIG_SMP
> > +static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
> > +
> > +static void activate_page_drain(int cpu)
> > +{
> > +	struct pagevec *pvec = &per_cpu(activate_page_pvecs, cpu);
> > +
> > +	if (pagevec_count(pvec))
> > +		pagevec_lru_move_fn(pvec, __activate_page, NULL);
> > +}
> > +
> > +void activate_page(struct page *page)
> > +{
> > +	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
> > +		struct pagevec *pvec = &get_cpu_var(activate_page_pvecs);
> > +
> > +		page_cache_get(page);
> > +		if (!pagevec_add(pvec, page))
> > +			pagevec_lru_move_fn(pvec, __activate_page, NULL);
> > +		put_cpu_var(activate_page_pvecs);
> > +	}
> > +}
> > +
> > +#else
> > +static inline void activate_page_drain(int cpu)
> > +{
> > +}
> > +
> > +void activate_page(struct page *page)
> > +{
> > +	struct zone *zone = page_zone(page);
> > +
> > +	spin_lock_irq(&zone->lru_lock);
> > +	__activate_page(page, NULL);
> >  	spin_unlock_irq(&zone->lru_lock);
> >  }
> > +#endif
>  
> Why do we need CONFIG_SMP only for activate_page_pvecs?
> Does the per-cpu activate_page_pvecs consume lots of memory on UP?
> I don't think so. But if it does consume lots of memory, that's a problem
> of per-cpu. 
No, not too much memory.

> I can't understand why we should handle activate_page_pvecs specially.
> Please enlighten me. 
It's not special. akpm asked me to do it this time. Saving a little
memory is still worthwhile anyway, so that's it. We can do it for the
other pvecs too, in a separate patch.

Thanks,
Shaohua
