Re: [PATCH] mm: cache largest vma

Davidlohr Bueso <davidlohr@xxxxxx> · Sun, 03 Nov 2013 20:20:10 -0800

On Sun, 2013-11-03 at 11:12 +0100, Ingo Molnar wrote:
> * Davidlohr Bueso <davidlohr@xxxxxx> wrote:
> 
> > While caching the last used vma already does a nice job avoiding
> > having to iterate the rbtree in find_vma, we can improve. After
> > studying the hit rate on a load of workloads and environments,
> > it was seen that it was around 45-50% - constant for a standard
> > desktop system (gnome3 + evolution + firefox + a few xterms),
> > and multiple java related workloads (including Hadoop/terasort),
> > and aim7, which indicates it's better than the 35% value documented
> > in the code.
> > 
> > By also caching the largest vma, that is, the one that contains
> > most addresses, there is a steady 10-15% hit rate gain, putting
> > it above the 60% region. This improvement comes at a very low
> > overhead for a miss. Furthermore, systems with !CONFIG_MMU keep
> > the current logic.
> > 
> > This patch introduces a second mmap_cache pointer, which is just
> > as racy as the first, but as we already know, doesn't matter in
> > this context. For documentation purposes, I have also added the
> > ACCESS_ONCE() around mm->mmap_cache updates, keeping it consistent
> > with the reads.
> > 
> > Cc: Hugh Dickins <hughd@xxxxxxxxxx>
> > Cc: Michel Lespinasse <walken@xxxxxxxxxx>
> > Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> > Cc: Mel Gorman <mgorman@xxxxxxx>
> > Cc: Rik van Riel <riel@xxxxxxxxxx>
> > Cc: Guan Xuetao <gxt@xxxxxxxxxxxxxxx>
> > Signed-off-by: Davidlohr Bueso <davidlohr@xxxxxx>
> > ---
> > Please note that nommu and unicore32 arch are *untested*.
> > 
> > I also have a patch on top of this one that caches the most 
> > used vma, which adds another 8-10% hit rate gain, However,
> > since it does add a counter to the vma structure and we have
> > to do more logic in find_vma to keep track, I was hesitant about
> > the overhead. If folks are interested I can send that out as well.
> 
> Would be interesting to see.
> 
> Btw., roughly how many cycles/instructions do we save by increasing the 
> hit rate, in the typical case (for example during a kernel build)?

Good point. The IPC from perf stat doesn't show any difference with or
without the patch -- note that this is probably the least interesting
one as we already get a really nice hit rate with the single mmap_cache.
I have yet to try it on the other workloads.

> 
> That would be important to measure, so that we can get a ballpark figure 
> for the cost/benefit equation.
> 
> >  Documentation/vm/locking                 |  4 +-
> >  arch/unicore32/include/asm/mmu_context.h |  2 +-
> >  include/linux/mm.h                       | 13 ++++++
> >  include/linux/mm_types.h                 | 15 ++++++-
> >  kernel/debug/debug_core.c                | 17 +++++++-
> >  kernel/fork.c                            |  2 +-
> >  mm/mmap.c                                | 68 ++++++++++++++++++++------------
> >  7 files changed, 87 insertions(+), 34 deletions(-)
> > 
> > diff --git a/Documentation/vm/locking b/Documentation/vm/locking
> > index f61228b..b4e8154 100644
> > --- a/Documentation/vm/locking
> > +++ b/Documentation/vm/locking
> > @@ -42,8 +42,8 @@ The rules are:
> >     for mm B.
> >  
> >  The caveats are:
> > -1. find_vma() makes use of, and updates, the mmap_cache pointer hint.
> > -The update of mmap_cache is racy (page stealer can race with other code
> > +1. find_vma() makes use of, and updates, the mmap_cache pointers hint.
> > +The updates of mmap_cache is racy (page stealer can race with other code
> >  that invokes find_vma with mmap_sem held), but that is okay, since it 
> >  is a hint. This can be fixed, if desired, by having find_vma grab the
> >  page_table_lock.
> > diff --git a/arch/unicore32/include/asm/mmu_context.h b/arch/unicore32/include/asm/mmu_context.h
> > index fb5e4c6..38cc7fc 100644
> > --- a/arch/unicore32/include/asm/mmu_context.h
> > +++ b/arch/unicore32/include/asm/mmu_context.h
> > @@ -73,7 +73,7 @@ do { \
> >  		else \
> >  			mm->mmap = NULL; \
> >  		rb_erase(&high_vma->vm_rb, &mm->mm_rb); \
> > -		mm->mmap_cache = NULL; \
> > +		vma_clear_caches(mm);			\
> >  		mm->map_count--; \
> >  		remove_vma(high_vma); \
> >  	} \
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 8b6e55e..2c0f8ed 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1534,8 +1534,21 @@ static inline void mm_populate(unsigned long addr, unsigned long len)
> >  	/* Ignore errors */
> >  	(void) __mm_populate(addr, len, 1);
> >  }
> > +
> > +static inline void vma_clear_caches(struct mm_struct *mm)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < NR_VMA_CACHES; i++)
> > +		mm->mmap_cache[i] = NULL;
> 
> Just curious: does GCC manage to open-code this as two stores of NULL?
> 
> > +}
> >  #else
> >  static inline void mm_populate(unsigned long addr, unsigned long len) {}
> > +
> > +static inline void vma_clear_caches(struct mm_struct *mm)
> 1> +{
> > +	mm->mmap_cache = NULL;
> > +}
> >  #endif
> >  
> >  /* These take the mm semaphore themselves */
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index d9851ee..7f92835 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -322,12 +322,23 @@ struct mm_rss_stat {
> >  	atomic_long_t count[NR_MM_COUNTERS];
> >  };
> >  
> > +
> > +#ifdef CONFIG_MMU
> > +enum {
> > +	VMA_LAST_USED, /* last find_vma result */
> > +	VMA_LARGEST,   /* vma that contains most address */
> > +	NR_VMA_CACHES
> > +};
> > +#endif
> > +
> >  struct kioctx_table;
> >  struct mm_struct {
> >  	struct vm_area_struct * mmap;		/* list of VMAs */
> >  	struct rb_root mm_rb;
> > -	struct vm_area_struct * mmap_cache;	/* last find_vma result */
> > -#ifdef CONFIG_MMU
> > +#ifndef CONFIG_MMU
> > +	struct vm_area_struct *mmap_cache;      /* last find_vma result */
> > +#else
> > +	struct vm_area_struct *mmap_cache[NR_VMA_CACHES];
> 
> I think the CONFIG_MMU assymetry in the data structure is rather ugly.
> 
> Why not make it a single-entry enum in the !CONFIG_MMU case? To the 
> compiler a single-entry array should be the same as a pointer field.
> 
> That would eliminate most of the related #ifdefs AFAICS.

Yes that's a lot better.

> 
> >  	unsigned long (*get_unmapped_area) (struct file *filp,
> >  				unsigned long addr, unsigned long len,
> >  				unsigned long pgoff, unsigned long flags);
> > diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c
> > index 0506d44..d9d72e4 100644
> > --- a/kernel/debug/debug_core.c
> > +++ b/kernel/debug/debug_core.c
> > @@ -221,13 +221,26 @@ int __weak kgdb_skipexception(int exception, struct pt_regs *regs)
> >   */
> >  static void kgdb_flush_swbreak_addr(unsigned long addr)
> >  {
> > +	struct mm_struct *mm = current->mm;
> >  	if (!CACHE_FLUSH_IS_SAFE)
> >  		return;
> >  
> > -	if (current->mm && current->mm->mmap_cache) {
> > -		flush_cache_range(current->mm->mmap_cache,
> > +#ifdef CONFIG_MMU
> > +	if (mm) {
> > +		int i;
> > +
> > +		for (i = 0; i < NR_VMA_CACHES; i++)
> > +			if (mm->mmap_cache[i])
> > +				flush_cache_range(mm->mmap_cache[i],
> > +						  addr,
> > +						  addr + BREAK_INSTR_SIZE);
> 
> (Nit: please use curly braces for for such multi-line statements.)
> 
> > +	}
> > +#else
> > +	if (mm && mm->mmap_cache) {
> > +		flush_cache_range(mm->mmap_cache,
> >  				  addr, addr + BREAK_INSTR_SIZE);
> >  	}
> > +#endif
> 
> Btw., this #ifdef would be unified with my suggested data structure 
> variant as well.
> 
> >  	/* Force flush instruction cache if it was outside the mm */
> >  	flush_icache_range(addr, addr + BREAK_INSTR_SIZE);
> >  }
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 086fe73..7b92666 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -363,8 +363,8 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
> >  
> >  	mm->locked_vm = 0;
> >  	mm->mmap = NULL;
> > -	mm->mmap_cache = NULL;
> >  	mm->map_count = 0;
> > +	vma_clear_caches(mm);
> >  	cpumask_clear(mm_cpumask(mm));
> >  	mm->mm_rb = RB_ROOT;
> >  	rb_link = &mm->mm_rb.rb_node;
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 9d54851..29c3fc0 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -676,14 +676,17 @@ static inline void
> >  __vma_unlink(struct mm_struct *mm, struct vm_area_struct *vma,
> >  		struct vm_area_struct *prev)
> >  {
> > +	int i;
> >  	struct vm_area_struct *next;
> >  
> >  	vma_rb_erase(vma, &mm->mm_rb);
> >  	prev->vm_next = next = vma->vm_next;
> >  	if (next)
> >  		next->vm_prev = prev;
> > -	if (mm->mmap_cache == vma)
> > -		mm->mmap_cache = prev;
> > +
> > +	for (i = 0; i < NR_VMA_CACHES; i++)
> > +		if (mm->mmap_cache[i] == vma)
> > +			mm->mmap_cache[i] = prev;
> 
> (Nit: missing curly braces.)
> 
> Also, I don't think setting the cache value back to 'prev' is valid in the 
> VMA_LARGEST case. The likelihood that it's the second largest VMA is 
> remote.
> 
> The right action here would be to set it to NULL.
> 
> For VMA_LAST_USED setting it to 'prev' seems justified.
> 
> >  }
> >  
> >  /*
> > @@ -1972,34 +1975,47 @@ EXPORT_SYMBOL(get_unmapped_area);
> >  /* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
> >  struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
> >  {
> > +	unsigned long currlen = 0;
> 
> (Nit: I don't think 'currlen' really explains the role of the variable. 
> 'max_len' would be better?)
> 
> > +	struct rb_node *rb_node;
> >  	struct vm_area_struct *vma = NULL;
> >  
> > -	/* Check the cache first. */
> > -	/* (Cache hit rate is typically around 35%.) */
> > -	vma = ACCESS_ONCE(mm->mmap_cache);
> > -	if (!(vma && vma->vm_end > addr && vma->vm_start <= addr)) {
> > -		struct rb_node *rb_node;
> > +	/* Check the cache first */
> > +	vma = ACCESS_ONCE(mm->mmap_cache[VMA_LAST_USED]);
> > +	if (vma && vma->vm_end > addr && vma->vm_start <= addr)
> > +		goto ret;
> >  
> > -		rb_node = mm->mm_rb.rb_node;
> > -		vma = NULL;
> > +	vma = ACCESS_ONCE(mm->mmap_cache[VMA_LARGEST]);
> > +	if (vma) {
> > +		if (vma->vm_end > addr && vma->vm_start <= addr)
> > +			goto ret;
> > +		currlen = vma->vm_end - vma->vm_start;
> > +	}
> >  
> > -		while (rb_node) {
> > -			struct vm_area_struct *vma_tmp;
> > -
> > -			vma_tmp = rb_entry(rb_node,
> > -					   struct vm_area_struct, vm_rb);
> > -
> > -			if (vma_tmp->vm_end > addr) {
> > -				vma = vma_tmp;
> > -				if (vma_tmp->vm_start <= addr)
> > -					break;
> > -				rb_node = rb_node->rb_left;
> > -			} else
> > -				rb_node = rb_node->rb_right;
> > -		}
> > -		if (vma)
> > -			mm->mmap_cache = vma;
> > +	/* Bad cache! iterate rbtree */
> 
> (Nit: the cache is not 'bad', we just didn't hit it.)
> 
> > +	rb_node = mm->mm_rb.rb_node;
> > +	vma = NULL;
> > +
> > +	while (rb_node) {
> > +		struct vm_area_struct *vma_tmp;
> > +
> > +		vma_tmp = rb_entry(rb_node,
> > +				   struct vm_area_struct, vm_rb);
> 
> (Nit: in such cases a single, slightly-longer-than-80col line is IMHO a 
> better solution than such an artificial line-break.)
> 
> > +
> > +		if (vma_tmp->vm_end > addr) {
> > +			vma = vma_tmp;
> > +			if (vma_tmp->vm_start <= addr)
> > +				break;
> > +			rb_node = rb_node->rb_left;
> > +		} else
> > +			rb_node = rb_node->rb_right;
> 
> (Nit: unbalanced curly braces.)
> 
> > +	}
> > +
> > +	if (vma) {
> > +		ACCESS_ONCE(mm->mmap_cache[VMA_LAST_USED]) = vma;
> > +		if (vma->vm_end - vma->vm_start > currlen)
> > +			ACCESS_ONCE(mm->mmap_cache[VMA_LARGEST]) = vma;
> 
> Would it make sense to not update VMA_LAST_USED if VMA_LARGEST is set?
> 
> This would have the advantage of increasing the cache size to two, for the 
> common case where there's two vmas used most of the time.
> 
> To maximize the hit rate in the general case what we basically want to 
> have is an LRU cache, weighted by vma size.
> 
> Maybe by expressing it all in that fashion and looking at the hit rate at 
> 1, 2, 3 and 4 entries would give us equivalent (or better!) behavior than 
> your open-coded variant, with a better idea about how to size it 
> precisely.
> 
> Note that that approach would get rid of the VMA_LAST_USED/VMA_LARGEST 
> distinction in a natural fashion.
> 
> Obviously, if the LRU logic gets too complex then it probably won't bring 
> us any benefits compared to a primitive front-entry cache, so all this is 
> a delicate balance ... hence my previous question about 
> cycles/instructions saved by hitting the cache.

Will try this, thanks for the suggestions.

Btw, do you suggest using a high level tool such as perf for getting
this data or sprinkling get_cycles() in find_vma() -- I'd think that the
first isn't fine grained enough, while the later will probably variate a
lot from run to run but the ratio should be rather constant.

Thanks,
Davidlohr

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>