On Mon, Jun 11, 2012 at 06:46:09AM -0400, KOSAKI Motohiro wrote:
> (6/11/12 6:42 AM), Robin Dong wrote:
> > From: Robin Dong <sanbai@xxxxxxxxxx>
> >
> > When writing a new file with a 2048-byte buffer, e.g. write(fd, buffer, 2048),
> > generic_perform_write() is called twice for every page:
> >
> > write_begin
> > mark_page_accessed(page)
> > write_end
> >
> > write_begin
> > mark_page_accessed(page)
> > write_end
> >
> > Pages 1-13 are added to lru_add_pvecs in write_begin() and will *NOT* be moved
> > to the active list even though they have been accessed twice, because they are
> > not PageLRU(page). But when the 14th page comes, all pages in lru_add_pvecs are
> > moved to the inactive list (by __lru_cache_add()) in the first write_begin(),
> > so the 14th page *is* PageLRU(page). After the second write_end(), only the
> > 14th page ends up on the active list.
> >
> > In our Hadoop environment we do run into this situation: after writing a file,
> > we find that only the 14th, 28th, 42nd... pages are on the active list and the
> > others are on the inactive list. When kswapd runs and shrinks the inactive
> > list, the file keeps only its 14th, 28th... pages in memory, the readahead
> > request size is broken down to only 52k (13*4k), and system performance falls
> > dramatically.
> >
> > The problem can also be reproduced with the following steps (on a machine with
> > 8G of memory):
> >
> > 1. dd if=/dev/zero of=/test/file.out bs=1024 count=1048576
> > 2. cat another 7.5G file to /dev/null
> > 3. vmtouch -m 1G -v /test/file.out, which shows:
> >
> > /test/file.out
> > [oooooooooooooooooooOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO] 187847/262144
> >
> > where 'o' means some pages are in memory and some are not.
> >
> > The fix is simple: the 14th page should be added to lru_add_pvecs before
> > mark_page_accessed(), just like the other pages.
> >
> > Signed-off-by: Robin Dong <sanbai@xxxxxxxxxx>
> > Reviewed-by: Minchan Kim <minchan@xxxxxxxxxx>
> > ---
> >  mm/swap.c | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 4e7e2ec..08e83ad 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -394,13 +394,19 @@ void mark_page_accessed(struct page *page)
> >  }
> >  EXPORT_SYMBOL(mark_page_accessed);
> >
> > +/*
> > + * Check pagevec space before adding new page into as
> > + * it will prevent ununiform page status in
> > + * mark_page_accessed() after __lru_cache_add()
> > + */
> >  void __lru_cache_add(struct page *page, enum lru_list lru)
> >  {
> >  	struct pagevec *pvec = &get_cpu_var(lru_add_pvecs)[lru];
> >
> >  	page_cache_get(page);
> > -	if (!pagevec_add(pvec, page))
> > +	if (!pagevec_space(pvec))
> >  		__pagevec_lru_add(pvec, lru);
> > +	pagevec_add(pvec, page);
> >  	put_cpu_var(lru_add_pvecs);
> >  }
> >  EXPORT_SYMBOL(__lru_cache_add);

I agree with the patch, but I'm not too fond of the comment.  Would this
be better perhaps?

"Order of operations is important: flush the pagevec when it's already
full, not when adding the last page, to make sure that the last page is
not added to the LRU directly when passed to this function.  Because
mark_page_accessed() (called after this when writing) only activates
pages that are on the LRU, linear writes in subpage chunks would see
every PAGEVEC_SIZE'th page activated, which is unexpected."
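FWIW, if it helps to see the effect in isolation, here is a rough userspace
toy model (not kernel code) of the two orderings.  PAGEVEC_SIZE, the flags
and the one-line approximation of mark_page_accessed() below are simplified
stand-ins for the real data structures and logic, and the loop mimics a
linear write in 2048-byte chunks: one lru_cache_add plus two accesses per
4k page.

/*
 * Toy userspace model of the pagevec flush ordering in __lru_cache_add().
 * Everything here is a simplified stand-in, not the kernel implementation.
 */
#include <stdbool.h>
#include <stdio.h>

#define PAGEVEC_SIZE	14
#define NR_PAGES	28

struct page {
	bool on_lru;		/* stand-in for PageLRU()        */
	bool referenced;	/* stand-in for PageReferenced() */
	bool active;		/* stand-in for PageActive()     */
};

struct pagevec {
	int nr;
	struct page *pages[PAGEVEC_SIZE];
};

/* Drain the pagevec: every held page goes onto the (inactive) LRU. */
static void pagevec_lru_add(struct pagevec *pvec)
{
	for (int i = 0; i < pvec->nr; i++)
		pvec->pages[i]->on_lru = true;
	pvec->nr = 0;
}

/* Old ordering: add first, drain (including this page) once full. */
static void lru_cache_add_old(struct pagevec *pvec, struct page *page)
{
	pvec->pages[pvec->nr++] = page;
	if (pvec->nr == PAGEVEC_SIZE)
		pagevec_lru_add(pvec);
}

/* New ordering: drain a full pagevec first, then add, so the page just
 * passed in never hits the LRU directly. */
static void lru_cache_add_new(struct pagevec *pvec, struct page *page)
{
	if (pvec->nr == PAGEVEC_SIZE)
		pagevec_lru_add(pvec);
	pvec->pages[pvec->nr++] = page;
}

/* Crude approximation of mark_page_accessed(): a second access only
 * activates a page that is already on the LRU. */
static void mark_page_accessed(struct page *page)
{
	if (page->referenced && page->on_lru)
		page->active = true;
	else
		page->referenced = true;
}

static int count_active(void (*add)(struct pagevec *, struct page *))
{
	struct pagevec pvec = { 0 };
	struct page pages[NR_PAGES] = { { 0 } };
	int active = 0;

	for (int i = 0; i < NR_PAGES; i++) {
		/* One lru_cache_add per page, then two accesses, as with
		 * two 2048-byte writes landing in the same 4k page. */
		add(&pvec, &pages[i]);
		mark_page_accessed(&pages[i]);
		mark_page_accessed(&pages[i]);
	}
	for (int i = 0; i < NR_PAGES; i++)
		active += pages[i].active;
	return active;
}

int main(void)
{
	printf("old ordering: %d of %d pages activated\n",
	       count_active(lru_cache_add_old), NR_PAGES);
	printf("new ordering: %d of %d pages activated\n",
	       count_active(lru_cache_add_new), NR_PAGES);
	return 0;
}

Run as-is, the old ordering should report 2 of 28 pages activated (only the
pages that trigger a drain, i.e. the 14th and 28th), while the new ordering
reports 0 of 28 - the same every-PAGEVEC_SIZE'th-page pattern the changelog
describes on the active list.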