2012/6/11 KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxx>:
> (6/11/12 6:42 AM), Robin Dong wrote:
>> From: Robin Dong<sanbai@xxxxxxxxxx>
>>
>> When writing a new file with a 2048-byte buffer, such as write(fd, buffer, 2048), it will
>> call generic_perform_write() twice for every page:
>>
>>     write_begin
>>     mark_page_accessed(page)
>>     write_end
>>
>>     write_begin
>>     mark_page_accessed(page)
>>     write_end
>>
>> Pages 1 through 13 will be added to lru-pvecs in write_begin() and will *NOT* be added to
>> active_list even though they have been accessed twice, because they are not PageLRU(page).
>> But when the 14th page comes, all pages in lru-pvecs will be moved to inactive_list
>> (by __lru_cache_add()) in the first write_begin(), so now the 14th page *is* PageLRU(page).
>> And after the second write_end() only the 14th page will be in active_list.
>>
>> In a Hadoop environment, we do run into this situation: after writing a file, we find
>> that only the 14th, 28th, 42nd... pages are in active_list and the others are in inactive_list.
>> When kswapd runs and shrinks the inactive_list, the file only has the 14th, 28th... pages in memory,
>> the readahead request size is broken down to only 52k (13*4k), and the system's performance falls
>> dramatically.
>>
>> This problem can also be reproduced by the steps below (the machine has 8G memory):
>>
>>     1. dd if=/dev/zero of=/test/file.out bs=1024 count=1048576
>>     2. cat another 7.5G file to /dev/null
>>     3. vmtouch -m 1G -v /test/file.out, it will show:
>>
>>     /test/file.out
>>     [oooooooooooooooooooOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO] 187847/262144
>>
>>     the 'o' means some pages are in memory but some are not.
>>
>>
>> The solution for this problem is simple: the 14th page should be added to lru_add_pvecs
>> before mark_page_accessed() just as other pages.
>>
>> Signed-off-by: Robin Dong<sanbai@xxxxxxxxxx>
>> Reviewed-by: Minchan Kim<minchan@xxxxxxxxxx>
>> ---
>>  mm/swap.c |    8 +++++++-
>>  1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/swap.c b/mm/swap.c
>> index 4e7e2ec..08e83ad 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -394,13 +394,19 @@ void mark_page_accessed(struct page *page)
>>  }
>>  EXPORT_SYMBOL(mark_page_accessed);
>>
>> +/*
>> + * Check pagevec space before adding new page into as
>> + * it will prevent ununiform page status in
>> + * mark_page_accessed() after __lru_cache_add()
>> + */
>>  void __lru_cache_add(struct page *page, enum lru_list lru)
>>  {
>>  	struct pagevec *pvec = &get_cpu_var(lru_add_pvecs)[lru];
>>
>>  	page_cache_get(page);
>> -	if (!pagevec_add(pvec, page))
>> +	if (!pagevec_space(pvec))
>>  		__pagevec_lru_add(pvec, lru);
>> +	pagevec_add(pvec, page);
>>  	put_cpu_var(lru_add_pvecs);
>>  }
>>  EXPORT_SYMBOL(__lru_cache_add);
>
> No change from v1?
>

The only change from v1 is the function comment, added at Minchan Kim's
suggestion.

I know that the best solution may be to remove pagevecs completely, as you
say, but removing pagevecs would be a very long-term project (I guess),
because many developers will argue about it again and again before reaching
a compromise.
I don't think I have the power to make such a big change, so "hacking"
__lru_cache_add() would be a good solution for now, at least for many
Hadoop clusters :)

--
Best Regards,
Robin Dong
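
To make the 14th-page effect described in the patch easier to check, here is a small user-space model of it. This is only a sketch under stated assumptions, not kernel code: every `sim_*` name is hypothetical, the pagevec size of 14 is taken from the pattern in the report, and the model ignores per-CPU pagevecs, PageUnevictable, and reclaim. It only tracks which pages would be promoted when each page is written twice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SIM_PAGEVEC_SIZE 14   /* assumption: matches the 14-page pattern in the report */
#define SIM_MAX_PAGES 1024    /* hypothetical upper bound for this model */

struct sim_page {
	bool lru;         /* stands in for PageLRU */
	bool referenced;  /* stands in for PageReferenced */
	bool active;      /* stands in for PageActive */
};

struct sim_pagevec {
	size_t nr;
	struct sim_page *pages[SIM_PAGEVEC_SIZE];
};

/* Drain the pagevec: every buffered page becomes visible on the LRU. */
static void sim_flush(struct sim_pagevec *pvec)
{
	for (size_t i = 0; i < pvec->nr; i++)
		pvec->pages[i]->lru = true;
	pvec->nr = 0;
}

/* Old behaviour: add first, then flush if that add filled the pagevec. */
static void sim_lru_cache_add_old(struct sim_pagevec *pvec, struct sim_page *page)
{
	pvec->pages[pvec->nr++] = page;
	if (pvec->nr == SIM_PAGEVEC_SIZE)
		sim_flush(pvec);
}

/* Patched behaviour: flush an already-full pagevec first, then add. */
static void sim_lru_cache_add_new(struct sim_pagevec *pvec, struct sim_page *page)
{
	if (pvec->nr == SIM_PAGEVEC_SIZE)
		sim_flush(pvec);
	pvec->pages[pvec->nr++] = page;
}

/* Simplified mark_page_accessed(): promote only a referenced page on the LRU. */
static void sim_mark_page_accessed(struct sim_page *page)
{
	if (!page->active && page->referenced && page->lru) {
		page->active = true;
		page->referenced = false;
	} else if (!page->referenced) {
		page->referenced = true;
	}
}

/*
 * Model writing nr_pages pages with two half-page writes each.
 * Returns how many pages ended up active; active_out[i] reports each page.
 */
size_t sim_write_file(size_t nr_pages, bool patched, bool *active_out)
{
	static struct sim_page pages[SIM_MAX_PAGES];
	struct sim_pagevec pvec = { 0 };
	size_t nr_active = 0;

	assert(nr_pages <= SIM_MAX_PAGES);
	memset(pages, 0, sizeof(pages));

	for (size_t i = 0; i < nr_pages; i++) {
		/* First write: the page is created and queued for the LRU. */
		if (patched)
			sim_lru_cache_add_new(&pvec, &pages[i]);
		else
			sim_lru_cache_add_old(&pvec, &pages[i]);
		sim_mark_page_accessed(&pages[i]);

		/* Second write: the page is already in the page cache. */
		sim_mark_page_accessed(&pages[i]);
	}

	for (size_t i = 0; i < nr_pages; i++) {
		active_out[i] = pages[i].active;
		nr_active += pages[i].active;
	}
	return nr_active;
}
```

In this model, sim_write_file(42, false, active) marks only indexes 13, 27 and 41 (the 14th, 28th and 42nd pages) as active, while sim_write_file(42, true, active) marks none, so all pages of the file end up with the same status.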