Mel, On Apr 25, 2013, at 17:30, Mel Gorman wrote: > On Wed, Apr 24, 2013 at 10:26:50AM -0400, Theodore Ts'o wrote: >> On Tue, Apr 23, 2013 at 03:00:08PM -0700, Andrew Morton wrote: >>> That should fix things for now. Although it might be better to just do >>> >>> mark_page_accessed(page); /* to SetPageReferenced */ >>> lru_add_drain(); /* to SetPageLRU */ >>> >>> Because a) this was too early to decide that the page is >>> super-important and b) the second touch of this page should have a >>> mark_page_accessed() in it already. >> >> The question is do we really want to put lru_add_drain() into the ext4 >> file system code? That seems to pushing some fairly mm-specific >> knowledge into file system code. I'll do this if I have to do, but >> wouldn't be better if this was pushed into mark_page_accessed(), or >> some other new API was exported by the mm subsystem? >> > > I don't think we want to push lru_add_drain() into the ext4 code. It's > too specific of knowledge just to work around pagevecs. Before we rework > how pagevecs select what LRU to place a page, can we make sure that fixing > that will fix the problem? > what is "that"? puting lru_add_drain() in ext4 core? sure that is fixes problem with many small reads during large write. originally i have put shake_page() in ext4 code, but that have call lru_add_drain_all() so to exaggerated. Index: linux-stage/fs/ext4/mballoc.c =================================================================== --- linux-stage.orig/fs/ext4/mballoc.c 2013-03-19 10:55:52.000000000 +0200 +++ linux-stage/fs/ext4/mballoc.c 2013-03-19 10:59:02.000000000 +0200 @@ -900,8 +900,11 @@ static int ext4_mb_init_cache(struct pag incore = data; } } - if (likely(err == 0)) + if (likely(err == 0)) { SetPageUptodate(page); + /* make sure it's in active list */ + mark_page_accessed(page); + } out: if (bh) { @@ -957,6 +960,8 @@ int ext4_mb_init_group(struct super_bloc page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS); if (page) { BUG_ON(page->mapping != inode->i_mapping); + /* move to lru - should be lru_add_drain() */ + shake_page(page, 0); ret = ext4_mb_init_cache(page, NULL); if (ret) { unlock_page(page); @@ -986,6 +991,8 @@ int ext4_mb_init_group(struct super_bloc unlock_page(page); } else if (page) { BUG_ON(page->mapping != inode->i_mapping); + /* move to lru - should be lru_add_drain() */ + shake_page(page, 0); ret = ext4_mb_init_cache(page, bitmap); if (ret) { unlock_page(page); @@ -1087,6 +1094,7 @@ repeat_load_buddy: if (page) { BUG_ON(page->mapping != inode->i_mapping); if (!PageUptodate(page)) { + shake_page(page, 0); ret = ext4_mb_init_cache(page, NULL); if (ret) { unlock_page(page); @@ -1118,6 +1126,7 @@ repeat_load_buddy: if (page) { BUG_ON(page->mapping != inode->i_mapping); if (!PageUptodate(page)) { + shake_page(page, 0); ret = ext4_mb_init_cache(page, e4b->bd_bitmap); if (ret) { unlock_page(page); @@ -2500,6 +2509,8 @@ static int ext4_mb_init_backend(struct s * not in the inode hash, so it should never be found by iget(), but * this will avoid confusion if it ever shows up during debugging. */ sbi->s_buddy_cache->i_ino = EXT4_BAD_INO; + sbi->s_buddy_cache->i_state = I_NEW; +// mapping_set_unevictable(sbi->s_buddy_cache->i_mapping); EXT4_I(sbi->s_buddy_cache)->i_disksize = 0; for (i = 0; i < ngroups; i++) { desc = ext4_get_group_desc(sb, i, NULL); additional i_state = I_NEW need to prevent kill page cache from sysctl -w vm.drop_caches=3 > Andrew, can you try the following patch please? Also, is there any chance > you can describe in more detail what the workload does? lustre OSS node + IOR with file size twice more then OSS memory. > If it fails to boot, > remove the second that calls lru_add_drain_all() and try again. well, i will try. > > The patch looks deceptively simple, a downside from is is that workloads that > call mark_page_accessed() frequently will contend more on the zone->lru_lock > than it did previously. Moving lru_add_drain() to the ext4 could would > suffer the same contention problem. NO, isn't. we have call lru_add_drain() in new page allocation case, but mark_page_accessed called without differences - is page in page cache already or it's new allocated - so we have very small zone->lru_lock contention. > > Thanks. > > ---8<--- > mm: pagevec: Move inactive pages to active lists even if on a pagevec > > If a page is on a pagevec aimed at the inactive list then two subsequent > calls to mark_page_acessed() will still not move it to the active list. > This can cause a page to be reclaimed sooner than is expected. This > patch detects if an inactive page is not on the LRU and drains the > pagevec before promoting it. > > Not-signed-off > > diff --git a/mm/swap.c b/mm/swap.c > index 8a529a0..eac64fe 100644 > --- a/mm/swap.c > +++ b/mm/swap.c > @@ -437,7 +437,18 @@ void activate_page(struct page *page) > void mark_page_accessed(struct page *page) > { > if (!PageActive(page) && !PageUnevictable(page) && > - PageReferenced(page) && PageLRU(page)) { > + PageReferenced(page)) { > + /* Page could be in pagevec */ > + if (!PageLRU(page)) > + lru_add_drain(); > + > + /* > + * Weeeee, using in_atomic() like this is a hand-grenade. > + * Patch is for debugging purposes only, do not merge this. > + */ > + if (!PageLRU(page) && !in_atomic()) > + lru_add_drain_all(); > + > activate_page(page); > ClearPageReferenced(page); > } else if (!PageReferenced(page)) { -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html