> > > And I find the pageout works seem to have some problems with ext4. > > > For example, this can be easily triggered with 10 dd tasks running > > > inside the 100MB limited memcg: > > So journal thread is getting stuck while committing transaction. Most > > likely waiting for some dd thread to stop a transaction so that commit can > > proceed. The processes waiting in start_this_handle() are just secondary > > effect resulting from the first problem. It might be interesting to get > > stack traces of all bloked processes when the journal thread is stuck. > > For completeness of discussion, citing your conclusion on my private > data feed: > > : We enter memcg reclaim from grab_cache_page_write_begin() and are > : waiting in congestion_wait(). Because grab_cache_page_write_begin() is > : called with transaction started, this blocks transaction from > : committing and subsequently blocks all other activity on the > : filesystem. The fact is this isn't new with your patches, just your > : changes or the fact that we are running in a memory constrained cgroup > : make this more visible. Maybe I'm missing some deep FS restrictions, but can this page allocation (and the one in ext4_write_begin) be moved before ext4_journal_start()? So that the page reclaim can throttle the __GFP_WRITE allocations at will. --- fs/ext4/inode.c | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) --- linux.orig/fs/ext4/inode.c 2012-02-18 00:10:27.000000000 +0800 +++ linux/fs/ext4/inode.c 2012-02-18 00:31:19.000000000 +0800 @@ -2398,38 +2398,38 @@ static int ext4_da_write_begin(struct fi if (ext4_nonda_switch(inode->i_sb)) { *fsdata = (void *)FALL_BACK_TO_NONDELALLOC; return ext4_write_begin(file, mapping, pos, len, flags, pagep, fsdata); } *fsdata = (void *)0; trace_ext4_da_write_begin(inode, pos, len, flags); retry: + page = grab_cache_page_write_begin(mapping, index, flags); + if (!page) { + ret = -ENOMEM; + goto out; + } + *pagep = page; + /* * With delayed allocation, we don't log the i_disksize update * if there is delayed block allocation. But we still need * to journalling the i_disksize update if writes to the end * of file which has an already mapped buffer. */ handle = ext4_journal_start(inode, 1); if (IS_ERR(handle)) { ret = PTR_ERR(handle); + unlock_page(page); + page_cache_release(page); + if (pos + len > inode->i_size) + truncate_inode_pages(inode->i_mapping, inode->i_size); goto out; } - /* We cannot recurse into the filesystem as the transaction is already - * started */ - flags |= AOP_FLAG_NOFS; - - page = grab_cache_page_write_begin(mapping, index, flags); - if (!page) { - ext4_journal_stop(handle); - ret = -ENOMEM; - goto out; - } - *pagep = page; ret = __block_write_begin(page, pos, len, ext4_da_get_block_prep); if (ret < 0) { unlock_page(page); ext4_journal_stop(handle); page_cache_release(page); /* * block_write_begin may have instantiated a few blocks -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>