On Fri, Feb 10, 2012 at 07:47:06PM +0800, Wu Fengguang wrote:
> On Thu, Feb 09, 2012 at 09:51:31PM -0800, Greg Thelen wrote:
> > Have you encountered situations where it's desirable to have more than
> > 20% dirty ratio?  I imagine that if the dirty working set is larger
> > than 20% increasing dirty ratio would prevent rewrites.
>
> One may need to dirty some 40% sized in-memory data set and don't want
> to be throttled and trigger lots of I/O. In this case increasing the
> dirty ratio to 40% will do the job.
>
> But if there is another job doing heavy dirtying, that job will eat up
> the global 40% dirty limit and heavily impact the above job. This is
> one case the memcg dirty ratio can help a lot.
>
> > Leaking dirty memory to a root global dirty pool is concerning.  I
> > suspect that under some conditions such pages may remain in
> > root after writeback indefinitely as clean pages.  I admit this may
> > not be the common case, but having such leaks into root can allow low
> > priority jobs access entire machine denying service to higher priority
> > jobs.
>
> You are right. DoS can be achieved by
>
>         loop {
>                 dirty one more page
>                 access all previously dirtied pages
>         }

So there are situations that prefer the dirty pages to be strictly
contained within the memcg. For these use cases it looks worthwhile to
improve the page reclaim algorithms to handle the 100% dirty zone well.
I'd regard this as a much more promising direction than memcg dirty
ratio, because efforts on this are going to benefit the general kernel
as a whole.

The below patch aims to be the first step towards the goal. It turns
out to work pretty well for avoiding OOM, with reasonably good I/O
throughput and low CPU overheads.

Hopefully the page reclaim can be further improved to make the 100%
dirty zone a seriously supported and well performing case.

Thanks,
Fengguang
---
Subject: writeback: introduce the pageout work
Date: Thu Jul 29 14:41:19 CST 2010

This relays file pageout IOs to the flusher threads.

The ultimate target is to gracefully handle the LRU lists full of
dirty/writeback pages.

1) I/O efficiency

The flusher will piggyback around 1MB of nearby dirty pages for I/O
(XXX: make the chunk size adaptive to the bdi write bandwidth).

This takes advantage of the temporal/spatial locality in most workloads:
the nearby pages of one file are typically populated into the LRU at the
same time, hence will likely be close to each other in the LRU list.
Writing them in one shot helps clean more pages effectively for page
reclaim.

2) OOM avoidance and scan rate control

Typically we do the LRU scan w/o rate control and quickly get enough
clean pages, as long as the LRU lists are not full of dirty pages.

Or we can still get a number of freshly cleaned pages (moved to LRU tail
by end_page_writeback()) when the queued pageout I/O is completed within
tens of milliseconds.

However if the LRU list is small and full of dirty pages, it can be
quickly fully scanned and go OOM before the flusher manages to clean
enough pages.

Here a simple yet reliable scheme is employed to avoid OOM and keep scan
rate in sync with the I/O rate:

        if (PageReclaim(page))
                congestion_wait();

PG_reclaim plays the key role. When a dirty page is encountered, we
queue I/O for it, set PG_reclaim and put it back to the LRU head. So if
a PG_reclaim page is encountered again, it means the dirty page has not
yet been cleaned by the flusher after a full zone scan. It indicates we
are scanning faster than the I/O can complete and should take a nap.

The runtime behavior on a fully dirtied small LRU list would be: It will
start with a quick scan of the list, queuing all pages for I/O. Then the
scan will be slowed down by the PG_reclaim pages *adaptively* to match
the I/O bandwidth.

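To make the pacing argument above concrete, here is a toy userspace
model of it. This is only a sketch, not kernel code: every name in it
(toy_state, toy_stats, toy_flusher_run, toy_reclaim, io_per_wait) is
made up for illustration, and the flusher is simply assumed to clean a
fixed number of queued pages during each simulated congestion_wait().

```c
#include <assert.h>
#include <stddef.h>

/* Toy page states; PAGE_QUEUED stands in for "dirty + PG_reclaim set". */
enum toy_state { PAGE_DIRTY, PAGE_QUEUED, PAGE_CLEAN, PAGE_FREED };

struct toy_stats {
	unsigned long queued;	/* pages handed to the flusher */
	unsigned long waits;	/* simulated congestion_wait() calls */
	unsigned long freed;	/* pages reclaimed */
};

/* Assumed flusher model: clean up to 'budget' queued pages per wait. */
static void toy_flusher_run(enum toy_state *lru, size_t n, size_t budget)
{
	for (size_t i = 0; i < n && budget > 0; i++) {
		if (lru[i] == PAGE_QUEUED) {
			lru[i] = PAGE_CLEAN;
			budget--;
		}
	}
}

/* Rescan the list until every page is freed; returns the counters. */
static struct toy_stats toy_reclaim(enum toy_state *lru, size_t n,
				    size_t io_per_wait)
{
	struct toy_stats st = { 0, 0, 0 };

	assert(io_per_wait > 0);	/* otherwise no progress is possible */

	while (st.freed < n) {
		for (size_t i = 0; i < n; i++) {
			switch (lru[i]) {
			case PAGE_CLEAN:	/* reclaimable right away */
				lru[i] = PAGE_FREED;
				st.freed++;
				break;
			case PAGE_DIRTY:	/* first visit: queue I/O, mark it */
				lru[i] = PAGE_QUEUED;
				st.queued++;
				break;
			case PAGE_QUEUED:	/* seen again: scanning too fast */
				st.waits++;
				toy_flusher_run(lru, n, io_per_wait);
				break;		/* keep the page for the next pass */
			case PAGE_FREED:
				break;
			}
		}
	}
	return st;
}
```

In this model a fully dirty 64-page list with 8 pages cleaned per wait
is reclaimed after 8 waits, and with 1 page per wait after 64 waits: the
first pass only queues I/O, and after that the scanner is paced by how
fast the (assumed) flusher cleans pages.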
3) writeback work coordinations

To avoid memory allocations at page reclaim, a mempool for struct
wb_writeback_work is created.

wakeup_flusher_threads() is removed because it can easily delay the
more targeted pageout works and even exhaust the mempool reservations.
It's also often not I/O efficient, since it submits writeback works
with small ->nr_pages.

Background/periodic works will quit automatically (as done in another
patch), so as to clean the pages under reclaim ASAP. However for now the
sync work can still block us for a long time.

Jan Kara: limit the search scope.  Note that the limited search and work
pool is not a big problem: 1000 IOs in flight are typically more than
enough to saturate the disk. And the overheads of searching in the work
list didn't even show up in the perf report.

4) test case

Run 2 dd tasks in a 100MB memcg (a very handy test case from Greg Thelen):

        mkdir /cgroup/x
        echo 100M > /cgroup/x/memory.limit_in_bytes
        echo $$ > /cgroup/x/tasks

        for i in `seq 2`
        do
                dd if=/dev/zero of=/fs/f$i bs=1k count=1M &
        done

Before patch, the dd tasks are quickly OOM killed.
After patch, they run well with reasonably good performance and overheads:

1073741824 bytes (1.1 GB) copied, 22.2196 s, 48.3 MB/s
1073741824 bytes (1.1 GB) copied, 22.4675 s, 47.8 MB/s

iostat -kx 1

Device:  rrqm/s  wrqm/s    r/s     w/s  rkB/s      wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda        0.00    0.00   0.00  178.00   0.00   89568.00  1006.38    74.35  417.71   4.80  85.40
sda        0.00    2.00   0.00  191.00   0.00   94428.00   988.77    53.34  219.03   4.34  82.90
sda        0.00   20.00   0.00  196.00   0.00   97712.00   997.06    71.11  337.45   4.77  93.50
sda        0.00    5.00   0.00  175.00   0.00   84648.00   967.41    54.03  316.44   5.06  88.60
sda        0.00    0.00   0.00  186.00   0.00   92432.00   993.89    56.22  267.54   5.38 100.00
sda        0.00    1.00   0.00  183.00   0.00   90156.00   985.31    37.99  325.55   4.33  79.20
sda        0.00    0.00   0.00  175.00   0.00   88692.00  1013.62    48.70  218.43   4.69  82.10
sda        0.00    0.00   0.00  196.00   0.00   97528.00   995.18    43.38  236.87   5.10 100.00
sda        0.00    0.00   0.00  179.00   0.00   88648.00   990.48    45.83  285.43   5.59 100.00
sda        0.00    0.00   0.00  178.00   0.00   88500.00   994.38    28.28  158.89   4.99  88.80
sda        0.00    0.00   0.00  194.00   0.00   95852.00   988.16    32.58  167.39   5.15 100.00
sda        0.00    2.00   0.00  215.00   0.00  105996.00   986.01    41.72  201.43   4.65 100.00
sda        0.00    4.00   0.00  173.00   0.00   84332.00   974.94    50.48  260.23   5.76  99.60
sda        0.00    0.00   0.00  182.00   0.00   90312.00   992.44    36.83  212.07   5.49 100.00
sda        0.00    8.00   0.00  195.00   0.00   95940.50   984.01    50.18  221.06   5.13 100.00
sda        0.00    1.00   0.00  220.00   0.00  108852.00   989.56    40.99  202.68   4.55 100.00
sda        0.00    2.00   0.00  161.00   0.00   80384.00   998.56    37.19  268.49   6.21 100.00
sda        0.00    4.00   0.00  182.00   0.00   90830.00   998.13    50.58  239.77   5.49 100.00
sda        0.00    0.00   0.00  197.00   0.00   94877.00   963.22    36.68  196.79   5.08 100.00

avg-cpu:   %user   %nice %system %iowait  %steal   %idle
            0.25    0.00   15.08   33.92    0.00   50.75
            0.25    0.00   14.54   35.09    0.00   50.13
            0.50    0.00   13.57   32.41    0.00   53.52
            0.50    0.00   11.28   36.84    0.00   51.38
            0.50    0.00   15.75   32.00    0.00   51.75
            0.50    0.00   10.50   34.00    0.00   55.00
            0.50    0.00   17.63   27.46    0.00   54.41
            0.50    0.00   15.08   30.90    0.00   53.52
            0.50    0.00   11.28   32.83    0.00   55.39
            0.75    0.00   16.79   26.82    0.00   55.64
            0.50    0.00   16.08   29.15    0.00   54.27
            0.50    0.00   13.50   30.50    0.00   55.50
            0.50    0.00   14.32   35.18    0.00   50.00
            0.50    0.00   12.06   33.92    0.00   53.52
            0.50    0.00   17.29   30.58    0.00   51.63
            0.50    0.00   15.08   29.65    0.00   54.77
            0.50    0.00   12.53   29.32    0.00   57.64
            0.50    0.00   15.29   31.83    0.00   52.38

The global dd iostat for comparison:

Device:  rrqm/s  wrqm/s    r/s     w/s  rkB/s      wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda        0.00    0.00   0.00  189.00   0.00   95752.00  1013.25   143.09  684.48   5.29 100.00
sda        0.00    0.00   0.00  208.00   0.00  105480.00  1014.23   143.06  733.29   4.81 100.00
sda        0.00    0.00   0.00  161.00   0.00   81924.00  1017.69   141.71  757.79   6.21 100.00
sda        0.00    0.00   0.00  217.00   0.00  109580.00  1009.95   143.09  749.55   4.61 100.10
sda        0.00    0.00   0.00  187.00   0.00   94728.00  1013.13   144.31  773.67   5.35 100.00
sda        0.00    0.00   0.00  189.00   0.00   95752.00  1013.25   144.14  742.00   5.29 100.00
sda        0.00    0.00   0.00  177.00   0.00   90032.00  1017.31   143.32  656.59   5.65 100.00
sda        0.00    0.00   0.00  215.00   0.00  108640.00  1010.60   142.90  817.54   4.65 100.00
sda        0.00    2.00   0.00  166.00   0.00   83858.00  1010.34   143.64  808.61   6.02 100.00
sda        0.00    0.00   0.00  186.00   0.00   92813.00   997.99   141.18  736.95   5.38 100.00
sda        0.00    0.00   0.00  206.00   0.00  104456.00  1014.14   146.27  729.33   4.85 100.00
sda        0.00    0.00   0.00  213.00   0.00  107024.00  1004.92   143.25  705.70   4.69 100.00
sda        0.00    0.00   0.00  188.00   0.00   95748.00  1018.60   141.82  764.78   5.32 100.00

avg-cpu:   %user   %nice %system %iowait  %steal   %idle
            0.51    0.00   11.22   52.30    0.00   35.97
            0.25    0.00   10.15   52.54    0.00   37.06
            0.25    0.00    5.01   56.64    0.00   38.10
            0.51    0.00   15.15   43.94    0.00   40.40
            0.25    0.00   12.12   48.23    0.00   39.39
            0.51    0.00   11.20   53.94    0.00   34.35
            0.26    0.00    9.72   51.41    0.00   38.62
            0.76    0.00    9.62   50.63    0.00   38.99
            0.51    0.00   10.46   53.32    0.00   35.71
            0.51    0.00    9.41   51.91    0.00   38.17
            0.25    0.00   10.69   49.62    0.00   39.44
            0.51    0.00   12.21   52.67    0.00   34.61
            0.51    0.00   11.45   53.18    0.00   34.86

Note that it's data for XFS.
ext4 seems to have some problem with the workload: the majority pages are
found to be writeback pages, and the flusher ends up blocking on the
unconditional wait_on_page_writeback() in write_cache_pages_da() from
time to time...

XXX: commit NFS unstable pages via write_inode()
XXX: the added congestion_wait() may be undesirable in some situations

CC: Jan Kara <jack@xxxxxxx>
CC: Mel Gorman <mgorman@xxxxxxx>
CC: Rik van Riel <riel@xxxxxxxxxx>
CC: Greg Thelen <gthelen@xxxxxxxxxx>
CC: Minchan Kim <minchan.kim@xxxxxxxxx>
Signed-off-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
---
 fs/fs-writeback.c                |  165 ++++++++++++++++++++++++++++-
 include/linux/writeback.h        |    4
 include/trace/events/writeback.h |   12 +-
 mm/vmscan.c                      |   17 +-
 4 files changed, 184 insertions(+), 14 deletions(-)

--- linux.orig/mm/vmscan.c	2012-02-03 21:42:21.000000000 +0800
+++ linux/mm/vmscan.c	2012-02-11 17:28:54.000000000 +0800
@@ -813,6 +813,8 @@ static unsigned long shrink_page_list(st
 
 		if (PageWriteback(page)) {
 			nr_writeback++;
+			if (PageReclaim(page))
+				congestion_wait(BLK_RW_ASYNC, HZ/10);
 			/*
 			 * Synchronous reclaim cannot queue pages for
 			 * writeback due to the possibility of stack overflow
@@ -874,12 +876,15 @@ static unsigned long shrink_page_list(st
 			nr_dirty++;
 
 			/*
-			 * Only kswapd can writeback filesystem pages to
-			 * avoid risk of stack overflow but do not writeback
-			 * unless under significant pressure.
+			 * run into the visited page again: we are scanning
+			 * faster than the flusher can writeout dirty pages
 			 */
-			if (page_is_file_cache(page) &&
-					(!current_is_kswapd() || priority >= DEF_PRIORITY - 2)) {
+			if (page_is_file_cache(page) && PageReclaim(page)) {
+				congestion_wait(BLK_RW_ASYNC, HZ/10);
+				goto keep_locked;
+			}
+			if (page_is_file_cache(page) && mapping &&
+			    flush_inode_page(mapping, page, true) >= 0) {
 				/*
 				 * Immediately reclaim when written back.
 				 * Similar in principal to deactivate_page()
@@ -2382,8 +2387,6 @@ static unsigned long do_try_to_free_page
 		 */
 		writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
 		if (total_scanned > writeback_threshold) {
-			wakeup_flusher_threads(laptop_mode ? 0 : total_scanned,
-						WB_REASON_TRY_TO_FREE_PAGES);
 			sc->may_writepage = 1;
 		}
 
--- linux.orig/fs/fs-writeback.c	2012-02-03 21:42:16.000000000 +0800
+++ linux/fs/fs-writeback.c	2012-02-11 18:24:24.000000000 +0800
@@ -35,12 +35,21 @@
 #define MIN_WRITEBACK_PAGES	(4096UL >> (PAGE_CACHE_SHIFT - 10))
 
 /*
+ * When flushing an inode page (for page reclaim), try to piggy back up to
+ * 1MB nearby pages for IO efficiency. These pages will have good opportunity
+ * to be in the same LRU list.
+ */
+#define WRITE_AROUND_PAGES	(1024UL >> (PAGE_CACHE_SHIFT - 10))
+
+/*
  * Passed into wb_writeback(), essentially a subset of writeback_control
  */
 struct wb_writeback_work {
 	long nr_pages;
 	struct super_block *sb;
 	unsigned long *older_than_this;
+	struct inode *inode;
+	pgoff_t offset;
 	enum writeback_sync_modes sync_mode;
 	unsigned int tagged_writepages:1;
 	unsigned int for_kupdate:1;
@@ -65,6 +74,27 @@ struct wb_writeback_work {
  */
 int nr_pdflush_threads;
 
+static mempool_t *wb_work_mempool;
+
+static void *wb_work_alloc(gfp_t gfp_mask, void *pool_data)
+{
+	/*
+	 * bdi_flush_inode_range() may be called on page reclaim
+	 */
+	if (current->flags & PF_MEMALLOC)
+		return NULL;
+
+	return kmalloc(sizeof(struct wb_writeback_work), gfp_mask);
+}
+
+static __init int wb_work_init(void)
+{
+	wb_work_mempool = mempool_create(1024,
+					 wb_work_alloc, mempool_kfree, NULL);
+	return wb_work_mempool ? 0 : -ENOMEM;
+}
+fs_initcall(wb_work_init);
+
 /**
  * writeback_in_progress - determine whether there is writeback in progress
  * @bdi: the device's backing_dev_info structure.
@@ -129,7 +159,7 @@ __bdi_start_writeback(struct backing_dev
 	 * This is WB_SYNC_NONE writeback, so if allocation fails just
 	 * wakeup the thread for old dirty data writeback
 	 */
-	work = kzalloc(sizeof(*work), GFP_ATOMIC);
+	work = mempool_alloc(wb_work_mempool, GFP_NOWAIT);
 	if (!work) {
 		if (bdi->wb.task) {
 			trace_writeback_nowork(bdi);
@@ -138,6 +168,7 @@ __bdi_start_writeback(struct backing_dev
 		return;
 	}
 
+	memset(work, 0, sizeof(*work));
 	work->sync_mode	= WB_SYNC_NONE;
 	work->nr_pages	= nr_pages;
 	work->range_cyclic = range_cyclic;
@@ -186,6 +217,114 @@ void bdi_start_background_writeback(stru
 	spin_unlock_bh(&bdi->wb_lock);
 }
 
+static bool extend_writeback_range(struct wb_writeback_work *work,
+				   pgoff_t offset)
+{
+	pgoff_t end = work->offset + work->nr_pages;
+
+	if (offset >= work->offset && offset < end)
+		return true;
+
+	if (work->nr_pages >= 8 * WRITE_AROUND_PAGES)
+		return false;
+
+	/* the unsigned comparison helps eliminate one compare */
+	if (work->offset - offset < WRITE_AROUND_PAGES) {
+		work->nr_pages += WRITE_AROUND_PAGES;
+		work->offset -= WRITE_AROUND_PAGES;
+		return true;
+	}
+
+	if (offset - end < WRITE_AROUND_PAGES) {
+		work->nr_pages += WRITE_AROUND_PAGES;
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * schedule writeback on a range of inode pages.
+ */
+static struct wb_writeback_work *
+bdi_flush_inode_range(struct backing_dev_info *bdi,
+		      struct inode *inode,
+		      pgoff_t offset,
+		      pgoff_t len,
+		      bool wait)
+{
+	struct wb_writeback_work *work;
+
+	if (!igrab(inode))
+		return ERR_PTR(-ENOENT);
+
+	work = mempool_alloc(wb_work_mempool, wait ? GFP_NOIO : GFP_NOWAIT);
+	if (!work)
+		return ERR_PTR(-ENOMEM);
+
+	memset(work, 0, sizeof(*work));
+	work->sync_mode		= WB_SYNC_NONE;
+	work->inode		= inode;
+	work->offset		= offset;
+	work->nr_pages		= len;
+	work->reason		= WB_REASON_PAGEOUT;
+
+	bdi_queue_work(bdi, work);
+
+	return work;
+}
+
+/*
+ * Called by page reclaim code to flush the dirty page ASAP. Do write-around to
+ * improve IO throughput. The nearby pages will have good chance to reside in
+ * the same LRU list that vmscan is working on, and even close to each other
+ * inside the LRU list in the common case of sequential read/write.
+ *
+ * ret > 0: success, found/reused a previous writeback work
+ * ret = 0: success, allocated/queued a new writeback work
+ * ret < 0: failed
+ */
+long flush_inode_page(struct address_space *mapping,
+		      struct page *page,
+		      bool wait)
+{
+	struct backing_dev_info *bdi = mapping->backing_dev_info;
+	struct inode *inode = mapping->host;
+	pgoff_t offset = page->index;
+	pgoff_t len = 0;
+	struct wb_writeback_work *work;
+	long ret = -ENOENT;
+
+	if (unlikely(!inode))
+		goto out;
+
+	len = 1;
+	spin_lock_bh(&bdi->wb_lock);
+	list_for_each_entry_reverse(work, &bdi->work_list, list) {
+		if (work->inode != inode)
+			continue;
+		if (extend_writeback_range(work, offset)) {
+			ret = len;
+			offset = work->offset;
+			len = work->nr_pages;
+			break;
+		}
+		if (len++ > 100)	/* limit search depth */
+			break;
+	}
+	spin_unlock_bh(&bdi->wb_lock);
+
+	if (ret > 0)
+		goto out;
+
+	offset = round_down(offset, WRITE_AROUND_PAGES);
+	len = WRITE_AROUND_PAGES;
+	work = bdi_flush_inode_range(bdi, inode, offset, len, wait);
+	ret = IS_ERR(work) ? PTR_ERR(work) : 0;
+out:
+	return ret;
+}
+
 /*
  * Remove the inode from the writeback list it is on.
  */
@@ -833,6 +972,23 @@ static unsigned long get_nr_dirty_pages(
 		get_nr_dirty_inodes();
 }
 
+static long wb_flush_inode(struct bdi_writeback *wb,
+			   struct wb_writeback_work *work)
+{
+	struct writeback_control wbc = {
+		.sync_mode = WB_SYNC_NONE,
+		.nr_to_write = LONG_MAX,
+		.range_start = work->offset << PAGE_CACHE_SHIFT,
+		.range_end = (work->offset + work->nr_pages - 1)
+						<< PAGE_CACHE_SHIFT,
+	};
+
+	do_writepages(work->inode->i_mapping, &wbc);
+	iput(work->inode);
+
+	return LONG_MAX - wbc.nr_to_write;
+}
+
 static long wb_check_background_flush(struct bdi_writeback *wb)
 {
 	if (over_bground_thresh(wb->bdi)) {
@@ -905,7 +1061,10 @@ long wb_do_writeback(struct bdi_writebac
 
 		trace_writeback_exec(bdi, work);
 
-		wrote += wb_writeback(wb, work);
+		if (work->inode)
+			wrote += wb_flush_inode(wb, work);
+		else
+			wrote += wb_writeback(wb, work);
 
 		/*
 		 * Notify the caller of completion if this is a synchronous
@@ -914,7 +1073,7 @@ long wb_do_writeback(struct bdi_writebac
 		if (work->done)
 			complete(work->done);
 		else
-			kfree(work);
+			mempool_free(work, wb_work_mempool);
 	}
 
 	/*
--- linux.orig/include/trace/events/writeback.h	2012-02-10 21:54:14.000000000 +0800
+++ linux/include/trace/events/writeback.h	2012-02-11 16:49:18.000000000 +0800
@@ -23,7 +23,7 @@
 
 #define WB_WORK_REASON							\
 		{WB_REASON_BACKGROUND,		"background"},		\
-		{WB_REASON_TRY_TO_FREE_PAGES,	"try_to_free_pages"},	\
+		{WB_REASON_PAGEOUT,		"pageout"},		\
 		{WB_REASON_SYNC,		"sync"},		\
 		{WB_REASON_PERIODIC,		"periodic"},		\
 		{WB_REASON_LAPTOP_TIMER,	"laptop_timer"},	\
@@ -45,6 +45,8 @@ DECLARE_EVENT_CLASS(writeback_work_class
 		__field(int, range_cyclic)
 		__field(int, for_background)
 		__field(int, reason)
+		__field(unsigned long, ino)
+		__field(unsigned long, offset)
 	),
 	TP_fast_assign(
 		strncpy(__entry->name, dev_name(bdi->dev), 32);
@@ -55,9 +57,11 @@ DECLARE_EVENT_CLASS(writeback_work_class
 		__entry->range_cyclic = work->range_cyclic;
 		__entry->for_background	= work->for_background;
 		__entry->reason = work->reason;
+		__entry->ino = work->inode ? work->inode->i_ino : 0;
+		__entry->offset = work->offset;
 	),
 	TP_printk("bdi %s: sb_dev %d:%d nr_pages=%ld sync_mode=%d "
-		  "kupdate=%d range_cyclic=%d background=%d reason=%s",
+		  "kupdate=%d range_cyclic=%d background=%d reason=%s ino=%lu offset=%lu",
 		  __entry->name,
 		  MAJOR(__entry->sb_dev), MINOR(__entry->sb_dev),
 		  __entry->nr_pages,
@@ -65,7 +69,9 @@ DECLARE_EVENT_CLASS(writeback_work_class
 		  __entry->for_kupdate,
 		  __entry->range_cyclic,
 		  __entry->for_background,
-		  __print_symbolic(__entry->reason, WB_WORK_REASON)
+		  __print_symbolic(__entry->reason, WB_WORK_REASON),
+		  __entry->ino,
+		  __entry->offset
 	)
 );
 #define DEFINE_WRITEBACK_WORK_EVENT(name) \
--- linux.orig/include/linux/writeback.h	2012-02-11 09:53:53.000000000 +0800
+++ linux/include/linux/writeback.h	2012-02-11 16:49:36.000000000 +0800
@@ -40,7 +40,7 @@ enum writeback_sync_modes {
  */
 enum wb_reason {
 	WB_REASON_BACKGROUND,
-	WB_REASON_TRY_TO_FREE_PAGES,
+	WB_REASON_PAGEOUT,
 	WB_REASON_SYNC,
 	WB_REASON_PERIODIC,
 	WB_REASON_LAPTOP_TIMER,
@@ -94,6 +94,8 @@ long writeback_inodes_wb(struct bdi_writ
 				enum wb_reason reason);
 long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
 void wakeup_flusher_threads(long nr_pages, enum wb_reason reason);
+long flush_inode_page(struct address_space *mapping, struct page *page,
+		      bool wait);
 
 /* writeback.h requires fs.h; it, too, is not included from here. */
 static inline void wait_on_inode(struct inode *inode)
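
For anyone who wants to poke at the write-around range logic outside the
kernel, the following userspace sketch mirrors the arithmetic of
extend_writeback_range() and the round_down() in flush_inode_page().
It is only an illustration: the toy_* names are made up here, unsigned
long stands in for pgoff_t, and 4KB pages (a page shift of 12) are an
assumption, which makes the write-around chunk 256 pages (1MB).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT		12	/* assumption: 4KB pages */
/* 1024KB expressed in pages: 1024 >> 2 = 256 pages with 4KB pages */
#define TOY_WRITE_AROUND_PAGES	(1024UL >> (TOY_PAGE_SHIFT - 10))

/* Hypothetical stand-in for the offset/nr_pages part of the work item. */
struct toy_work {
	unsigned long offset;	/* first page index covered */
	unsigned long nr_pages;	/* number of pages covered */
};

/* Start of the aligned chunk that a newly queued work would cover. */
static unsigned long toy_chunk_start(unsigned long offset)
{
	return offset - (offset % TOY_WRITE_AROUND_PAGES);
}

/* Same decision sequence as extend_writeback_range() in the patch. */
static bool toy_extend_range(struct toy_work *work, unsigned long offset)
{
	unsigned long end;

	assert(work != NULL);
	end = work->offset + work->nr_pages;

	/* already covered by this work */
	if (offset >= work->offset && offset < end)
		return true;

	/* do not let a single work grow beyond 8 chunks */
	if (work->nr_pages >= 8 * TOY_WRITE_AROUND_PAGES)
		return false;

	/*
	 * Page lies less than one chunk before the range. If offset is
	 * past the range instead, the unsigned subtraction wraps to a
	 * huge value and the test fails, so no second compare is needed.
	 */
	if (work->offset - offset < TOY_WRITE_AROUND_PAGES) {
		work->nr_pages += TOY_WRITE_AROUND_PAGES;
		work->offset -= TOY_WRITE_AROUND_PAGES;
		return true;
	}

	/* page lies less than one chunk after the range */
	if (offset - end < TOY_WRITE_AROUND_PAGES) {
		work->nr_pages += TOY_WRITE_AROUND_PAGES;
		return true;
	}

	return false;
}
```

Starting from a work covering pages 256..511, a request for page 200
grows the range backwards to 0..511, a following request for page 600
grows it forwards to 0..767, and a request for page 5000 is rejected, in
which case the caller would queue a new work for the chunk starting at
toy_chunk_start(5000).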