On Fri, May 27, 2011 at 8:58 AM, Minchan Kim <minchan.kim@xxxxxxxxx> wrote:
> On Thu, May 26, 2011 at 5:17 AM, Andrew Lutomirski <luto@xxxxxxx> wrote:
>> On Tue, May 24, 2011 at 8:43 PM, KOSAKI Motohiro
>> <kosaki.motohiro@xxxxxxxxxxxxxx> wrote:
>>>
>>> Unfortunately, this log doesn't tell us why DM doesn't issue any swap IO. ;-)
>>> I doubt it's a DM issue. Can you please try to set up swap outside of DM?
>>>
>>
>> I can do one better: I can tell you how to reproduce the OOM in the
>> comfort of your own VM without using dm_crypt or a Sandy Bridge
>> laptop. This is on Fedora 15, but it really ought to work on any
>> x86_64 distribution that has kvm. You'll probably want at least 6GB
>> on your host machine because the VM wants 4GB of RAM.
>>
>> Here's how:
>>
>> Step 1: Clone git://gitorious.org/linux-test-utils/reproduce-annoying-mm-bug.git
>>
>> (You can browse here:)
>> https://gitorious.org/linux-test-utils/reproduce-annoying-mm-bug
>>
>> Instructions to reproduce the mm bug:
>>
>> Step 2: Build Linux v2.6.38.6 with config-2.6.38.6 and the patch
>> 0001-Minchan-patch-for-testing-23-05-2011.patch (both files are in the
>> git repo).
>>
>> Step 3: cd back to reproduce-annoying-mm-bug.
>>
>> Step 4: Type this:
>>
>> $ make
>> $ qemu-kvm -m 4G -smp 2 -kernel <linux_dir>/arch/x86/boot/bzImage -initrd initramfs.gz
>>
>> Step 5: Wait for the VM to boot (it's really fast) and then run ./repro_bug.sh.
>>
>> Step 6: Wait a bit and watch the fireworks. Note that it can take a
>> couple of minutes to reproduce the bug.
>>
>> Tested on my Sandy Bridge laptop and on a Xeon W3520.
>>
>> For whatever reason, on my laptop without the VM I can hit the bug
>> almost instantaneously. Maybe it's because I'm using dm-crypt on my
>> laptop.
>>
>> --Andy
>>
>> P.S. I think that the mk_trivial_initramfs.sh script is cute, and
>
> That's cool. :)
>
>> maybe I'll try to flesh it out and turn it into a real project some
>> day.
>>
>
> Thanks for the good test environment.
> Yesterday I tried to reproduce your problem on my system (4G DRAM) but
> unfortunately failed. I tried various settings but couldn't hit it.
> Maybe I need an 8G system or a Sandy Bridge box. :(
>
> Hi mm folks, it's the next round.
> Andrew Lutomirski's first problem, the kswapd hang, was solved by
> Mel's recent series (the !pgdat_balanced bug and shrink_slab cond_resched),
> which is the key for the problems James and Collins reported.
>
> Andrew's next problem is an early OOM kill.
>
> [   60.627550] cryptsetup invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
> [   60.627553] cryptsetup cpuset=/ mems_allowed=0
> [   60.627555] Pid: 1910, comm: cryptsetup Not tainted 2.6.38.6-no-fpu+ #47
> [   60.627556] Call Trace:
> [   60.627563]  [<ffffffff8107f9c5>] ? cpuset_print_task_mems_allowed+0x91/0x9c
> [   60.627567]  [<ffffffff810b3ef1>] ? dump_header+0x7f/0x1ba
> [   60.627570]  [<ffffffff8109e4d6>] ? trace_hardirqs_on+0x9/0x20
> [   60.627572]  [<ffffffff810b42ba>] ? oom_kill_process+0x50/0x24e
> [   60.627574]  [<ffffffff810b4961>] ? out_of_memory+0x2e4/0x359
> [   60.627576]  [<ffffffff810b879e>] ? __alloc_pages_nodemask+0x5f3/0x775
> [   60.627579]  [<ffffffff810e127e>] ? alloc_pages_current+0xbe/0xd8
> [   60.627581]  [<ffffffff810b2126>] ? __page_cache_alloc+0x77/0x7e
> [   60.627585]  [<ffffffff8135d009>] ? dm_table_unplug_all+0x52/0xed
> [   60.627587]  [<ffffffff810b9f74>] ? __do_page_cache_readahead+0x98/0x1a4
> [   60.627589]  [<ffffffff810ba321>] ? ra_submit+0x21/0x25
> [   60.627590]  [<ffffffff810ba4ee>] ? ondemand_readahead+0x1c9/0x1d8
> [   60.627592]  [<ffffffff810ba5dd>] ? page_cache_sync_readahead+0x3d/0x40
> [   60.627594]  [<ffffffff810b342d>] ? filemap_fault+0x119/0x36c
> [   60.627597]  [<ffffffff810caf5f>] ? __do_fault+0x56/0x342
> [   60.627600]  [<ffffffff810f5630>] ? lookup_page_cgroup+0x32/0x48
> [   60.627602]  [<ffffffff810cd437>] ? handle_pte_fault+0x29f/0x765
> [   60.627604]  [<ffffffff810ba75e>] ? add_page_to_lru_list+0x6e/0x73
> [   60.627606]  [<ffffffff810be487>] ? page_evictable+0x1b/0x8d
> [   60.627607]  [<ffffffff810bae36>] ? put_page+0x24/0x35
> [   60.627610]  [<ffffffff810cdbfc>] ? handle_mm_fault+0x18e/0x1a1
> [   60.627612]  [<ffffffff810cded2>] ? __get_user_pages+0x2c3/0x3ed
> [   60.627614]  [<ffffffff810cfb4b>] ? __mlock_vma_pages_range+0x67/0x6b
> [   60.627616]  [<ffffffff810cfc01>] ? do_mlock_pages+0xb2/0x11a
> [   60.627618]  [<ffffffff810d0448>] ? sys_mlockall+0x111/0x11c
> [   60.627621]  [<ffffffff81002a3b>] ? system_call_fastpath+0x16/0x1b
> [   60.627623] Mem-Info:
> [   60.627624] Node 0 DMA per-cpu:
> [   60.627626] CPU    0: hi:    0, btch:   1 usd:   0
> [   60.627627] CPU    1: hi:    0, btch:   1 usd:   0
> [   60.627628] CPU    2: hi:    0, btch:   1 usd:   0
> [   60.627629] CPU    3: hi:    0, btch:   1 usd:   0
> [   60.627630] Node 0 DMA32 per-cpu:
> [   60.627631] CPU    0: hi:  186, btch:  31 usd:   0
> [   60.627633] CPU    1: hi:  186, btch:  31 usd:   0
> [   60.627634] CPU    2: hi:  186, btch:  31 usd:   0
> [   60.627635] CPU    3: hi:  186, btch:  31 usd:   0
> [   60.627638] active_anon:51586 inactive_anon:17384 isolated_anon:0
> [   60.627639]  active_file:0 inactive_file:226 isolated_file:0
> [   60.627639]  unevictable:395661 dirty:0 writeback:3 unstable:0
> [   60.627640]  free:13258 slab_reclaimable:3979 slab_unreclaimable:9755
> [   60.627640]  mapped:11910 shmem:24046 pagetables:5062 bounce:0
> [   60.627642] Node 0 DMA free:8352kB min:340kB low:424kB high:508kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:952kB unevictable:6580kB isolated(anon):0kB isolated(file):0kB present:15676kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:16kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1645 all_unreclaimable? yes
> [   60.627649] lowmem_reserve[]: 0 2004 2004 2004
> [   60.627651] Node 0 DMA32 free:44680kB min:44712kB low:55888kB high:67068kB active_anon:206344kB inactive_anon:69536kB active_file:0kB inactive_file:0kB unevictable:1576064kB isolated(anon):0kB isolated(file):0kB present:2052320kB mlocked:47540kB dirty:0kB writeback:12kB mapped:47640kB shmem:96184kB slab_reclaimable:15900kB slab_unreclaimable:39020kB kernel_stack:2424kB pagetables:20248kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:499225 all_unreclaimable? yes
> [   60.627658] lowmem_reserve[]: 0 0 0 0
> [   60.627660] Node 0 DMA: 0*4kB 0*8kB 2*16kB 2*32kB 1*64kB 2*128kB 1*256kB 1*512kB 1*1024kB 3*2048kB 0*4096kB = 8352kB
> [   60.627665] Node 0 DMA32: 959*4kB 2071*8kB 682*16kB 165*32kB 27*64kB 4*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 1*4096kB = 44980kB
> [   60.627670] 419957 total pagecache pages
> [   60.627671] 0 pages in swap cache
> [   60.627672] Swap cache stats: add 137, delete 137, find 0/0
> [   60.627673] Free swap  = 6290904kB
> [   60.627674] Total swap = 6291452kB
> [   60.632560] 524272 pages RAM
> [   60.632562] 9451 pages reserved
> [   60.632563] 45558 pages shared
> [   60.632564] 469944 pages non-shared
>
> There are about 270M of anon pages and lots of free swap space in the
> system. Nonetheless, he saw the OOM. I think it doesn't make sense.
> As I read the log above, he used swap on a crypted device-mapper device
> and a 1.4G ramfs.
> Andy, right?
>
> The first thing I suspect is the big ramfs.
> During reclaim, shrink_page_list will start to cull mlocked pages.
> If there are many ramfs pages and working-set pages on the LRU, the
> reclaimer can't reclaim any page until it meets a non-unevictable page
> or a non-working-set page (!PG_referenced and !pte_young). His workload
> had lots of anon pages and ramfs pages. ramfs pages are unevictable, so
> they get culled, and anon pages are promoted very easily, so we can't
> reclaim them easily.
> It means zone->pages_scanned becomes very high, so eventually
> zone->all_unreclaimable gets set.
> As I read the log above, the number of LRU pages in the DMA32 zone is
> 68970 and the number of unevictable pages is 394016.
>
> 394016 + the working-set pages (I don't know how many) is almost equal
> to (68970 * 6 = 413820).
> So it's possible that zone->all_unreclaimable is set.
> I had already asked him privately to test the patch below, but it
> doesn't solve his problem.
> I still think we need the patch below, though: the same thing can happen
> whenever there are long runs of successive mlocked pages on the LRU.
>
> ===
>
> From e37f150328aedeea9a88b6190ab2b6e6c1067163 Mon Sep 17 00:00:00 2001
> From: Minchan Kim <minchan.kim@xxxxxxxxx>
> Date: Wed, 25 May 2011 07:09:17 +0900
> Subject: [PATCH 3/3] vmscan: decrease pages_scanned on unevictable page
>
> If there are many unevictable pages on the evictable LRU list (e.g. a big
> ramfs), shrink_page_list will move them to the unevictable list and can't
> reclaim any pages, but we have already increased zone->pages_scanned.
> If the situation repeats, the number of evictable LRU pages keeps
> decreasing while zone->pages_scanned keeps increasing without any page
> being reclaimed. That can turn on zone->all_unreclaimable, which is a
> totally false alarm.
>
> Signed-off-by: Minchan Kim <minchan.kim@xxxxxxxxx>
> ---
>  mm/vmscan.c |   22 +++++++++++++++++++---
>  1 files changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 08d3077..a7df813 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -700,7 +700,8 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
>  static unsigned long shrink_page_list(struct list_head *page_list,
>                                        struct zone *zone,
>                                        struct scan_control *sc,
> -                                      unsigned long *dirty_pages)
> +                                      unsigned long *dirty_pages,
> +                                      unsigned long *unevictable_pages)
>  {
>         LIST_HEAD(ret_pages);
>         LIST_HEAD(free_pages);
> @@ -708,6 +709,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>         unsigned long nr_dirty = 0;
>         unsigned long nr_congested = 0;
>         unsigned long nr_reclaimed = 0;
> +       unsigned long nr_unevictable = 0;
>
>         cond_resched();
>
> @@ -908,6 +910,7 @@ cull_mlocked:
>                         try_to_free_swap(page);
>                 unlock_page(page);
>                 putback_lru_page(page);
> +               nr_unevictable++;
>                 continue;
>
>  activate_locked:
> @@ -936,6 +939,7 @@ keep_lumpy:
>                 zone_set_flag(zone, ZONE_CONGESTED);
>
>         *dirty_pages = nr_dirty;
> +       *unevictable_pages = nr_unevictable;
>         free_page_list(&free_pages);
>
>         list_splice(&ret_pages, page_list);
> @@ -1372,6 +1376,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>         unsigned long nr_scanned;
>         unsigned long nr_reclaimed = 0;
>         unsigned long nr_dirty;
> +       unsigned long nr_unevictable;
>         unsigned long nr_taken;
>         unsigned long nr_anon;
>         unsigned long nr_file;
> @@ -1425,7 +1430,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>         spin_unlock_irq(&zone->lru_lock);
>
>         reclaim_mode = sc->reclaim_mode;
> -       nr_reclaimed = shrink_page_list(&page_list, zone, sc, &nr_dirty);
> +       nr_reclaimed = shrink_page_list(&page_list, zone, sc, &nr_dirty, &nr_unevictable);
>
>         /* Check if we should syncronously wait for writeback */
>         if ((nr_dirty && !(reclaim_mode & RECLAIM_MODE_SINGLE) &&
> @@ -1434,7 +1439,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>                 unsigned long nr_active = clear_active_flags(&page_list, NULL);
>                 count_vm_events(PGDEACTIVATE, nr_active);
>                 set_reclaim_mode(priority, sc, true);
> -               nr_reclaimed += shrink_page_list(&page_list, zone, sc, &nr_dirty);
> +               nr_reclaimed += shrink_page_list(&page_list, zone, sc,
> +                                                &nr_dirty, &nr_unevictable);
>         }
>
>         local_irq_disable();
> @@ -1442,6 +1448,16 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>                 __count_vm_events(KSWAPD_STEAL, nr_reclaimed);
>         __count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
>
> +       /*
> +        * Too many unevictable pages on the evictable LRU list (ex, a big
> +        * ramfs) can make zone->pages_scanned high and reduce the number of
> +        * pages on the evictable LRU as reclaim goes on.
> +        * It could turn on all_unreclaimable, which is a false alarm.
> +        */
> +       spin_lock(&zone->lru_lock);
> +       if (zone->pages_scanned >= nr_unevictable)
> +               zone->pages_scanned -= nr_unevictable;
> +       else
> +               zone->pages_scanned = 0;
> +       spin_unlock(&zone->lru_lock);
> +
>         putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
>
>         trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
> --
> 1.7.1
>
> ===
>
> Then, the second thing I suspect is zone_set_flag(zone, ZONE_CONGESTED).
> He used swap on a crypted device-mapper device.
> Device mapper could make I/O slow for his workload. It means we are more
> likely to hit ZONE_CONGESTED than with a normal swap device.
>
> Let's think about it.
> The swap device is very congested, so shrink_page_list would mark the
> zone as CONGESTED. Who clears ZONE_CONGESTED? There are two places, both
> in kswapd.
> One only runs for order > 0, so it's probably a no-op for Andy's
> workload (i.e. it's mostly order-0 allocations). The remaining one is below.
>
>                                  * If a zone reaches its high watermark,
>                                  * consider it to be no longer congested. It's
>                                  * possible there are dirty pages backed by
>                                  * congested BDIs but as pressure is relieved,
>                                  * spectulatively avoid congestion waits
>                                  */
>                                 zone_clear_flag(zone, ZONE_CONGESTED);
>                                 if (i <= *classzone_idx)
>                                         balanced += zone->present_pages;
>
> It only runs if the zone meets its high watermark. If allocation is
> faster than reclaim (which is true for a slow swap device), the zone
> would remain congested.
> It means swapout would block.
> As we can see from the OOM log, the DMA32 zone can't meet its high
> watermark.
>
> Does my guessing make sense?

Hi Andrew. I failed to reproduce your scenario on my machine, so would
you be willing to test this patch to prove the scenario above? The patch
is just a revert of 0e093d99 ("do not sleep on the congestion queue...")
for 2.6.38.6. I would like you to test it to prove the zone-congestion
scenario above. I based it on 2.6.38.6 so it is easy for you to apply;
it should apply cleanly to vanilla v2.6.38.6. You also have to add the
!pgdat_balanced and shrink_slab patches.

Thanks, Andrew.

--
Kind regards,
Minchan Kim
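For reference, before the patch itself: a minimal userspace sketch (plain C,
not kernel code) of the arithmetic behind the first scenario quoted above. It
assumes the 2.6.38-era heuristic that a zone is treated as all_unreclaimable
once zone->pages_scanned reaches roughly six times its reclaimable LRU pages;
the page counts are taken from the OOM report, and the working-set share is
unknown, as noted in the analysis.

#include <stdio.h>

int main(void)
{
        /* Figures from the DMA32 zone in the OOM report above. */
        unsigned long evictable_lru = 68970;   /* active_anon + inactive_anon pages */
        unsigned long unevictable   = 394016;  /* mlocked/ramfs pages on the evictable LRU */

        /*
         * Assumed heuristic: the zone is considered all_unreclaimable once
         * pages_scanned reaches ~6x its reclaimable LRU pages.  Culled
         * unevictable pages still bump pages_scanned, which is what the
         * PATCH 3/3 above subtracts back out.
         */
        unsigned long threshold = evictable_lru * 6;
        unsigned long scanned_from_culling = unevictable;

        printf("threshold            = %lu\n", threshold);
        printf("scanned from culling = %lu\n", scanned_from_culling);
        printf("gap left for rescanned working-set pages = %lu\n",
               threshold > scanned_from_culling ?
               threshold - scanned_from_culling : 0);
        return 0;
}

Compiling and running it just prints how close the culled ramfs pages alone
get to the threshold: 394016 of 413820, leaving a gap of 19804 pages that a
modest amount of rescanned working set could plausibly close.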
From 244e37f1f3978ff182b5e33b77b327e4f48bb438 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan.kim@xxxxxxxxx>
Date: Mon, 30 May 2011 02:23:49 +0900
Subject: [PATCH] Revert "writeback: do not sleep on the congestion queue if
 there are no congested BDIs or if significant congestion is not being
 encountered in the current zone"

This reverts commit 0e093d99763eb4cea09f8ca4f1d01f34e121d10b.

Conflicts:

        mm/vmscan.c

Signed-off-by: Minchan Kim <minchan.kim@xxxxxxxxx>
---
 include/linux/backing-dev.h      |    2 +-
 include/linux/mmzone.h           |    8 -----
 include/trace/events/writeback.h |    7 ----
 mm/backing-dev.c                 |   61 +------------------------------------
 mm/page_alloc.c                  |    4 +-
 mm/vmscan.c                      |   41 ++-----------------------
 6 files changed, 9 insertions(+), 114 deletions(-)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 4ce34fa..8b0ae8b 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -286,7 +286,7 @@ enum {
 void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
 void set_bdi_congested(struct backing_dev_info *bdi, int sync);
 long congestion_wait(int sync, long timeout);
-long wait_iff_congested(struct zone *zone, int sync, long timeout);
+
 static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
 {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 02ecb01..e1b16aa 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -424,9 +424,6 @@ struct zone {
 typedef enum {
         ZONE_RECLAIM_LOCKED,            /* prevents concurrent reclaim */
         ZONE_OOM_LOCKED,                /* zone is in OOM killer zonelist */
-        ZONE_CONGESTED,                 /* zone has many dirty pages backed by
-                                         * a congested BDI
-                                         */
 } zone_flags_t;

 static inline void zone_set_flag(struct zone *zone, zone_flags_t flag)
@@ -444,11 +441,6 @@ static inline void zone_clear_flag(struct zone *zone, zone_flags_t flag)
         clear_bit(flag, &zone->flags);
 }

-static inline int zone_is_reclaim_congested(const struct zone *zone)
-{
-        return test_bit(ZONE_CONGESTED, &zone->flags);
-}
-
 static inline int zone_is_reclaim_locked(const struct zone *zone)
 {
         return test_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 4e249b9..fc2b3a0 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -180,13 +180,6 @@ DEFINE_EVENT(writeback_congest_waited_template, writeback_congestion_wait,
         TP_ARGS(usec_timeout, usec_delayed)
 );

-DEFINE_EVENT(writeback_congest_waited_template, writeback_wait_iff_congested,
-
-        TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
-
-        TP_ARGS(usec_timeout, usec_delayed)
-);
-
 #endif /* _TRACE_WRITEBACK_H */

 /* This part must be outside protection */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 8e4ed88..c9e59de 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -729,7 +729,6 @@ static wait_queue_head_t congestion_wqh[2] = {
                 __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
                 __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
         };
-static atomic_t nr_bdi_congested[2];

 void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
 {
@@ -737,8 +736,7 @@ void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
         wait_queue_head_t *wqh = &congestion_wqh[sync];

         bit = sync ? BDI_sync_congested : BDI_async_congested;
-        if (test_and_clear_bit(bit, &bdi->state))
-                atomic_dec(&nr_bdi_congested[sync]);
+        clear_bit(bit, &bdi->state);
         smp_mb__after_clear_bit();
         if (waitqueue_active(wqh))
                 wake_up(wqh);
@@ -750,8 +748,7 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
         enum bdi_state bit;

         bit = sync ? BDI_sync_congested : BDI_async_congested;
-        if (!test_and_set_bit(bit, &bdi->state))
-                atomic_inc(&nr_bdi_congested[sync]);
+        set_bit(bit, &bdi->state);
 }
 EXPORT_SYMBOL(set_bdi_congested);

@@ -782,57 +779,3 @@ long congestion_wait(int sync, long timeout)
 }
 EXPORT_SYMBOL(congestion_wait);

-/**
- * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a zone to complete writes
- * @zone: A zone to check if it is heavily congested
- * @sync: SYNC or ASYNC IO
- * @timeout: timeout in jiffies
- *
- * In the event of a congested backing_dev (any backing_dev) and the given
- * @zone has experienced recent congestion, this waits for up to @timeout
- * jiffies for either a BDI to exit congestion of the given @sync queue
- * or a write to complete.
- *
- * In the absense of zone congestion, cond_resched() is called to yield
- * the processor if necessary but otherwise does not sleep.
- *
- * The return value is 0 if the sleep is for the full timeout. Otherwise,
- * it is the number of jiffies that were still remaining when the function
- * returned. return_value == timeout implies the function did not sleep.
- */
-long wait_iff_congested(struct zone *zone, int sync, long timeout)
-{
-        long ret;
-        unsigned long start = jiffies;
-        DEFINE_WAIT(wait);
-        wait_queue_head_t *wqh = &congestion_wqh[sync];
-
-        /*
-         * If there is no congestion, or heavy congestion is not being
-         * encountered in the current zone, yield if necessary instead
-         * of sleeping on the congestion queue
-         */
-        if (atomic_read(&nr_bdi_congested[sync]) == 0 ||
-            !zone_is_reclaim_congested(zone)) {
-                cond_resched();
-
-                /* In case we scheduled, work out time remaining */
-                ret = timeout - (jiffies - start);
-                if (ret < 0)
-                        ret = 0;
-
-                goto out;
-        }
-
-        /* Sleep until uncongested or a write happens */
-        prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-        ret = io_schedule_timeout(timeout);
-        finish_wait(wqh, &wait);
-
-out:
-        trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
-                                        jiffies_to_usecs(jiffies - start));
-
-        return ret;
-}
-EXPORT_SYMBOL(wait_iff_congested);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2828037..71e9842 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1929,7 +1929,7 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
                         preferred_zone, migratetype);

                 if (!page && gfp_mask & __GFP_NOFAIL)
-                        wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
+                        congestion_wait(BLK_RW_ASYNC, HZ/50);
         } while (!page && (gfp_mask & __GFP_NOFAIL));

         return page;
@@ -2137,7 +2137,7 @@ rebalance:
         pages_reclaimed += did_some_progress;
         if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
                 /* Wait for some write requests to complete then retry */
-                wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
+                congestion_wait(BLK_RW_ASYNC, HZ/50);
                 goto rebalance;
         } else {
                 /*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0665520..59de427 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -703,14 +703,11 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
  * shrink_page_list() returns the number of reclaimed pages
  */
 static unsigned long shrink_page_list(struct list_head *page_list,
-                                        struct zone *zone,
                                         struct scan_control *sc)
 {
         LIST_HEAD(ret_pages);
         LIST_HEAD(free_pages);
         int pgactivate = 0;
-        unsigned long nr_dirty = 0;
-        unsigned long nr_congested = 0;
         unsigned long nr_reclaimed = 0;

         cond_resched();
@@ -730,7 +727,6 @@ static unsigned long shrink_page_list(struct list_head *page_list,
                         goto keep;

                 VM_BUG_ON(PageActive(page));
-                VM_BUG_ON(page_zone(page) != zone);

                 sc->nr_scanned++;

@@ -808,8 +804,6 @@ static unsigned long shrink_page_list(struct list_head *page_list,
                 }

                 if (PageDirty(page)) {
-                        nr_dirty++;
-
                         if (references == PAGEREF_RECLAIM_CLEAN)
                                 goto keep_locked;
                         if (!may_enter_fs)
@@ -820,7 +814,6 @@ static unsigned long shrink_page_list(struct list_head *page_list,
                         /* Page is dirty, try to write it out here */
                         switch (pageout(page, mapping, sc)) {
                         case PAGE_KEEP:
-                                nr_congested++;
                                 goto keep_locked;
                         case PAGE_ACTIVATE:
                                 goto activate_locked;
@@ -931,15 +924,6 @@ keep_lumpy:
                 VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
         }

-        /*
-         * Tag a zone as congested if all the dirty pages encountered were
-         * backed by a congested BDI. In this case, reclaimers should just
-         * back off and wait for congestion to clear because further reclaim
-         * will encounter the same problem
-         */
-        if (nr_dirty == nr_congested && nr_dirty != 0)
-                zone_set_flag(zone, ZONE_CONGESTED);
-
         free_page_list(&free_pages);

         list_splice(&ret_pages, page_list);
@@ -1426,12 +1410,12 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,

         spin_unlock_irq(&zone->lru_lock);

-        nr_reclaimed = shrink_page_list(&page_list, zone, sc);
+        nr_reclaimed = shrink_page_list(&page_list, sc);

         /* Check if we should syncronously wait for writeback */
         if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
                 set_reclaim_mode(priority, sc, true);
-                nr_reclaimed += shrink_page_list(&page_list, zone, sc);
+                nr_reclaimed += shrink_page_list(&page_list, sc);
         }

         local_irq_disable();
@@ -2085,14 +2069,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,

                 /* Take a nap, wait for some writeback to complete */
                 if (!sc->hibernation_mode && sc->nr_scanned &&
-                    priority < DEF_PRIORITY - 2) {
-                        struct zone *preferred_zone;
-
-                        first_zones_zonelist(zonelist, gfp_zone(sc->gfp_mask),
-                                                &cpuset_current_mems_allowed,
-                                                &preferred_zone);
-                        wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/10);
-                }
+                    priority < DEF_PRIORITY - 2)
+                        congestion_wait(BLK_RW_ASYNC, HZ/10);
         }

 out:
@@ -2455,14 +2433,6 @@ loop_again:
                                                 min_wmark_pages(zone), end_zone, 0))
                                         has_under_min_watermark_zone = 1;
                         } else {
-                                /*
-                                 * If a zone reaches its high watermark,
-                                 * consider it to be no longer congested. It's
-                                 * possible there are dirty pages backed by
-                                 * congested BDIs but as pressure is relieved,
-                                 * spectulatively avoid congestion waits
-                                 */
-                                zone_clear_flag(zone, ZONE_CONGESTED);
                                 if (i <= *classzone_idx)
                                         balanced += zone->present_pages;
                         }
@@ -2546,9 +2516,6 @@ out:
                                 order = sc.order = 0;
                                 goto loop_again;
                         }
-
-                        /* If balanced, clear the congested flag */
-                        zone_clear_flag(zone, ZONE_CONGESTED);
                 }
         }
--
1.7.0.4
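For anyone deciding what behaviour change this revert actually tests, here is
a small self-contained C sketch of my reading of the hunks above; it is not
kernel code, just the decision a direct reclaimer ends up making. Before the
revert, wait_iff_congested() sleeps only when some BDI is congested and the
preferred zone is flagged ZONE_CONGESTED, otherwise it merely yields; after
the revert, congestion_wait() always sleeps for up to the timeout unless a
write completion wakes the queue first.

#include <stdio.h>
#include <stdbool.h>

/* What a direct reclaimer does while waiting for writeback to make progress. */
enum action { YIELD_ONLY, SLEEP_UP_TO_TIMEOUT };

/* With commit 0e093d99 applied (wait_iff_congested): sleep only when both
 * conditions hold, otherwise just cond_resched() and keep scanning. */
static enum action before_revert(bool any_bdi_congested, bool zone_congested)
{
        if (!any_bdi_congested || !zone_congested)
                return YIELD_ONLY;
        return SLEEP_UP_TO_TIMEOUT;
}

/* After this revert (plain congestion_wait): sleep unconditionally. */
static enum action after_revert(void)
{
        return SLEEP_UP_TO_TIMEOUT;
}

static const char *name(enum action a)
{
        return a == YIELD_ONLY ? "yield only" : "sleep up to timeout";
}

int main(void)
{
        puts("any_bdi_congested  zone_congested | before revert       | after revert");
        for (int bdi = 0; bdi <= 1; bdi++)
                for (int zone = 0; zone <= 1; zone++)
                        printf("%-18s %-14s | %-19s | %s\n",
                               bdi ? "yes" : "no", zone ? "yes" : "no",
                               name(before_revert(bdi, zone)),
                               name(after_revert()));
        return 0;
}

The two paths only differ when no BDI is congested or the zone is not flagged
congested; in those cases the reverted kernel goes back to sleeping instead of
immediately rescanning, which is the behaviour difference Andrew is being
asked to measure against the early OOM.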