Hi, Andrew,

Would you please have a look at the patch below and give your comments, if any? Thanks a lot!

Best Regards
Lisa Du

>-----Original Message-----
>From: Lisa Du
>Sent: August 12, 2013 9:46
>To: 'Johannes Weiner'
>Cc: Michal Hocko; linux-mm@xxxxxxxxx; Minchan Kim; KOSAKI Motohiro; Mel Gorman; Christoph Lameter; Bob Liu; Neil Zhang;
>Russell King - ARM Linux; Aaditya Kumar; yinghan@xxxxxxxxxx; npiggin@xxxxxxxxx; riel@xxxxxxxxxx;
>kamezawa.hiroyu@xxxxxxxxxxxxxx
>Subject: [resend] [PATCH V3] mm: vmscan: fix do_try_to_free_pages() livelock
>
>In this version:
>Reorder the check in pgdat_balanced() according to Johannes's comment.
>
>From 66a98566792b954e187dca251fbe3819aeb977b9 Mon Sep 17 00:00:00 2001
>From: Lisa Du <cldu@xxxxxxxxxxx>
>Date: Mon, 5 Aug 2013 09:26:57 +0800
>Subject: [PATCH] mm: vmscan: fix do_try_to_free_pages() livelock
>
>This patch is based on KOSAKI's work, with a little more description
>added; please refer to https://lkml.org/lkml/2012/6/14/74.
>
>I found that the system can enter a state where a zone has lots of free
>pages, but only order-0 and order-1 pages, which means the zone is
>heavily fragmented.  A high-order allocation can then cause a long stall
>(e.g. 60 seconds) in the direct reclaim path, especially in an
>environment with no swap and no compaction.  This problem happened on
>v3.4, but the issue still seems to exist in the current tree.  The
>reason is that do_try_to_free_pages() enters a livelock:
>
>kswapd will go to sleep if the zones have been fully scanned and are
>still not balanced, since kswapd sees little point in trying all over
>again and must avoid an infinite loop.  Instead it drops the reclaim
>order from high-order to order-0, because kswapd considers order-0 the
>most important (see commit 73ce02e9 for details).  If the watermarks are
>OK, kswapd goes back to sleep and may leave zone->all_unreclaimable == 0,
>on the assumption that high-order users can still perform direct reclaim
>if they wish.
>
>Direct reclaim, however, continues to reclaim for a high order that is
>not a COSTLY_ORDER, without invoking the oom-killer, until kswapd turns
>on zone->all_unreclaimable.  This is done to avoid a premature oom-kill,
>but it means direct reclaim depends on kswapd to break out of this loop.
>
>In the worst case, when kswapd sleeps forever, direct reclaim may keep
>reclaiming forever until something like a watchdog detects it and
>finally kills the process, as described in:
>http://thread.gmane.org/gmane.linux.kernel.mm/103737
>
>We can't turn on zone->all_unreclaimable from the direct reclaim path,
>because direct reclaim does not take any lock, so doing so would be
>racy.  Thus this patch removes the zone->all_unreclaimable field
>completely and recalculates the zone's reclaimable state every time it
>is needed.
>
>Note: we can't have direct reclaim look at zone->pages_scanned directly
>while kswapd keeps using zone->all_unreclaimable, because that is racy;
>commit 929bea7c71 (vmscan: all_unreclaimable() use
>zone->all_unreclaimable as a name) describes the details.
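>
>To make the livelock concrete, here is a simplified sketch of the direct
>reclaim retry logic (abbreviated from the shape of the v3.4-era
>do_try_to_free_pages(); not the exact code, locking and most details
>elided):
>
>	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
>		shrink_zones(zonelist, sc);
>		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
>			return sc->nr_reclaimed;
>	}
>
>	/*
>	 * all_unreclaimable() tests zone->all_unreclaimable, which only
>	 * kswapd sets.  While kswapd sleeps this stays false, so we keep
>	 * reporting "progress" and the page allocator keeps retrying
>	 * direct reclaim instead of falling back to the OOM killer.
>	 */
>	if (!all_unreclaimable(zonelist, sc))
>		return 1;	/* caller retries reclaim */
>	return 0;		/* allows oom-kill as a last resort */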
>
>Cc: Aaditya Kumar <aaditya.kumar.30@xxxxxxxxx>
>Cc: Ying Han <yinghan@xxxxxxxxxx>
>Cc: Nick Piggin <npiggin@xxxxxxxxx>
>Acked-by: Rik van Riel <riel@xxxxxxxxxx>
>Cc: Mel Gorman <mel@xxxxxxxxx>
>Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
>Cc: Christoph Lameter <cl@xxxxxxxxx>
>Cc: Bob Liu <lliubbo@xxxxxxxxx>
>Cc: Neil Zhang <zhangwm@xxxxxxxxxxx>
>Cc: Russell King - ARM Linux <linux@xxxxxxxxxxxxxxxx>
>Reviewed-by: Michal Hocko <mhocko@xxxxxxx>
>Acked-by: Minchan Kim <minchan@xxxxxxxxxx>
>Acked-by: Johannes Weiner <hannes@xxxxxxxxxxx>
>Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxxxxxxx>
>Signed-off-by: Lisa Du <cldu@xxxxxxxxxxx>
>---
> include/linux/mm_inline.h |   20 +++++++++++++++++++
> include/linux/mmzone.h    |    1 -
> include/linux/vmstat.h    |    1 -
> mm/page-writeback.c       |    1 +
> mm/page_alloc.c           |    5 +--
> mm/vmscan.c               |   47 +++++++++++---------------------------
> mm/vmstat.c               |    3 +-
> 7 files changed, 37 insertions(+), 41 deletions(-)
>
>diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
>index 1397ccf..e212fae 100644
>--- a/include/linux/mm_inline.h
>+++ b/include/linux/mm_inline.h
>@@ -2,6 +2,7 @@
> #define LINUX_MM_INLINE_H
> 
> #include <linux/huge_mm.h>
>+#include <linux/swap.h>
> 
> /**
>  * page_is_file_cache - should the page be on a file LRU or anon LRU?
>@@ -99,4 +100,23 @@ static __always_inline enum lru_list page_lru(struct page *page)
> 	return lru;
> }
> 
>+static inline unsigned long zone_reclaimable_pages(struct zone *zone)
>+{
>+	int nr;
>+
>+	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
>+	     zone_page_state(zone, NR_INACTIVE_FILE);
>+
>+	if (get_nr_swap_pages() > 0)
>+		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
>+		      zone_page_state(zone, NR_INACTIVE_ANON);
>+
>+	return nr;
>+}
>+
>+static inline bool zone_reclaimable(struct zone *zone)
>+{
>+	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
>+}
>+
> #endif
>diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>index af4a3b7..e835974 100644
>--- a/include/linux/mmzone.h
>+++ b/include/linux/mmzone.h
>@@ -352,7 +352,6 @@ struct zone {
> 	 * free areas of different sizes
> 	 */
> 	spinlock_t		lock;
>-	int			all_unreclaimable; /* All pages pinned */
> #if defined CONFIG_COMPACTION || defined CONFIG_CMA
> 	/* Set to true when the PG_migrate_skip bits should be cleared */
> 	bool			compact_blockskip_flush;
>diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
>index c586679..6fff004 100644
>--- a/include/linux/vmstat.h
>+++ b/include/linux/vmstat.h
>@@ -143,7 +143,6 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
> }
> 
> extern unsigned long global_reclaimable_pages(void);
>-extern unsigned long zone_reclaimable_pages(struct zone *zone);
> 
> #ifdef CONFIG_NUMA
> /*
>diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>index 3f0c895..62bfd92 100644
>--- a/mm/page-writeback.c
>+++ b/mm/page-writeback.c
>@@ -36,6 +36,7 @@
> #include <linux/pagevec.h>
> #include <linux/timer.h>
> #include <linux/sched/rt.h>
>+#include <linux/mm_inline.h>
> #include <trace/events/writeback.h>
> 
> /*
>diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>index b100255..19a18c0 100644
>--- a/mm/page_alloc.c
>+++ b/mm/page_alloc.c
>@@ -60,6 +60,7 @@
> #include <linux/page-debug-flags.h>
> #include <linux/hugetlb.h>
> #include <linux/sched/rt.h>
>+#include <linux/mm_inline.h>
> 
> #include <asm/sections.h>
> #include <asm/tlbflush.h>
>@@ -647,7 +648,6 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> 	int to_free = count;
> 
> 	spin_lock(&zone->lock);
>-	zone->all_unreclaimable = 0;
> 	zone->pages_scanned = 0;
> 
> 	while (to_free) {
>@@ -696,7 +696,6 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
> 				int migratetype)
> {
> 	spin_lock(&zone->lock);
>-	zone->all_unreclaimable = 0;
> 	zone->pages_scanned = 0;
> 
> 	__free_one_page(page, zone, order, migratetype);
>@@ -3095,7 +3094,7 @@ void show_free_areas(unsigned int filter)
> 			K(zone_page_state(zone, NR_FREE_CMA_PAGES)),
> 			K(zone_page_state(zone, NR_WRITEBACK_TEMP)),
> 			zone->pages_scanned,
>-			(zone->all_unreclaimable ? "yes" : "no")
>+			(!zone_reclaimable(zone) ? "yes" : "no")
> 			);
> 		printk("lowmem_reserve[]:");
> 		for (i = 0; i < MAX_NR_ZONES; i++)
>diff --git a/mm/vmscan.c b/mm/vmscan.c
>index 2cff0d4..3fe3d5d 100644
>--- a/mm/vmscan.c
>+++ b/mm/vmscan.c
>@@ -1789,7 +1789,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> 	 * latencies, so it's better to scan a minimum amount there as
> 	 * well.
> 	 */
>-	if (current_is_kswapd() && zone->all_unreclaimable)
>+	if (current_is_kswapd() && !zone_reclaimable(zone))
> 		force_scan = true;
> 	if (!global_reclaim(sc))
> 		force_scan = true;
>@@ -2244,8 +2244,8 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> 		if (global_reclaim(sc)) {
> 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> 				continue;
>-			if (zone->all_unreclaimable &&
>-			    sc->priority != DEF_PRIORITY)
>+			if (sc->priority != DEF_PRIORITY &&
>+			    !zone_reclaimable(zone))
> 				continue;	/* Let kswapd poll it */
> 			if (IS_ENABLED(CONFIG_COMPACTION)) {
> 				/*
>@@ -2283,11 +2283,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> 	return aborted_reclaim;
> }
> 
>-static bool zone_reclaimable(struct zone *zone)
>-{
>-	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
>-}
>-
> /* All zones in zonelist are unreclaimable? */
> static bool all_unreclaimable(struct zonelist *zonelist,
> 		struct scan_control *sc)
>@@ -2301,7 +2296,7 @@ static bool all_unreclaimable(struct zonelist *zonelist,
> 			continue;
> 		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> 			continue;
>-		if (!zone->all_unreclaimable)
>+		if (zone_reclaimable(zone))
> 			return false;
> 	}
> 
>@@ -2712,7 +2707,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
> 		 * DEF_PRIORITY. Effectively, it considers them balanced so
> 		 * they must be considered balanced here as well!
> 		 */
>-		if (zone->all_unreclaimable) {
>+		if (!zone_reclaimable(zone)) {
> 			balanced_pages += zone->managed_pages;
> 			continue;
> 		}
>@@ -2773,7 +2768,6 @@ static bool kswapd_shrink_zone(struct zone *zone,
> 			       unsigned long lru_pages,
> 			       unsigned long *nr_attempted)
> {
>-	unsigned long nr_slab;
> 	int testorder = sc->order;
> 	unsigned long balance_gap;
> 	struct reclaim_state *reclaim_state = current->reclaim_state;
>@@ -2818,15 +2812,12 @@ static bool kswapd_shrink_zone(struct zone *zone,
> 	shrink_zone(zone, sc);
> 
> 	reclaim_state->reclaimed_slab = 0;
>-	nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
>+	shrink_slab(&shrink, sc->nr_scanned, lru_pages);
> 	sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> 
> 	/* Account for the number of pages attempted to reclaim */
> 	*nr_attempted += sc->nr_to_reclaim;
> 
>-	if (nr_slab == 0 && !zone_reclaimable(zone))
>-		zone->all_unreclaimable = 1;
>-
> 	zone_clear_flag(zone, ZONE_WRITEBACK);
> 
> 	/*
>@@ -2835,7 +2826,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
> 	 * BDIs but as pressure is relieved, speculatively avoid congestion
> 	 * waits.
> 	 */
>-	if (!zone->all_unreclaimable &&
>+	if (zone_reclaimable(zone) &&
> 	    zone_balanced(zone, testorder, 0, classzone_idx)) {
> 		zone_clear_flag(zone, ZONE_CONGESTED);
> 		zone_clear_flag(zone, ZONE_TAIL_LRU_DIRTY);
>@@ -2901,8 +2892,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> 			if (!populated_zone(zone))
> 				continue;
> 
>-			if (zone->all_unreclaimable &&
>-			    sc.priority != DEF_PRIORITY)
>+			if (sc.priority != DEF_PRIORITY &&
>+			    !zone_reclaimable(zone))
> 				continue;
> 
> 			/*
>@@ -2980,8 +2971,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> 			if (!populated_zone(zone))
> 				continue;
> 
>-			if (zone->all_unreclaimable &&
>-			    sc.priority != DEF_PRIORITY)
>+			if (sc.priority != DEF_PRIORITY &&
>+			    !zone_reclaimable(zone))
> 				continue;
> 
> 			sc.nr_scanned = 0;
>@@ -3265,20 +3256,6 @@ unsigned long global_reclaimable_pages(void)
> 	return nr;
> }
> 
>-unsigned long zone_reclaimable_pages(struct zone *zone)
>-{
>-	int nr;
>-
>-	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
>-	     zone_page_state(zone, NR_INACTIVE_FILE);
>-
>-	if (get_nr_swap_pages() > 0)
>-		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
>-		      zone_page_state(zone, NR_INACTIVE_ANON);
>-
>-	return nr;
>-}
>-
> #ifdef CONFIG_HIBERNATION
> /*
>  * Try to free `nr_to_reclaim' of memory, system-wide, and return the number of
>@@ -3576,7 +3553,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> 	    zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
> 		return ZONE_RECLAIM_FULL;
> 
>-	if (zone->all_unreclaimable)
>+	if (!zone_reclaimable(zone))
> 		return ZONE_RECLAIM_FULL;
> 
> 	/*
>diff --git a/mm/vmstat.c b/mm/vmstat.c
>index 20c2ef4..c48f75b 100644
>--- a/mm/vmstat.c
>+++ b/mm/vmstat.c
>@@ -19,6 +19,7 @@
> #include <linux/math64.h>
> #include <linux/writeback.h>
> #include <linux/compaction.h>
>+#include <linux/mm_inline.h>
> 
> #ifdef CONFIG_VM_EVENT_COUNTERS
> DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
>@@ -1052,7 +1053,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
> 		   "\n  all_unreclaimable: %u"
> 		   "\n  start_pfn:         %lu"
> 		   "\n  inactive_ratio:    %u",
>-		   zone->all_unreclaimable,
>+		   !zone_reclaimable(zone),
> 		   zone->zone_start_pfn,
> 		   zone->inactive_ratio);
> 	seq_putc(m, '\n');
>-- 
>1.7.0.4
>
>
>Thanks!
>
>Best Regards
>Lisa Du
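
P.S. For anyone who wants to see the new heuristic in isolation: the patch
treats a zone as unreclaimable once zone->pages_scanned reaches six times
its reclaimable pages (pages_scanned is reset to 0 whenever a page is
freed, so any freeing activity makes the zone "reclaimable" again). Below
is a tiny stand-alone demonstration of just that threshold; it is plain
user-space C with made-up numbers, not kernel code:

	#include <stdbool.h>
	#include <stdio.h>

	/* Same predicate as the patch's zone_reclaimable(), on plain values. */
	static bool zone_reclaimable(unsigned long pages_scanned,
				     unsigned long reclaimable_pages)
	{
		return pages_scanned < reclaimable_pages * 6;
	}

	int main(void)
	{
		unsigned long reclaimable = 1000;	/* hypothetical zone */
		unsigned long scanned[] = { 0, 3000, 5999, 6000, 10000 };

		/* Crossing 6 * 1000 scanned pages flips the verdict. */
		for (unsigned int i = 0; i < sizeof(scanned) / sizeof(scanned[0]); i++)
			printf("scanned=%5lu -> %s\n", scanned[i],
			       zone_reclaimable(scanned[i], reclaimable) ?
			       "reclaimable" : "all_unreclaimable");
		return 0;
	}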