>-----Original Message-----
>From: Andrew Morton [mailto:akpm@xxxxxxxxxxxxxxxxxxxx]
>Sent: August 21, 2013 6:17
>To: Lisa Du
>Cc: Johannes Weiner; Michal Hocko; linux-mm@xxxxxxxxx; Minchan Kim; KOSAKI Motohiro; Mel Gorman; Christoph Lameter; Bob Liu;
>Neil Zhang; Russell King - ARM Linux; Aaditya Kumar; yinghan@xxxxxxxxxx; npiggin@xxxxxxxxx; riel@xxxxxxxxxx;
>kamezawa.hiroyu@xxxxxxxxxxxxxx
>Subject: Re: [resend] [PATCH V3] mm: vmscan: fix do_try_to_free_pages() livelock
>
>On Sun, 11 Aug 2013 18:46:08 -0700 Lisa Du <cldu@xxxxxxxxxxx> wrote:
>
>> In this version:
>> Reorder the check in pgdat_balanced according to Johannes's comment.
>>
>> From 66a98566792b954e187dca251fbe3819aeb977b9 Mon Sep 17 00:00:00 2001
>> From: Lisa Du <cldu@xxxxxxxxxxx>
>> Date: Mon, 5 Aug 2013 09:26:57 +0800
>> Subject: [PATCH] mm: vmscan: fix do_try_to_free_pages() livelock
>>
>> This patch is based on KOSAKI's work and I add a little more
>> description, please refer to https://lkml.org/lkml/2012/6/14/74.
>>
>> Currently, I found system can enter a state that there are lots of
>> free pages in a zone but only order-0 and order-1 pages which means
>> the zone is heavily fragmented, then high order allocation could make
>> direct reclaim path's long stall (e.g., 60 seconds) especially in no swap
>> and no compaction environment. This problem happened on v3.4, but it
>> seems issue still lives in current tree, the reason is
>> do_try_to_free_pages enter live lock:
>>
>> kswapd will go to sleep if the zones have been fully scanned and are
>> still not balanced. As kswapd thinks there's little point trying all
>> over again to avoid infinite loop. Instead it changes order from
>> high-order to 0-order because kswapd think order-0 is the most
>> important. Look at 73ce02e9 in detail. If watermarks are ok, kswapd
>> will go back to sleep and may leave zone->all_unreclaimable = 0.
>> It assume high-order users can still perform direct reclaim if they wish.
>>
>> Direct reclaim continue to reclaim for a high order which is not a
>> COSTLY_ORDER without oom-killer until kswapd turn on zone->all_unreclaimable.
>> This is because to avoid too early oom-kill. So it means
>> direct_reclaim depends on kswapd to break this loop.
>>
>> In worst case, direct-reclaim may continue to page reclaim forever
>> when kswapd sleeps forever until someone like watchdog detect and
>> finally kill the process. As described in:
>> http://thread.gmane.org/gmane.linux.kernel.mm/103737
>>
>> We can't turn on zone->all_unreclaimable from direct reclaim path
>> because direct reclaim path don't take any lock and this way is racy.
>
>I don't see that this is correct.  Page reclaim does racy things quite often, in the knowledge that the effects of a race are
>recoverable and small.

Maybe Kosaki can comment on this. I think the main reason may be that direct reclaim doesn't take any lock.

>
>> Thus this patch removes zone->all_unreclaimable field completely and
>> recalculates zone reclaimable state every time.
>>
>> Note: we can't take the idea that direct-reclaim see
>> zone->pages_scanned directly and kswapd continue to use
>> zone->all_unreclaimable. Because, it is racy. commit 929bea7c71
>> (vmscan: all_unreclaimable() use
>> zone->all_unreclaimable as a name) describes the detail.
>>
>> @@ -99,4 +100,23 @@ static __always_inline enum lru_list page_lru(struct page *page)
>>  	return lru;
>>  }
>>
>> +static inline unsigned long zone_reclaimable_pages(struct zone *zone)
>> +{
>> +	int nr;
>> +
>> +	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
>> +	     zone_page_state(zone, NR_INACTIVE_FILE);
>> +
>> +	if (get_nr_swap_pages() > 0)
>> +		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
>> +		      zone_page_state(zone, NR_INACTIVE_ANON);
>> +
>> +	return nr;
>> +}
>> +
>> +static inline bool zone_reclaimable(struct zone *zone)
>> +{
>> +	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
>> +}
>
>Inlining is often wrong.
>Uninlining just these two functions saves several hundred bytes of text in mm/.  That's three of someone
>else's cachelines which we didn't need to evict.

Would you explain more about why "inlining is often wrong"? Thanks a lot!

>
>And what the heck is up with that magical "6"?  Why not "7"?  "42"?

This magic number "6" was first introduced in commit d1908362ae0. Hi Minchan, do you remember why we chose this number? Thanks!

>
>At a minimum it needs extensive documentation which describes why "6"
>is the optimum value for all machines and workloads (good luck with
>that) and which describes the effects of altering this number and which helps people understand why we didn't make it a runtime
>tunable.
>
>I'll merge it for some testing (the lack of Tested-by's is conspicuous) but I don't want to put that random "6" into Linux core MM in
>its current state.

I tested the patch on kernel v3.4. It works fine and resolves the endless loop in the direct reclaim path, but I have not tested it on the latest kernel version.
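
To make the "6" discussion easier to follow, here is a minimal userspace sketch of the heuristic from the quoted hunk. It is only an illustration under stated assumptions: struct zone_model, ZONE_RECLAIM_SCAN_FACTOR, model_reclaimable_pages() and model_zone_reclaimable() are hypothetical names that do not exist in the patch or in the kernel, the swap count is passed in as a parameter instead of calling get_nr_swap_pages(), and the page count uses unsigned long where the patch uses int. The only point is to show the constant under a name, so its meaning can be documented in one place.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the few per-zone counters the heuristic reads. */
struct zone_model {
	unsigned long nr_active_file;
	unsigned long nr_inactive_file;
	unsigned long nr_active_anon;
	unsigned long nr_inactive_anon;
	unsigned long pages_scanned;
};

/*
 * Hypothetical name for the bare "6" in the patch: a zone counts as
 * reclaimable while fewer than 6x its reclaimable pages have been scanned.
 */
#define ZONE_RECLAIM_SCAN_FACTOR 6UL

/* File pages always count; anon pages count only when swap is available. */
unsigned long model_reclaimable_pages(const struct zone_model *z,
				      long nr_swap_pages)
{
	unsigned long nr;

	assert(z != NULL);

	nr = z->nr_active_file + z->nr_inactive_file;
	if (nr_swap_pages > 0)
		nr += z->nr_active_anon + z->nr_inactive_anon;

	return nr;
}

/* Same comparison as zone_reclaimable() in the hunk, with the named factor. */
bool model_zone_reclaimable(const struct zone_model *z, long nr_swap_pages)
{
	return z->pages_scanned <
	       model_reclaimable_pages(z, nr_swap_pages) * ZONE_RECLAIM_SCAN_FACTOR;
}
```

With 150 file pages and no swap, this model treats the zone as reclaimable until pages_scanned reaches 900 (150 * 6). With swap available, the anon pages raise that threshold.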