On Thu, Aug 22, 2013 at 2:24 PM, Minchan Kim <minchan@xxxxxxxxxx> wrote:
>
> Hello Lisa,
>
> Please fix your mail client.

I apologize for this. I tried to fix it, but there seems to still be an
issue, so I changed to gmail. If there is still a problem, please let me
know, thanks!

> On Wed, Aug 21, 2013 at 10:24:07PM -0700, Lisa Du wrote:
> > >-----Original Message-----
> > >From: Andrew Morton [mailto:akpm@xxxxxxxxxxxxxxxxxxxx]
> > >Sent: August 21, 2013 6:17
> > >To: Lisa Du
> > >Cc: Johannes Weiner; Michal Hocko; linux-mm@xxxxxxxxx; Minchan Kim;
> > >KOSAKI Motohiro; Mel Gorman; Christoph Lameter; Bob Liu; Neil Zhang;
> > >Russell King - ARM Linux; Aaditya Kumar; yinghan@xxxxxxxxxx;
> > >npiggin@xxxxxxxxx; riel@xxxxxxxxxx; kamezawa.hiroyu@xxxxxxxxxxxxxx
> > >Subject: Re: [resend] [PATCH V3] mm: vmscan: fix do_try_to_free_pages() livelock
> > >
> > >On Sun, 11 Aug 2013 18:46:08 -0700 Lisa Du <cldu@xxxxxxxxxxx> wrote:
> > >
> > >> In this version:
> > >> Reorder the check in pgdat_balanced according to Johannes's comment.
> > >>
> > >> >From 66a98566792b954e187dca251fbe3819aeb977b9 Mon Sep 17 00:00:00 2001
> > >> From: Lisa Du <cldu@xxxxxxxxxxx>
> > >> Date: Mon, 5 Aug 2013 09:26:57 +0800
> > >> Subject: [PATCH] mm: vmscan: fix do_try_to_free_pages() livelock
> > >>
> > >> This patch is based on KOSAKI's work; I added a little more
> > >> description. Please refer to https://lkml.org/lkml/2012/6/14/74.
> > >>
> > >> I found that the system can enter a state where a zone has lots of
> > >> free pages, but only order-0 and order-1 pages, which means the
> > >> zone is heavily fragmented. A high-order allocation can then cause
> > >> a long stall in the direct reclaim path (e.g. 60 seconds),
> > >> especially in a no-swap, no-compaction environment. This problem
> > >> happened on v3.4, but the issue seems to still exist in the
> > >> current tree. The reason is that do_try_to_free_pages() enters a
> > >> livelock:
> > >>
> > >> kswapd will go to sleep if the zones have been fully scanned and
> > >> are still not balanced.
> > >> As kswapd sees little point in trying all over again, to avoid an
> > >> infinite loop it changes the order from high-order to order-0,
> > >> because kswapd considers order-0 the most important; see commit
> > >> 73ce02e9 for the details. If the watermarks are OK, kswapd will go
> > >> back to sleep and may leave zone->all_unreclaimable == 0. It
> > >> assumes high-order users can still perform direct reclaim if they
> > >> wish.
> > >>
> > >> Direct reclaim continues to reclaim for a high order which is not
> > >> a COSTLY_ORDER, without invoking the oom-killer, until kswapd
> > >> turns on zone->all_unreclaimable. This is to avoid a too-early
> > >> oom-kill. So direct reclaim depends on kswapd to break this loop.
> > >>
> > >> In the worst case, direct reclaim may continue page reclaim
> > >> forever while kswapd sleeps forever, until someone like a watchdog
> > >> detects it and finally kills the process, as described in:
> > >> http://thread.gmane.org/gmane.linux.kernel.mm/103737
> > >>
> > >> We can't turn on zone->all_unreclaimable from the direct reclaim
> > >> path, because the direct reclaim path doesn't take any lock, so
> > >> doing it that way would be racy.
> > >
> > >I don't see that this is correct. Page reclaim does racy things
> > >quite often, in the knowledge that the effects of a race are
> > >recoverable and small.
> >
> > Maybe KOSAKI can give some comments; I think the main reason is that
> > direct reclaim doesn't take any lock.
>
> You have to write the problem out in detail.
> It isn't related to locking or races.
> The problem is that kswapd could sleep in the highly-fragmented case
> even though it was woken by a high-order allocation, because kswapd
> doesn't want to reclaim too many pages for a high-order allocation,
> so it only rechecks the order-0 watermarks once it has reclaimed
> enough for the high-order request:
>
>         /*
>          * Fragmentation may mean that the system cannot be rebalanced
>          * for high-order allocations in all zones.  If twice the
>          * allocation size has been reclaimed and the zones are still
>          * not balanced then recheck the watermarks at order-0 to
>          * prevent kswapd reclaiming excessively. Assume that a
>          * process requested a high-order can direct reclaim/compact.
>          */
>         if (order && sc.nr_reclaimed >= 2UL << order)
>                 order = sc.order = 0;
>
> But direct reclaim cannot meet kswapd's expectation: although the
> zone has lots of free pages, direct reclaim can't reclaim anything at
> all, because there is no swap device but lots of anon pages. So it
> should stop scanning and kill someone, and whether it does depends on
> this logic:
>
>         /* top priority shrink_zones still had more to do? don't OOM, then */
>         if (global_reclaim(sc) && !all_unreclaimable(zonelist, sc))
>                 return 1;
>
> In addition, all_unreclaimable() just checks zone->all_unreclaimable.
> Then, who sets it? kswapd. What is kswapd doing? Sleeping.
> So do_try_to_free_pages() returns 1, and in turn the direct reclaimer
> thinks there is progress, so it does not kill anything.

Thanks for the detailed description of this issue! It's exactly the
issue. I thought Andrew was asking about the statement "We can't turn
on zone->all_unreclaimable from direct reclaim path because direct
reclaim path don't take any lock and this way is racy." Actually I'm
not very clear on why turning on zone->all_unreclaimable in the direct
reclaim path would be bad; I just copied that statement from KOSAKI's
original patch.

> It's just an example from your log, and we can see more potential
> bugs. I agree with Michal that "all_unreclaimable was just a bad idea
> from the very beginning". It's very subtle and error-prone.

> > >> Thus this patch removes the zone->all_unreclaimable field
> > >> completely and recalculates the zone's reclaimable state every
> > >> time.
> > >>
> > >> Note: we can't take the approach of having direct reclaim look at
> > >> zone->pages_scanned directly while kswapd continues to use
> > >> zone->all_unreclaimable, because that is racy.
> > >> Commit 929bea7c71 ("vmscan: all_unreclaimable() use
> > >> zone->all_unreclaimable as a name") describes the details.
> > >>
> > >> @@ -99,4 +100,23 @@ static __always_inline enum lru_list page_lru(struct page *page)
> > >>  	return lru;
> > >>  }
> > >>
> > >> +static inline unsigned long zone_reclaimable_pages(struct zone *zone)
> > >> +{
> > >> +	int nr;
> > >> +
> > >> +	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
> > >> +	     zone_page_state(zone, NR_INACTIVE_FILE);
> > >> +
> > >> +	if (get_nr_swap_pages() > 0)
> > >> +		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
> > >> +		      zone_page_state(zone, NR_INACTIVE_ANON);
> > >> +
> > >> +	return nr;
> > >> +}
> > >> +
> > >> +static inline bool zone_reclaimable(struct zone *zone)
> > >> +{
> > >> +	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
> > >> +}
> > >
> > >Inlining is often wrong. Uninlining just these two functions saves
> > >several hundred bytes of text in mm/. That's three of someone
> > >else's cachelines which we didn't need to evict.
> >
> > Would you explain more about why "inlining is often wrong"? Thanks a lot!
>
> Andrew said it increases the binary size by several hundred bytes.
> Reclaim is totally a slow path in MM, so normally we don't accept
> such a bad trade (code size vs. speed).

> > >And what the heck is up with that magical "6"? Why not "7"? "42"?
> >
> > This magical number "6" was first defined in commit d1908362ae0.
> > Hi Minchan, do you remember why we set this number? Thanks!
>
> Not me. It was introduced by [1] a long time ago, and back then the
> multiplier was just 4; I don't know who chose it originally.
> Anyway, I agree it's very heuristic, but I have no better idea:
> some systems might want to delay the OOM kill as long as possible,
> because killing even one process is critical for them, while systems
> like Android want the OOM kill as soon as possible, since they prefer
> killing background processes over system slowness. So, IMHO, we would
> need a knob to tune it.
> [1] 4ff1ffb4870b007, [PATCH] oom: reclaim_mapped on oom.

Thanks for the explanation!

> > >At a minimum it needs extensive documentation which describes why
> > >"6" is the optimum value for all machines and workloads (good luck
> > >with that) and which describes the effects of altering this number
> > >and which helps people understand why we didn't make it a runtime
> > >tunable.
> > >
> > >I'll merge it for some testing (the lack of Tested-by's is
> > >conspicuous) but I don't want to put that random "6" into Linux
> > >core MM in its current state.
> >
> > I did the test on kernel v3.4; it works fine and solves the endless
> > loop in the direct reclaim path, but I have not tested with the
> > latest kernel version.
>
> --
> Kind regards,
> Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body
to majordomo@xxxxxxxxx. For more info on Linux MM, see:
http://www.linux-mm.org/ .