On 25 April 2012 04:37, Ying Han <yinghan@xxxxxxxxxx> wrote:
> On Mon, Apr 23, 2012 at 10:36 PM, Nick Piggin <npiggin@xxxxxxxxx> wrote:
>> On 24 April 2012 06:56, Ying Han <yinghan@xxxxxxxxxx> wrote:
>>> This is not a patch targeted to be merged at all, but an attempt to
>>> understand a piece of logic in global direct reclaim.
>>>
>>> There is logic in global direct reclaim where, if reclaim fails at
>>> priority 0 and zone->all_unreclaimable is not set, direct reclaim starts
>>> over from DEF_PRIORITY. In some extreme cases we have seen the system
>>> hang, very likely because direct reclaim enters an infinite loop there.
>>
>> Very likely, or definitely? Can you reproduce it? What workload?
>
> No, we don't have a reproducing workload for it yet. Everything is based
> on the watchdog dump file :(
>
>>
>>>
>>> There have been several patches trying to fix similar issues, and the
>>> latest one has a good summary of all the efforts:
>>>
>>> commit 929bea7c714220fc76ce3f75bef9056477c28e74
>>> Author: KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxxxxxxx>
>>> Date:   Thu Apr 14 15:22:12 2011 -0700
>>>
>>>     vmscan: all_unreclaimable() use zone->all_unreclaimable as a name
>>>
>>> Kosaki explained the problem triggered by zone->all_unreclaimable and
>>> zone->pages_scanned getting out of sync, where the latter was being
>>> checked by direct reclaim. However, even after that patch the problem
>>> remains: setting zone->all_unreclaimable is still asynchronous with
>>> whether the zone is actually reclaimable or not.
>>>
>>> The zone->all_unreclaimable flag is set by kswapd, which checks
>>> zone->pages_scanned in zone_reclaimable(). Is it possible to have
>>> zone->all_unreclaimable == false while the zone is actually
>>> unreclaimable?
>>>
>>> 1. While kswapd is in its reclaim priority loop, someone frees a page on
>>> the zone. That ends up resetting pages_scanned.
>>>
>>> 2. kswapd is frozen for whatever reason. I noticed Kosaki covered the
>>> hibernation case by checking oom_killer_disabled, but I am not sure that
>>> is everything we need to worry about. The key point is that direct
>>> reclaim relies on a flag which is set asynchronously by kswapd, and that
>>> doesn't sound safe.
>>>
>>> Instead of fixing the problem over and over, I am wondering why we have
>>> the logic "not OOM, but keep trying reclaim after a priority-0 reclaim
>>> failure" in the first place. Here is the patch that introduced it:
>>>
>>> commit 408d85441cd5a9bd6bc851d677a10c605ed8db5f
>>> Author: Nick Piggin <npiggin@xxxxxxx>
>>> Date:   Mon Sep 25 23:31:27 2006 -0700
>>>
>>>     [PATCH] oom: use unreclaimable info
>>>
>>> However, I didn't find a detailed description of the problem that commit
>>> was trying to fix, and I wonder whether the problem still exists after
>>> 5 years. I would be happy if it doesn't, in which case we could consider
>>> reverting the initial patch.
>>
>> The problem we were having is that processes would be killed at seemingly
>> random points in time, under heavy swapping, but long before all swap was
>> used.
>>
>> The particular problem IIRC was related to testing a lot of guests on an
>> s390 machine. I'm ashamed to have not included more information in the
>> changelog -- I suspect it was probably in a small batch of patches with a
>> description in the introductory mail that never got properly placed into
>> the patches :(
>>
>> There are certainly a lot of changes in the area since then, so I
>> couldn't be sure of what will happen by taking this out.
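For reference, the reclaim-side logic discussed above looks roughly like the
sketch below. This is a simplified, illustrative reconstruction loosely based
on mm/vmscan.c and mm/page_alloc.c of that era (around v3.x); helper names
and details vary between kernel versions, so treat it as a sketch rather
than verbatim kernel source:

	/* A zone counts as reclaimable while we have scanned less than
	 * six times its reclaimable pages. */
	static bool zone_reclaimable(struct zone *zone)
	{
		return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
	}

	/* kswapd (in balance_pgdat()) sets the flag based on that check: */
	if (nr_slab == 0 && !zone_reclaimable(zone))
		zone->all_unreclaimable = 1;

	/* ... while freeing a page back to the zone (free_one_page())
	 * clears both indicators -- this is case 1 above: */
	zone->all_unreclaimable = 0;
	zone->pages_scanned = 0;

	/* Direct reclaim, at the end of do_try_to_free_pages(), then
	 * trusts that flag: if any zone still looks reclaimable, report
	 * progress even though nothing was reclaimed at priority 0, and
	 * the caller will retry from DEF_PRIORITY. */
	/* top priority shrink_zones still had more to do? don't OOM, then */
	if (global_reclaim(sc) && !all_unreclaimable(zonelist, sc))
		return 1;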
>>
>> I don't think the page allocator "try harder" logic was enough to solve
>> the problem, and I think it was around in some form even back then.
>>
>> The biggest problem is that it's not an exact science. It will never do
>> the right thing for everybody, sadly. Even if it is able to allocate
>> pages at a very slow rate, this is effectively as good as a hang for some
>> users. For others, they want to be able to manually intervene before
>> anything is killed.
>>
>> Sorry if this isn't too helpful! Any ideas would be good. Possibly we
>> need a way to describe these behaviours in an abstract way (i.e., not
>> just magic numbers), and allow the user to tune it.
>
> Thank you Nick, this is helpful. I looked up the patches you mentioned,
> and I can see what problem they were trying to solve at the time. However,
> things have changed a lot, and it is hard to tell whether the problem
> still remains in the current kernel. Going through them one by one, I see
> that each patch has either been replaced by different logic or the same
> logic has been implemented differently.
>
> For this particular patch, we now have code which retries the page
> allocation before entering OOM. So I am wondering if that would have
> helped the OOM situation back then.

Well, it's not doing exactly the same thing, actually. And note that the
problem was not about parallel OOM-killing.

The fact that page reclaim has not made any progress when we last called in
does not actually mean that it cannot make _any_ progress. My patch is more
about detecting the latter case. I don't see equivalent logic in the page
allocator that could replace it.

But again: this is not a question of correct or incorrect as far as I can
see, simply a matter of where you define "hopeless"! I could easily see the
need for a way to bias that (kill quickly, medium, try to never kill).
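The allocator-side retry that Ying Han refers to sits in the page allocation
slow path. Again a simplified, illustrative sketch (loosely based on
__alloc_pages_slowpath() and should_alloc_retry() in mm/page_alloc.c around
v3.x; the "..." elide arguments and surrounding code, so this is not
verbatim source):

	rebalance:
		/* Try direct reclaim, then allocating again. */
		page = __alloc_pages_direct_reclaim(..., &did_some_progress);
		if (page)
			goto got_pg;

		/* The OOM killer is only considered when reclaim reports
		 * no progress at all. Since do_try_to_free_pages() returns
		 * 1 while some zone still looks reclaimable (see the
		 * earlier sketch), this path may never be taken. */
		if (!did_some_progress &&
		    (gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
			page = __alloc_pages_may_oom(...);
			if (page)
				goto got_pg;
			goto restart;
		}

		/* Otherwise keep trying: order <= PAGE_ALLOC_COSTLY_ORDER
		 * allocations effectively retry indefinitely here. */
		pages_reclaimed += did_some_progress;
		if (should_alloc_retry(gfp_mask, order, did_some_progress,
				       pages_reclaimed)) {
			/* Wait for some writeback to complete, then retry. */
			wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
			goto rebalance;
		}

Put together, the two loops illustrate the suspected hang from the start of
the thread: vmscan keeps reporting progress it did not make, so the
allocator neither fails the allocation nor invokes the OOM killer, and both
sides spin.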