On Fri, 2011-05-13 at 00:15 +0200, Johannes Weiner wrote:
> On Thu, May 12, 2011 at 05:04:41PM -0500, James Bottomley wrote:
> > On Thu, 2011-05-12 at 15:04 -0500, James Bottomley wrote:
> > > Confirmed, I'm afraid ... I can trigger the problem with all three
> > > patches under PREEMPT. It's not a hang this time, it's just kswapd
> > > taking 100% system time on 1 CPU and it won't calm down after I unload
> > > the system.
> >
> > Just on a "if you don't know what's wrong poke about and see" basis, I
> > sliced out all the complex logic in sleeping_prematurely() and, as far
> > as I can tell, it cures the problem behaviour. I've loaded up the
> > system, and taken the tar load generator through three runs without
> > producing a spinning kswapd (this is PREEMPT). I'll try with a
> > non-PREEMPT kernel shortly.
> >
> > What this seems to say is that there's a problem with the complex logic
> > in sleeping_prematurely(). I'm pretty sure hacking up
> > sleeping_prematurely() just to dump all the calculations is the wrong
> > thing to do, but perhaps someone can see what the right thing is ...
>
> I think I see the problem: the boolean logic of sleeping_prematurely()
> is odd. If it returns true, kswapd will keep running. So if
> pgdat_balanced() returns true, kswapd should go to sleep.
>
> This?

I was going to say this was a winner, but on the third untar run on
non-PREEMPT, I hit the kswapd livelock. It's got much farther than
previous attempts, which all hang on the first run, but I think the
essential problem is still (at least on this machine) that
sleeping_prematurely() is doing too much work for the wakeup storm that
allocators are causing.

Something that ratelimits the amount of time we spend in the watermark
calculations, like the below (which incorporates your pgdat fix) seems
to be much more stable (I've not run it for three full runs yet, but
kswapd CPU time is way lower so far).
The heuristic here is that if we're making the calculation more than ten
times in 1/10 of a second, stop and sleep anyway.

James

---

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0665520..545250c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2249,12 +2249,32 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 {
 	int i;
 	unsigned long balanced = 0;
-	bool all_zones_ok = true;
+	bool all_zones_ok = true, ret;
+	static int returned_true = 0;
+	static unsigned long prev_jiffies = 0;
+

 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
 		return true;

+	/* rate limit our entry to the watermark calculations */
+	if (time_after(prev_jiffies + HZ/10, jiffies)) {
+		/* previously returned false, do so again */
+		if (returned_true == 0)
+			return false;
+		/* or we've done the true calculation too many times */
+		if (returned_true++ > 10)
+			return false;
+
+		return true;
+	} else {
+		/* haven't been here for a while, reset the true count */
+		returned_true = 0;
+	}
+
+	prev_jiffies = jiffies;
+
 	/* Check the watermark levels */
 	for (i = 0; i < pgdat->nr_zones; i++) {
 		struct zone *zone = pgdat->node_zones + i;
@@ -2286,9 +2306,16 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 	 * must be balanced
 	 */
 	if (order)
-		return pgdat_balanced(pgdat, balanced, classzone_idx);
+		ret = !pgdat_balanced(pgdat, balanced, classzone_idx);
+	else
+		ret = !all_zones_ok;
+
+	if (ret)
+		returned_true++;
 	else
-		return !all_zones_ok;
+		returned_true = 0;
+
+	return ret;
 }

 /*
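For experimenting with the rate-limit heuristic outside the kernel, the following is a minimal userspace sketch of the same control flow. It is not kernel code: SKETCH_HZ, struct ratelimit_state, ticks_after(), premature_ratelimited() and count_and_say_unbalanced() are hypothetical names, the caller passes the current tick explicitly instead of reading jiffies, and the watermark calculation is replaced by a callback.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's HZ and the HZ/10 window. */
#define SKETCH_HZ	1000UL
#define SKETCH_WINDOW	(SKETCH_HZ / 10)	/* 1/10 of a second, in ticks */
#define SKETCH_MAX_TRUE	10

struct ratelimit_state {
	unsigned long prev_ticks;	/* tick of the last full calculation */
	int returned_true;		/* running count of "premature" answers */
};

/* Wraparound-tolerant "a is after b", modelled on the kernel's time_after(). */
static bool ticks_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/*
 * Returns true when the caller should stay awake. The expensive check is
 * only run when the last full calculation is at least SKETCH_WINDOW old.
 */
static bool premature_ratelimited(struct ratelimit_state *st,
				  unsigned long now,
				  bool (*expensive_check)(void *), void *arg)
{
	bool ret;

	if (ticks_after(st->prev_ticks + SKETCH_WINDOW, now)) {
		/* previously answered false: answer false again */
		if (st->returned_true == 0)
			return false;
		/* answered true too many times in this window: give up */
		if (st->returned_true++ > SKETCH_MAX_TRUE)
			return false;
		return true;
	}

	/* window expired: reset the count and do the real calculation */
	st->returned_true = 0;
	st->prev_ticks = now;

	ret = expensive_check(arg);
	if (ret)
		st->returned_true++;
	else
		st->returned_true = 0;
	return ret;
}

/* Example check: counts its own calls and always reports "not balanced". */
static bool count_and_say_unbalanced(void *arg)
{
	(*(int *)arg)++;
	return true;
}
```

With a check that always reports "not balanced", the first call does the full calculation, repeated calls inside the same window reuse that answer without recalculating, and once the count passes the threshold the function returns false so the caller can sleep. A call after the window has expired runs the calculation again.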