On Sat, Dec 08, 2012 at 05:01:42PM -0800, Linus Torvalds wrote:
> On Sat, 8 Dec 2012, Zlatko Calusic wrote:
> >
> > Or sooner... in short: nothing's changed!
> >
> > On a 4GB RAM system, where applications use close to 2GB, kswapd
> > likes to keep around 1GB free (unused), leaving only 1GB for
> > page/buffer cache. If I force bigger page cache by reading a big
> > file and thus use the unused 1GB of RAM, kswapd will soon (in a
> > matter of minutes) evict those (or other) pages out and once again
> > keep unused memory close to 1GB.
>
> Ok, guys, what was the reclaim or kswapd patch during the merge window
> that actually caused all of these insane problems?

I believe commit c6543459 (mm: remove __GFP_NO_KSWAPD) is the primary
candidate. __GFP_NO_KSWAPD was originally introduced for THP because
kswapd was reclaiming excessively: it would stay awake, aggressively
reclaiming, even if compaction had been deferred. The flag was removed
this cycle in the expectation that it was no longer necessary. I'm not
foisting the blame on Rik here; I was on the review list for that patch
and did not identify that it would cause this many problems either.

> It seems it was more fundamentally buggered than the fifteen-million
> fixes for kswapd we have already picked up.

It was already fundamentally buggered up. The difference is that in
earlier kernels it stayed asleep for THP requests.

There is a big difference between direct reclaim/compaction for THP and
kswapd doing the same work. Direct reclaim/compaction will try once,
give up quickly and defer requests in the near future to avoid impacting
the system heavily for THP. The same applies for khugepaged.

kswapd is different. It can keep going until the watermarks for a THP
allocation are met. There are two reasons why it might keep going for a
long time. The first is that compaction is being inefficient, which we
know it can be due to crap like this

	end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);

The second is if the highest zone is relatively small, because
compaction_suitable() will keep saying that allocations are failing due
to insufficient memory in that zone. kswapd then reclaims a little from
the highest zone and calls shrink_slab(), potentially dumping a large
amount of memory. This may be the case for Zlatko: on a 4G machine his
ZONE_NORMAL could be small, depending on how the 32-bit address space is
used by his hardware.

> (Ok, I may be exaggerating the number of patches, but it's starting to
> feel that way - I thought that 3.7 was going to be a calm and easy
> release, but the kswapd issues seem to just keep happening. We've been
> fighting the kswapd changes for a while now.)

Yes.

> Trying to keep a gigabyte free (presumably because that way we have
> lots of high-order allocation pages) is ridiculous. Is it one of the
> compaction changes?

Not directly. Compaction has been a bigger factor since 3.5 due to the
removal of lumpy reclaim, but it is not directly responsible for
excessive amounts of memory being kept free. The closest patch I'm aware
of that would cause problems of that nature is commit 83fde0f2 (mm:
vmscan: scale number of pages reclaimed by reclaim/compaction based on
failures), and it has already been reverted by commit 96710098.

> Mel? Ideas?

Consider reverting commit c6543459 (the removal of __GFP_NO_KSWAPD)
until this can be ironed out at a more reasonable pace. Rik? Johannes?

Zlatko, could you verify whether the slab shrinking is the issue with
this brutally ugly hack?
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b7ed376..2189d20 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2550,6 +2550,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 	unsigned long balanced;
 	int i;
 	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
+	bool should_shrink_slab = true;
 	unsigned long total_scanned;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long nr_soft_reclaimed;
@@ -2695,7 +2696,8 @@ loop_again:
 			shrink_zone(zone, &sc);
 
 			reclaim_state->reclaimed_slab = 0;
-			nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
+			if (should_shrink_slab)
+				nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
 			sc.nr_reclaimed += reclaim_state->reclaimed_slab;
 			total_scanned += sc.nr_scanned;
 
@@ -2817,6 +2819,16 @@ out:
 	if (order) {
 		int zones_need_compaction = 1;
 
+		/*
+		 * Shrinking slab for high-order allocs can cause an excessive
+		 * amount of memory to be dumped. Only shrink slab once per
+		 * round for high-order allocs.
+		 *
+		 * This is a very stupid hack. balance_pgdat() is in serious
+		 * need of a rework
+		 */
+		should_shrink_slab = false;
+
 		for (i = 0; i <= end_zone; i++) {
 			struct zone *zone = pgdat->node_zones + i;
 
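P.S. To show the ALIGN() complaint above in isolation, here is a minimal
userspace sketch. This is my own illustration rather than kernel code:
the pfn value is made up and pageblock_nr_pages is hardcoded to 512 (the
usual x86 value with 4K pages and 2M pageblocks).

	#include <stdio.h>

	/* same definition as in include/linux/kernel.h */
	#define ALIGN(x, a)	(((x) + (a) - 1) & ~((a) - 1))
	#define pageblock_nr_pages	512UL

	int main(void)
	{
		/* a migrate scanner positioned mid-pageblock */
		unsigned long low_pfn = 1100;

		/*
		 * What the quoted expression computes: it rounds up from a
		 * point one full pageblock ahead, so an unaligned scanner
		 * overshoots into the next pageblock.
		 */
		unsigned long end_pfn = ALIGN(low_pfn + pageblock_nr_pages,
					      pageblock_nr_pages);

		/* stopping at the end of the current pageblock instead */
		unsigned long boundary = ALIGN(low_pfn + 1,
					       pageblock_nr_pages);

		/* prints: scans 948 pfns, pageblock boundary is only 436 away */
		printf("scans %lu pfns, pageblock boundary is only %lu away\n",
		       end_pfn - low_pfn, boundary - low_pfn);
		return 0;
	}

For any low_pfn that is not already pageblock-aligned, the scan covers
almost two pageblocks instead of finishing the current one, which is the
kind of wasted work that keeps kswapd's compaction passes expensive.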