Re: [RFC PATCH] mm/vmscan: fix high cpu usage of kswapd if there

hejianet <hejianet@xxxxxxxxx> · Wed, 22 Feb 2017 22:31:50 +0800

Hi Michal

On 22/02/2017 7:41 PM, Michal Hocko wrote:
On Wed 22-02-17 17:04:48, Jia He wrote:
When I try to dynamically allocate more hugepages than the system's total
free memory can hold:
e.g. echo 4000 >/proc/sys/vm/nr_hugepages

I assume that the command has terminated with less huge pages allocated
than requested but

Yes, in the end fewer than 4000 hugepages were allocated:
HugePages_Total:    1864
HugePages_Free:     1864
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:      16384 kB

In the bad case, although kswapd takes 100% CPU, HugePages_Total does
not increase at all.

Node 3, zone      DMA
[...]
  pages free     2951
        min      2821
        low      3526
        high     4231

it left the zone below high watermark with

   node_scanned  0
        spanned  245760
        present  245760
        managed  245388
      nr_free_pages 2951
      nr_zone_inactive_anon 0
      nr_zone_active_anon 0
      nr_zone_inactive_file 0
      nr_zone_active_file 0

no pages reclaimable, so kswapd will not go to sleep. It would be quite
easy and comfortable to call it a misconfiguration but it seems that
it might be quite easy to hit with NUMA machines which have large
differences in the node sizes. I guess it makes sense to back off
the kswapd rather than burning CPU without any way to make forward
progress.
Agreed.
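
To make the stuck state concrete, here is a minimal userspace sketch of the sleep decision, using the Node 3 DMA numbers quoted above. The struct and function names are hypothetical and are not kernel APIs. The balance check is deliberately simplified to "free pages >= high watermark" and ignores allocation order and lowmem reserves.

```c
#include <stdbool.h>

/* Hypothetical userspace model of a zone; NOT the kernel's struct zone. */
struct zone_model {
	unsigned long free_pages;
	unsigned long high_wmark;
	unsigned long reclaimable_pages;
};

/* Simplified balance test: ignores order and lowmem reserves. */
static bool zone_balanced_model(const struct zone_model *z)
{
	return z->free_pages >= z->high_wmark;
}

/* Current behaviour: any unbalanced zone keeps kswapd awake. */
static bool may_sleep_original(const struct zone_model *z)
{
	return zone_balanced_model(z);
}

/*
 * Back-off variant: an unbalanced zone only keeps kswapd awake
 * if there is something left that reclaim could free.
 */
static bool may_sleep_backoff(const struct zone_model *z)
{
	if (!zone_balanced_model(z) && z->reclaimable_pages)
		return false;
	return true;
}
```

With free = 2951, high = 4231 and zero reclaimable pages, may_sleep_original() returns false forever (kswapd spins), while may_sleep_backoff() returns true.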

[...]

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 532a2a7..a05e3ab 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3139,7 +3139,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 		if (!managed_zone(zone))
 			continue;

-		if (!zone_balanced(zone, order, classzone_idx))
+		if (!zone_balanced(zone, order, classzone_idx)
+			&& zone_reclaimable_pages(zone))
 			return false;

OK, this makes some sense, although zone_reclaimable_pages doesn't count
SLAB reclaimable pages. So we might go to sleep with a reclaimable slab
still around. This is not really easy to address because the reclaimable
slab doesn't really imply that those pages will be reclaimed...
Yes, even in the bad case, when kswapd takes all the CPU, the number of
reclaimable pages does not decrease.
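
The slab point can be illustrated with a rough userspace model of what the reclaimable-page count covers. This assumes the helper sums only LRU pages (file pages always, anonymous pages only when swap is available), which is my reading of mm/vmscan.c around this time rather than something stated in this thread. All names below are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical per-zone counters, named after the zoneinfo fields above. */
struct zone_counters {
	unsigned long inactive_anon;
	unsigned long active_anon;
	unsigned long inactive_file;
	unsigned long active_file;
	unsigned long slab_reclaimable;	/* present, but never summed below */
};

/* Assumed model: only LRU pages count as reclaimable. */
static unsigned long reclaimable_pages_model(const struct zone_counters *c,
					     bool swap_available)
{
	unsigned long nr = c->inactive_file + c->active_file;

	/* Anonymous pages can only be reclaimed if there is swap for them. */
	if (swap_available)
		nr += c->inactive_anon + c->active_anon;

	return nr;
}
```

Under this model a zone whose only freeable memory is reclaimable slab reports 0, so a check based on it would let kswapd sleep.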

 	}

@@ -3502,6 +3503,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 {
 	pg_data_t *pgdat;
 	int z;
+	int node_has_relaimable_pages = 0;

 	if (!managed_zone(zone))
 		return;
@@ -3522,8 +3524,15 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)

 		if (zone_balanced(zone, order, classzone_idx))
 			return;
+
+		if (!zone_reclaimable_pages(zone))
+			node_has_relaimable_pages = 1;

What, this doesn't make any sense? Did you mean if (zone_reclaimable_pages)?
I mean that if any zone has reclaimable pages, then that zone's *node* has
reclaimable pages, so the kswapdN thread for that node should be woken up.
E.g. node 1 has 2 zones:
zone A has no reclaimable pages but zone B does.
Thus node 1 has reclaimable pages, and kswapd1 will be woken up.
I use node_has_relaimable_pages in the loop to check the reclaimable page
count of every zone, so I prefer the name node_has_relaimable_pages over
zone_has_relaimable_pages.
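
The intent described above can be sketched as a small userspace model. The types and names are hypothetical, not kernel code. Note that it uses the non-negated test (a zone with reclaimable pages sets the flag), which is what the explanation describes.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-zone summary; NOT the kernel's struct zone. */
struct node_zone_model {
	bool managed;
	bool balanced;
	unsigned long reclaimable_pages;
};

/* Returns true if the node's kswapd should be woken up. */
static bool should_wake_kswapd_model(const struct node_zone_model *zones,
				     size_t nr_zones)
{
	bool node_has_reclaimable_pages = false;
	size_t i;

	for (i = 0; i < nr_zones; i++) {
		if (!zones[i].managed)
			continue;

		/* Mirrors the existing early return on a balanced zone. */
		if (zones[i].balanced)
			return false;

		/* One zone with reclaimable pages is enough for the node. */
		if (zones[i].reclaimable_pages)
			node_has_reclaimable_pages = true;
	}

	return node_has_reclaimable_pages;
}
```

For the node 1 example (zone A with 0 reclaimable pages, zone B with some, neither balanced) this returns true. If both zones have 0 reclaimable pages it returns false and kswapd1 stays asleep.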

Did I understand it correctly? Thanks

B.R.
Jia

 	}

+	/* Dont wake kswapd if no reclaimable pages */
+	if (!node_has_relaimable_pages)
+		return;
+
 	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
 	wake_up_interruptible(&pgdat->kswapd_wait);
 }
--
1.8.5.6

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>

