Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU

On Mon, Nov 13, 2023 at 1:36 AM Jaroslav Pulchart
<jaroslav.pulchart@xxxxxxxxxxxx> wrote:
>
> >
> > On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
> > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:
> > >
> > > >
> > > > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> > > > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:
> > > > >
> > > > > >
> > > > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > > > > > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > >
> > > > > > > > Hi Jaroslav,
> > > > > > >
> > > > > > > Hi Yu Zhao
> > > > > > >
> > > > > > > thanks for response, see answers inline:
> > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > > > > > > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:
> > > > > > > > >
> > > > > > > > > Hello,
> > > > > > > > >
> > > > > > > > > I would like to report an unpleasant behavior of multi-gen LRU
> > > > > > > > > with strange swap in/out usage on my Dell 7525 two-socket AMD 74F3
> > > > > > > > > system (16 NUMA domains).
> > > > > > > >
> > > > > > > > Kernel version please?
> > > > > > >
> > > > > > > 6.5.y, but we saw it earlier as well; it has been under investigation
> > > > > > > since 23rd May (6.4.y and maybe even 6.3.y).
> > > > > >
> > > > > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> > > > > > for you if you run into other problems with v6.6.
> > > > > >
> > > > >
> > > > > I will give it a try using 6.6.y. If it works, we can switch to 6.6.y
> > > > > instead of backporting the fixes to 6.5.y.
> > > > >
> > > > > > > > > Symptoms of my issue are
> > > > > > > > >
> > > > > > > > > /A/ if multi-gen LRU is enabled
> > > > > > > > > 1/ [kswapd3] is consuming 100% CPU
> > > > > > > >
> > > > > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > > > > > > >
> > > > > > > > >     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34, 18.26, 15.01
> > > > > > > > >     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
> > > > > > > > >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,  0.4 si,  0.0 st
> > > > > > > > >     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
> > > > > > > > >     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
> > > > > > > > >     ...
> > > > > > > > >         765 root      20   0       0      0      0 R  98.3   0.0  34969:04 kswapd3
> > > > > > > > >     ...
> > > > > > > > > 2/ swap space usage is low, about ~4MB out of 8GB, with swap on zram
> > > > > > > > > (also observed with a swap disk, where it caused IO latency issues
> > > > > > > > > due to some kind of locking)
> > > > > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > /B/ if multi-gen LRU is disabled
> > > > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > > > > >     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05, 17.77, 14.77
> > > > > > > > >     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
> > > > > > > > >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,  0.4 si,  0.0 st
> > > > > > > > >     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
> > > > > > > > >     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
> > > > > > > > >     ...
> > > > > > > > >        765 root      20   0       0      0      0 S   3.6   0.0  34966:46 [kswapd3]
> > > > > > > > >     ...
> > > > > > > > > 2/ swap space usage is low (4MB)
> > > > > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > > > > > > >
> > > > > > > > > Both situations are wrong as they use swap in/out extensively;
> > > > > > > > > however, the multi-gen LRU situation is 10 times worse.
> > > > > > > >
> > > > > > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > > > > > both cases, the reclaim activities were as expected.
> > > > > > >
> > > > > > > I do not see a reason for the memory pressure and reclaims. It is true
> > > > > > > that this node has the lowest free memory of all nodes (~302MB free),
> > > > > > > however the swap space usage is just 4MB (still going in and out). So
> > > > > > > what can be the reason for that behaviour?
> > > > > >
> > > > > > The best analogy is that refueling (reclaim) happens before the tank
> > > > > > becomes empty, and it happens even sooner when there is a long road
> > > > > > ahead (high-order allocations).
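
To put rough numbers on that analogy: kswapd keeps reclaiming while free
pages sit below the high watermark plus a compaction gap for the pending
allocation order (compact_gap() is 2 << order in recent kernels'
mm/internal.h). A minimal userspace sketch of that condition follows; the
free-page and watermark figures in main() are made-up illustrations, not
values from this system:

#include <stdbool.h>
#include <stdio.h>

/* Compaction headroom: 2 << order pages (mm/internal.h in recent kernels). */
static unsigned long compact_gap(unsigned int order)
{
	return 2UL << order;
}

/* "Refuel before the tank is empty": keep reclaiming while free pages are
 * below the high watermark plus the gap needed for the pending order. */
static bool keep_reclaiming(unsigned long free_pages, unsigned long high_wmark,
			    unsigned int order)
{
	return free_pages < high_wmark + compact_gap(order);
}

int main(void)
{
	/* Hypothetical node: ~302 MB free (77312 4 KiB pages) and a made-up
	 * high watermark of 77000 pages. */
	unsigned long free_pages = 77312, high_wmark = 77000;

	printf("order 0: keep reclaiming? %d\n",
	       keep_reclaiming(free_pages, high_wmark, 0));	/* prints 0: above target */
	printf("order 9: keep reclaiming? %d\n",
	       keep_reclaiming(free_pages, high_wmark, 9));	/* prints 1: below target */
	return 0;
}

So a node can look fine for order-0 allocations yet still keep kswapd busy
when the wakeup was for a high order.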
> > > > > >
> > > > > > > The workers/application run in pre-allocated HugePages and the rest
> > > > > > > is used for a small set of system services and device drivers. It is
> > > > > > > static and not growing. The issue persists when I stop the system
> > > > > > > services and free the memory.
> > > > > >
> > > > > > Yes, this helps. Also, could you attach /proc/buddyinfo from the moment
> > > > > > you hit the problem?
> > > > > >
> > > > >
> > > > > I can. The problem is continuous: it is doing swap in/out 100% of the
> > > > > time, consuming 100% of the CPU and locking IO.
> > > > >
> > > > > The output of /proc/buddyinfo is:
> > > > >
> > > > > # cat /proc/buddyinfo
> > > > > Node 0, zone      DMA      7      2      2      1      1      2      1      1      1      2      1
> > > > > Node 0, zone    DMA32   4567   3395   1357    846    439    190     93     61     43     23      4
> > > > > Node 0, zone   Normal     19    190    140    129    136     75     66     41      9      1      5
> > > > > Node 1, zone   Normal    194   1210   2080   1800    715    255    111     56     42     36     55
> > > > > Node 2, zone   Normal    204    768   3766   3394   1742    468    185    194    238     47     74
> > > > > Node 3, zone   Normal   1622   2137   1058    846    388    208     97     44     14     42     10
> > > >
> > > > Again, thinking out loud: there is only one zone on node 3, i.e., the
> > > > Normal zone, and this rules out the problem fixed in v6.6 by commit
> > > > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
> > > > reclaim").
> > >
> > > I built vanilla 6.6.1 and did a first quick test - spinning up and
> > > destroying VMs only. This test does not always trigger the continuous
> > > kswapd3 swap in/out usage, but it does exercise it, and it looks like
> > > there is a change:
> > >
> > >  I can see non-continuous kswapd usage (15s of CPU time and counting) with 6.5.y:
> > >  # ps ax | grep [k]swapd
> > >     753 ?        S      0:00 [kswapd0]
> > >     754 ?        S      0:00 [kswapd1]
> > >     755 ?        S      0:00 [kswapd2]
> > >     756 ?        S      0:15 [kswapd3]    <<<<<<<<<
> > >     757 ?        S      0:00 [kswapd4]
> > >     758 ?        S      0:00 [kswapd5]
> > >     759 ?        S      0:00 [kswapd6]
> > >     760 ?        S      0:00 [kswapd7]
> > >     761 ?        S      0:00 [kswapd8]
> > >     762 ?        S      0:00 [kswapd9]
> > >     763 ?        S      0:00 [kswapd10]
> > >     764 ?        S      0:00 [kswapd11]
> > >     765 ?        S      0:00 [kswapd12]
> > >     766 ?        S      0:00 [kswapd13]
> > >     767 ?        S      0:00 [kswapd14]
> > >     768 ?        S      0:00 [kswapd15]
> > >
> > > and no kswapd usage with 6.6.1, which looks like a promising path:
> > >
> > > # ps ax | grep [k]swapd
> > >     808 ?        S      0:00 [kswapd0]
> > >     809 ?        S      0:00 [kswapd1]
> > >     810 ?        S      0:00 [kswapd2]
> > >     811 ?        S      0:00 [kswapd3]    <<<< nice
> > >     812 ?        S      0:00 [kswapd4]
> > >     813 ?        S      0:00 [kswapd5]
> > >     814 ?        S      0:00 [kswapd6]
> > >     815 ?        S      0:00 [kswapd7]
> > >     816 ?        S      0:00 [kswapd8]
> > >     817 ?        S      0:00 [kswapd9]
> > >     818 ?        S      0:00 [kswapd10]
> > >     819 ?        S      0:00 [kswapd11]
> > >     820 ?        S      0:00 [kswapd12]
> > >     821 ?        S      0:00 [kswapd13]
> > >     822 ?        S      0:00 [kswapd14]
> > >     823 ?        S      0:00 [kswapd15]
> > >
> > > I will install 6.6.1 on the server which is doing some work and
> > > observe it later today.
> >
> > Thanks. Fingers crossed.
>
> 6.6.y has been deployed and in use since 9th Nov, 3 PM CEST. So far so good.
> Node 3 has 163 MiB of free memory, and I have seen only occasional swap
> in/out (which is expected) and minimal kswapd3 CPU usage for almost 4 days.

Thanks for the update!

Just to confirm:
1. MGLRU was enabled, and
2. The v6.6 deployed did NOT have the patch I attached earlier.
Are both correct?

If so, I'd very much appreciate it if you could try the attached patch on
top of v6.5 and see if it helps. My suspicion is that the problem is
compaction related, i.e., kswapd was woken up by high-order allocations
but didn't properly stop. But what causes the behavior difference on
v6.5 between MGLRU and the active/inactive LRU still puzzles me -- the
problem might be somehow masked rather than fixed on v6.6.
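
If it helps with checking that theory on a live system, here is a rough
userspace sketch (an illustration only, not part of the attached patch) that
parses /proc/zoneinfo and reports whether each zone's free pages are above
its high watermark. If every zone on node 3 stays above "high" while kswapd3
keeps running, that would point at the overshoot described above. The field
names ("pages free", "high") assume the /proc/zoneinfo layout of recent
kernels:

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/zoneinfo", "r");
	char line[256], zone[64] = "";
	int node = -1;
	long free_pages = -1, high = -1, val;

	if (!f) {
		perror("/proc/zoneinfo");
		return 1;
	}

	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "Node %d, zone %63s", &node, zone) == 2) {
			/* New zone section starts. */
			free_pages = high = -1;
		} else if (sscanf(line, " pages free %ld", &val) == 1) {
			free_pages = val;
		} else if (high < 0 && sscanf(line, " high %ld", &val) == 1) {
			/* First "high" after the zone header is the watermark;
			 * the later per-cpu "high:" lines do not match here. */
			high = val;
			printf("node %d zone %-8s free %8ld high %8ld -> %s high watermark\n",
			       node, zone, free_pages, high,
			       free_pages >= high ? "above" : "below");
		}
	}

	fclose(f);
	return 0;
}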

For any other problems that you suspect might be related to MGLRU,
please let me know and I'd be happy to look into them as well.
From 0353f19ee5e7b44da225c8d3333f242babca7de7 Mon Sep 17 00:00:00 2001
From: Yu Zhao <yuzhao@google.com>
Date: Wed, 8 Nov 2023 14:56:58 -0700
Subject: [PATCH] mm/mglru: don't overshoot high watermarks

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 mm/vmscan.c | 36 ++++++++++++++++++++++++++++--------
 1 file changed, 28 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2fe4a11d63f4..80f2340037ca 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5329,20 +5329,41 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool
 	return try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, false) ? -1 : 0;
 }
 
-static unsigned long get_nr_to_reclaim(struct scan_control *sc)
+static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
 {
+	int i;
+	enum zone_watermarks mark;
+
 	/* don't abort memcg reclaim to ensure fairness */
 	if (!root_reclaim(sc))
-		return -1;
+		return false;
 
-	return max(sc->nr_to_reclaim, compact_gap(sc->order));
+	if (sc->nr_reclaimed >= max(sc->nr_to_reclaim, compact_gap(sc->order)))
+		return true;
+
+	/* kswapd should abort if all eligible zones are safe */
+	if (!current_is_kswapd())
+		return false;
+
+	mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
+	       WMARK_PROMO : WMARK_HIGH;
+
+	for (i = 0; i <= sc->reclaim_idx; i++) {
+		struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
+		unsigned long size = wmark_pages(zone, mark);
+
+		if (managed_zone(zone) &&
+		    !zone_watermark_ok(zone, sc->order, size, sc->reclaim_idx, 0))
+			return false;
+	}
+
+	return true;
 }
 
 static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 {
 	long nr_to_scan;
 	unsigned long scanned = 0;
-	unsigned long nr_to_reclaim = get_nr_to_reclaim(sc);
 	int swappiness = get_swappiness(lruvec, sc);
 
 	/* clean file folios are more likely to exist */
@@ -5364,7 +5385,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 		if (scanned >= nr_to_scan)
 			break;
 
-		if (sc->nr_reclaimed >= nr_to_reclaim)
+		if (should_abort_scan(lruvec, sc))
 			break;
 
 		cond_resched();
@@ -5425,7 +5446,6 @@ static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
 	struct lru_gen_folio *lrugen;
 	struct mem_cgroup *memcg;
 	const struct hlist_nulls_node *pos;
-	unsigned long nr_to_reclaim = get_nr_to_reclaim(sc);
 
 	bin = first_bin = get_random_u32_below(MEMCG_NR_BINS);
 restart:
@@ -5458,7 +5478,7 @@ static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
 
 		rcu_read_lock();
 
-		if (sc->nr_reclaimed >= nr_to_reclaim)
+		if (should_abort_scan(lruvec, sc))
 			break;
 	}
 
@@ -5469,7 +5489,7 @@ static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
 
 	mem_cgroup_put(memcg);
 
-	if (sc->nr_reclaimed >= nr_to_reclaim)
+	if (!is_a_nulls(pos))
 		return;
 
 	/* restart if raced with lru_gen_rotate_memcg() */
-- 
2.42.0.869.gea05f2083d-goog

