Re: [PATCH v2 2/5] cgroup: Account for memory_recursiveprot in test_memcg_low()

Hello all.

On Mon, May 09, 2022 at 05:44:24PM -0700, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> So I think we're OK with [2/5] now.  Unless there be objections, I'll
> be looking to get this series into mm-stable later this week.

I'm sorry, but I think the test in its current form reveals unexpected
reclaim behavior, and silencing the test is not the way to go.
I may still be convinced that my understanding is wrong, though.


On Mon, May 09, 2022 at 11:09:15AM -0400, Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> My understanding of the issue you're raising, Michal, is that
> protected siblings start with current > low, then get reclaimed
> slightly too much and end up with current < low. This results in a
> tiny bit of float that then gets assigned to the low=0 sibling; 

Up until here, we're on the same page.

> when that sibling gets reclaimed regardless, it sees a low event.
> Correct me if I missed a detail or nuance here.

Here, I'd like to stress that the event itself is just a messenger (whom
my original RFC patch attempted to get rid of). The problem is that if
the sibling with recursive protection is active enough to claim the
float, it is effectively stolen from the passive sibling. See the
comparison of 'precious' vs 'victim' in [1].

> But unused float going to siblings is intentional. This is documented
> in point 3 in the comment above effective_protection(): if you use
> less than you're legitimately claiming, the float goes to your
> siblings.

The problem is how the unused protection came to be (voluntarily not
consumed vs reclaimed).

> So the problem doesn't seem to be with low accounting and
> event generation, but rather it's simply overreclaim.

Exactly.

> It's conceivable to make reclaim more precise and then tighten up the
> test. But right now, David's patch looks correct to me.

The obvious fix is at the end of this message. It resolves the case I
posted earlier (with memory_recursiveprot); however, it "breaks"
memory.events:low accounting inside recursive children, so I don't
consider it finished. (I can elaborate on the breaking case if there is
interest; I also need to look into it more myself.)
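
For illustration, here is a userspace sketch of the arithmetic in the
protected branch of get_scan_count() that the patch at the end of this
message touches. Only the formula and the max(scan, SWAP_CLUSTER_MAX)
step come from the diff; the names scan_target(), scan_at_priority() and
round_up are hypothetical, and the model assumes a single lruvec per
cgroup and 4K pages, so treat it as a rough model rather than the actual
reclaim path.

```c
#include <assert.h>

/* SWAP_CLUSTER_MAX is 32 pages in mainline (include/linux/swap.h). */
#define SWAP_CLUSTER_MAX 32UL

/*
 * Hypothetical userspace model of the protected branch of
 * get_scan_count(). All sizes are in pages. round_up != 0 selects the
 * pre-patch behavior (round the target up to SWAP_CLUSTER_MAX).
 */
static unsigned long scan_target(unsigned long lruvec_size,
				 unsigned long cgroup_size,
				 unsigned long protection,
				 int round_up)
{
	unsigned long scan;

	/* This model only covers usage at or above the protection. */
	assert(cgroup_size >= protection);

	/* Scan only the share of the LRU that exceeds the protection. */
	scan = lruvec_size - lruvec_size * protection / (cgroup_size + 1);

	if (round_up && scan < SWAP_CLUSTER_MAX)
		scan = SWAP_CLUSTER_MAX;

	return scan;
}

/* get_scan_count() later shifts the target by the reclaim priority. */
static unsigned long scan_at_priority(unsigned long scan, int priority)
{
	return scan >> priority;
}
```

With s1 at .low=12800 pages (50M) and .current=12808 pages,
scan_target(12808, 12808, 12800, 0) yields 9 pages, whereas the rounded
variant yields 32. At priority 5 the rounded target still shifts down to
1 page while the unrounded one is already 0, which is where the ~5
priority levels come from (32 == 1 << 5).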


On Fri, May 06, 2022 at 09:40:15AM -0700, David Vernet <void@xxxxxxxxxxxxx> wrote:
> If you look at how much memory A/B/E gets at the end of the reclaim,
> it's still far less than 1MB (though should it be 0?).

This selftest runs two roughly equal workloads in the siblings. However,
if their activity differs, the outcome can even be the opposite (see the
example in [1]).

> This definitely sounds to me like a useful testcase to add, and I'm
> happy to do so in a follow-on patch. If we added this, do you think
> we need to keep the check for memory.low events for the memory.low ==
> 0 child in the overcommit testcase?

I think it's still useful, to check the behavior when inherited and
explicit siblings coexist under a protected parent.
Actually, the second case, where all siblings have the inherited
(implicit) protection, is also interesting (it seems that's what I'm
seeing in my tests with the attached patch).

+Cc: Chris, who reasoned about the SWAP_CLUSTER_MAX rounding vs too high
priority (too low numerically IIUC) [2].

Michal

[1] https://lore.kernel.org/r/20220325103118.GC2828@xxxxxxxxxxxxxxxxx/
[2] https://lore.kernel.org/all/20190128214213.GB15349@xxxxxxxxxxxxxx/

--- 8< ---
From e18caf7a5a1b0f39185fbdc11e4034def42cde88 Mon Sep 17 00:00:00 2001
From: Michal Koutný <mkoutny@xxxxxxxx>
Date: Tue, 10 May 2022 18:48:31 +0200
Subject: [RFC PATCH] mm: memcg: Do not overreclaim SWAP_CLUSTER_MAX from
 protected memcg
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This was observed with the memcontrol selftest and a new LTP test, but
it can also be reproduced in a simplified setup of two siblings:

	` parent  .low=50M
	  ` s1    .low=50M  .current=50M+ε
	  ` s2    .low=0M   .current=50M

The expectation is that s2/memory.events:low stays zero under an outer
reclaimer, since no protection should be given to cgroup s2 (even with
memory_recursiveprot).

However, this does not happen. The apparent reason is that when s1 is
considered for (proportional) reclaim, the scanned proportion is rounded
up to SWAP_CLUSTER_MAX and a slightly over-proportional amount is
reclaimed. Consequently, when the effective low value of s2 is
calculated, it observes the parent's protection left unclaimed by s1
(SWAP_CLUSTER_MAX-ε in theory) and effectively appropriates it.

What is worse, when the sibling s2 runs a more memory-greedy workload,
it can repeatedly "steal" the protection from s1, and the distribution
ends up with s1 mostly reclaimed despite its explicit prioritization
over s2.

Simply fix it by _not_ rounding up to SWAP_CLUSTER_MAX. The rounding
would have saved us ~5 levels of reclaim priority, i.e. we may have been
reclaiming from protected memcgs at relatively low priority _without_
counting any memory.events:low (due to overreclaim). Now, if the
moderated scan is not enough, we must bring the priority to zero to open
the protected reserves. And that's correct: we want to be explicit when
reclaiming those.


Fixes: 8a931f801340 ("mm: memcontrol: recursive memory.low protection")
Fixes: 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
Reported-by: Richard Palethorpe <rpalethorpe@xxxxxxxx>
Link: https://lore.kernel.org/all/20220321101429.3703-1-rpalethorpe@xxxxxxxx/
Signed-off-by: Michal Koutný <mkoutny@xxxxxxxx>
---
 mm/vmscan.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1678802e03e7..cd760842b9ad 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2798,13 +2798,6 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 
 			scan = lruvec_size - lruvec_size * protection /
 				(cgroup_size + 1);
-
-			/*
-			 * Minimally target SWAP_CLUSTER_MAX pages to keep
-			 * reclaim moving forwards, avoiding decrementing
-			 * sc->priority further than desirable.
-			 */
-			scan = max(scan, SWAP_CLUSTER_MAX);
 		} else {
 			scan = lruvec_size;
 		}
-- 
2.35.3





