On Tue, Nov 22, 2022 at 11:27:21PM +0000, Yosry Ahmed wrote:
> During reclaim, mem_cgroup_calculate_protection() is used to determine
> the effective protection (emin and elow) values of a memcg. The
> protection of the reclaim target is ignored, but we cannot set its
> effective protection to 0 due to a limitation of the current
> implementation (see comment in mem_cgroup_protection()). Instead,
> we leave its effective protection values unchanged, and later ignore
> them in mem_cgroup_protection().
>
> However, mem_cgroup_protection() is called later in
> shrink_lruvec()->get_scan_count(), which is after the
> mem_cgroup_below_{min/low}() checks in shrink_node_memcgs(). As a
> result, the stale effective protection values of the target memcg may
> lead us to skip reclaiming from the target memcg entirely, before
> calling shrink_lruvec(). This can be even worse with recursive
> protection, where the stale target memcg protection can be higher than
> its standalone protection.
>
> An example where this can happen is as follows. Consider the following
> hierarchy with memory_recursiveprot:
> ROOT
>  |
>  A (memory.min = 50M)
>  |
>  B (memory.min = 10M, memory.high = 40M)
>
> Consider the following scenario:
> - B has memory.current = 35M.
> - The system undergoes global reclaim (target memcg is NULL).
> - B will have an effective min of 50M (all of A's unclaimed protection).
> - B will not be reclaimed from.
> - Now allocate 10M more memory in B, pushing it above its high limit.
> - The system undergoes memcg reclaim from B (target memcg is B).
> - In shrink_node_memcgs(), we call mem_cgroup_calculate_protection(),
>   which immediately returns for B without doing anything, as B is the
>   target memcg, relying on mem_cgroup_protection() to ignore B's stale
>   effective min (still 50M).
> - Directly after mem_cgroup_calculate_protection(), we will call
>   mem_cgroup_below_min(), which will read the stale effective min for B
>   and skip it (instead of ignoring its protection as intended). In this
>   case, it's really bad because we are not just considering B's
>   standalone protection (10M), but we are reading a much higher stale
>   protection (50M) which will cause us to not reclaim from B at all.
>
> This is an artifact of commit 45c7f7e1ef17 ("mm, memcg: decouple
> e{low,min} state mutations from protection checks") which made
> mem_cgroup_calculate_protection() only change the state without
> returning any value. Before that commit, we used to return
> MEMCG_PROT_NONE for the target memcg, which would cause us to skip the
> mem_cgroup_below_{min/low}() checks. After that commit we do not return
> anything and we end up checking the min & low effective protections for
> the target memcg, which are stale.
>
> Add mem_cgroup_ignore_protection() that checks if we are reclaiming from
> the target memcg, and call it in mem_cgroup_below_{min/low}() to ignore
> the stale protection of the target memcg.
>
> Fixes: 45c7f7e1ef17 ("mm, memcg: decouple e{low,min} state mutations from protection checks")
> Signed-off-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx>

Great catch! The fix looks good to me, only a couple of cosmetic
suggestions.
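If I read the patch right, the net effect in shrink_node_memcgs() is
roughly the following (a paraphrased sketch, not the actual hunk; I'm
assuming mem_cgroup_below_{min,low}() now take the reclaim target as an
extra argument, as the mm/vmscan.c part of the diffstat suggests):

	mem_cgroup_calculate_protection(target_memcg, memcg);

	/*
	 * With the fix, the target memcg's stale emin/elow are ignored
	 * right here instead of only later in mem_cgroup_protection(),
	 * so B in the example above is no longer skipped.
	 */
	if (mem_cgroup_below_min(target_memcg, memcg)) {
		/* Hard protection: never reclaim below memory.min. */
		continue;
	} else if (mem_cgroup_below_low(target_memcg, memcg)) {
		/* Soft protection: reclaimable under pressure. */
		...
	}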
> ---
>  include/linux/memcontrol.h | 33 +++++++++++++++++++++++++++------
>  mm/vmscan.c                | 11 ++++++-----
>  2 files changed, 33 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index e1644a24009c..22c9c9f9c6b1 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -625,18 +625,32 @@ static inline bool mem_cgroup_supports_protection(struct mem_cgroup *memcg)
>
>  }
>
> -static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
> +static inline bool mem_cgroup_ignore_protection(struct mem_cgroup *target,
> +						struct mem_cgroup *memcg)
>  {
> -	if (!mem_cgroup_supports_protection(memcg))

How about merging mem_cgroup_supports_protection() and your new helper
into something like mem_cgroup_possibly_protected()? It seems like they
are never used separately and unlikely ever will be. Also, I'd swap the
target and memcg arguments.

Thank you!

PS If it's not too hard, please consider adding a new kselftest to
cover this case. Thank you!
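To make the suggestion a bit more concrete, something like the sketch
below is what I have in mind (the name mem_cgroup_possibly_protected()
and the exact argument order are just my guess, feel free to pick
better ones; I'm keeping the two existing checks as calls here only for
brevity -- in the real patch their bodies would simply be folded in):

static inline bool mem_cgroup_possibly_protected(struct mem_cgroup *memcg,
						 struct mem_cgroup *target)
{
	/*
	 * Single entry point replacing mem_cgroup_supports_protection()
	 * and mem_cgroup_ignore_protection(): a memcg can only be
	 * protected if protection is supported at all and it is not the
	 * reclaim target itself (whose emin/elow may be stale).
	 */
	return mem_cgroup_supports_protection(memcg) &&
	       !mem_cgroup_ignore_protection(target, memcg);
}

Then mem_cgroup_below_{min,low}() would start with:

	if (!mem_cgroup_possibly_protected(memcg, target))
		return false;
	/* ... existing emin/elow comparison unchanged ... */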