On Wed, Nov 23, 2022 at 4:40 PM Roman Gushchin <roman.gushchin@xxxxxxxxx> wrote:
>
> On Wed, Nov 23, 2022 at 09:21:30AM +0000, Yosry Ahmed wrote:
> > During reclaim, mem_cgroup_calculate_protection() is used to determine
> > the effective protection (emin and elow) values of a memcg. The
> > protection of the reclaim target is ignored, but we cannot set their
> > effective protection to 0 due to a limitation of the current
> > implementation (see comment in mem_cgroup_protection()). Instead,
> > we leave their effective protection values unchanged, and later ignore it
> > in mem_cgroup_protection().
> >
> > However, mem_cgroup_protection() is called later in
> > shrink_lruvec()->get_scan_count(), which is after the
> > mem_cgroup_below_{min/low}() checks in shrink_node_memcgs(). As a
> > result, the stale effective protection values of the target memcg may
> > lead us to skip reclaiming from the target memcg entirely, before
> > calling shrink_lruvec(). This can be even worse with recursive
> > protection, where the stale target memcg protection can be higher than
> > its standalone protection. See two examples below (a similar version of
> > example (a) is added to test_memcontrol in a later patch).
> >
> > (a) A simple example with proactive reclaim is as follows. Consider the
> > following hierarchy:
> >     ROOT
> >      |
> >      A
> >      |
> >      B (memory.min = 10M)
> >
> > Consider the following scenario:
> > - B has memory.current = 10M.
> > - The system undergoes global reclaim (or memcg reclaim in A).
> > - In shrink_node_memcgs():
> >   - mem_cgroup_calculate_protection() calculates the effective min (emin)
> >     of B as 10M.
> >   - mem_cgroup_below_min() returns true for B, we do not reclaim from B.
> > - Now if we want to reclaim 5M from B using proactive reclaim
> >   (memory.reclaim), we should be able to, as the protection of the
> >   target memcg should be ignored.
> > - In shrink_node_memcgs():
> >   - mem_cgroup_calculate_protection() immediately returns for B without
> >     doing anything, as B is the target memcg, relying on
> >     mem_cgroup_protection() to ignore B's stale effective min (still 10M).
> >   - mem_cgroup_below_min() reads the stale effective min for B and we
> >     skip it instead of ignoring its protection as intended, as we never
> >     reach mem_cgroup_protection().
> >
> > (b) A more complex example with recursive protection is as follows.
> > Consider the following hierarchy with memory_recursiveprot:
> >     ROOT
> >      |
> >      A (memory.min = 50M)
> >      |
> >      B (memory.min = 10M, memory.high = 40M)
> >
> > Consider the following scenario:
> > - B has memory.current = 35M.
> > - The system undergoes global reclaim (target memcg is NULL).
> > - B will have an effective min of 50M (all of A's unclaimed protection).
> > - B will not be reclaimed from.
> > - Now allocate 10M more memory in B, pushing it above its high limit.
> > - The system undergoes memcg reclaim from B (target memcg is B).
> > - Like example (a), we do nothing in mem_cgroup_calculate_protection(),
> >   then call mem_cgroup_below_min(), which will read the stale effective
> >   min for B (50M) and skip it. In this case, it's even worse because we
> >   are not just considering B's standalone protection (10M), but we are
> >   reading a much higher stale protection (50M) which will cause us to not
> >   reclaim from B at all.
> >
> > This is an artifact of commit 45c7f7e1ef17 ("mm, memcg: decouple
> > e{low,min} state mutations from protection checks") which made
> > mem_cgroup_calculate_protection() only change the state without
> > returning any value. Before that commit, we used to return
> > MEMCG_PROT_NONE for the target memcg, which would cause us to skip the
> > mem_cgroup_below_{min/low}() checks. After that commit we do not return
> > anything and we end up checking the min & low effective protections for
> > the target memcg, which are stale.
> >
> > Update mem_cgroup_supports_protection() to also check if we are
> > reclaiming from the target, and rename it to mem_cgroup_unprotected()
> > (now returns true if we should not protect the memcg, much simpler logic).
> >
> > Fixes: 45c7f7e1ef17 ("mm, memcg: decouple e{low,min} state mutations from protection checks")
> > Signed-off-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
>
> Reviewed-by: Roman Gushchin <roman.gushchin@xxxxxxxxx>
>
> Thank you!

Thanks for reviewing! Do you think we need a CC to stable here?
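
For readers following the thread, example (a) can be modelled in a few lines
of user-space C. This is only a sketch under stated assumptions, not the
kernel code: struct toy_memcg and every toy_* function are hypothetical
stand-ins, and the ">=" comparison between the effective min and the usage is
an assumption chosen to match the behaviour described in example (a).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mem_cgroup: only what the sketch needs. */
struct toy_memcg {
	bool is_root;
	unsigned long emin;	/* effective min; may be stale for the target */
	unsigned long usage;	/* memory.current */
};

/*
 * Models the renamed helper described in the patch: returns true when the
 * memcg's protection must not be honoured, i.e. for the root memcg or for
 * the reclaim target itself.
 */
static bool toy_unprotected(const struct toy_memcg *target,
			    const struct toy_memcg *memcg)
{
	return memcg->is_root || memcg == target;
}

/* Pre-fix behaviour as described: the reclaim target is not consulted. */
static bool toy_below_min_old(const struct toy_memcg *memcg)
{
	if (memcg->is_root)
		return false;
	return memcg->emin >= memcg->usage;	/* assumed comparison */
}

/* Post-fix behaviour: the target's (possibly stale) emin is ignored. */
static bool toy_below_min(const struct toy_memcg *target,
			  const struct toy_memcg *memcg)
{
	if (toy_unprotected(target, memcg))
		return false;
	return memcg->emin >= memcg->usage;	/* assumed comparison */
}

/* Walks through example (a): B has memory.min = 10M and uses 10M. */
void toy_example_a(void)
{
	struct toy_memcg b = {
		.is_root = false,
		.emin = 10UL << 20,	/* left over from the global reclaim pass */
		.usage = 10UL << 20,
	};

	/* Global reclaim (target is NULL): B is protected, as intended. */
	assert(toy_below_min(NULL, &b));

	/* Proactive reclaim on B, old check: stale emin makes us skip B. */
	assert(toy_below_min_old(&b));

	/* Proactive reclaim on B, new check: B is the target, so we reclaim. */
	assert(!toy_below_min(&b, &b));
}
```

Calling toy_example_a() from a test harness exercises all three cases. The
same model covers example (b): with a stale emin of 50M and a usage of 45M,
the old check still skips B while the new one does not.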