On Mon, Jun 3, 2019 at 2:59 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> When applications are put into unconfigured cgroups for memory
> accounting purposes, the cgrouping itself should not change the
> behavior of the page reclaim code. We expect the VM to reclaim the
> coldest pages in the system. But right now the VM can reclaim hot
> pages in one cgroup while there is eligible cold cache in others.
>
> This is because one part of the reclaim algorithm isn't truly cgroup
> hierarchy aware: the inactive/active list balancing. That is the part
> that is supposed to protect hot cache data from one-off streaming IO.
>
> The recursive cgroup reclaim scheme will scan and rotate the physical
> LRU lists of each eligible cgroup at the same rate in a round-robin
> fashion, thereby establishing a relative order among the pages of all
> those cgroups. However, the inactive/active balancing decisions are
> made locally within each cgroup, so when a cgroup is running low on
> cold pages, its hot pages will get reclaimed - even when sibling
> cgroups have plenty of cold cache eligible in the same reclaim run.
>
> For example:
>
> [root@ham ~]# head -n1 /proc/meminfo
> MemTotal:        1016336 kB
>
> [root@ham ~]# ./reclaimtest2.sh
> Establishing 50M active files in cgroup A...
> Hot pages cached: 12800/12800 workingset-a
> Linearly scanning through 18G of file data in cgroup B:
> real    0m4.269s
> user    0m0.051s
> sys     0m4.182s
> Hot pages cached: 134/12800 workingset-a

Can you share reclaimtest2.sh as well? Maybe it could become a
selftest to monitor/test future changes.

> The streaming IO in B, which doesn't benefit from caching at all,
> pushes out most of the workingset in A.
>
> Solution
>
> This series fixes the problem by elevating inactive/active balancing
> decisions to the toplevel of the reclaim run. This is either a cgroup
> that hit its limit, or straight-up global reclaim if there is
> physical memory pressure. From there, it takes a recursive view of
> the cgroup subtree to decide whether page deactivation is necessary.
>
> In the test above, the VM will then recognize that cgroup B has
> plenty of eligible cold cache, and that the hot pages in A can be
> spared:
>
> [root@ham ~]# ./reclaimtest2.sh
> Establishing 50M active files in cgroup A...
> Hot pages cached: 12800/12800 workingset-a
> Linearly scanning through 18G of file data in cgroup B:
> real    0m4.244s
> user    0m0.064s
> sys     0m4.177s
> Hot pages cached: 12800/12800 workingset-a
>
> Implementation
>
> Whether active pages can be deactivated or not is influenced by two
> factors: the inactive list dropping below a minimum size relative to
> the active list, and the occurrence of refaults.
>
> After some cleanups and preparations, this patch series first moves
> refault detection to the reclaim root, then enforces the minimum
> inactive size based on a recursive view of the cgroup tree's LRUs.
>
> History
>
> Note that this actually never worked correctly in Linux cgroups. In
> the past it worked for global reclaim and leaf limit reclaim only (we
> used to have two physical LRU linkages per page), but it never worked
> for intermediate limit reclaim over multiple leaf cgroups.
>
> We're noticing this now because 1) we're putting everything into
> cgroups for accounting, not just the things we want to control and
> 2) we're moving away from leaf limits that invoke reclaim on
> individual cgroups, toward large tree reclaim, triggered by
> high-level limits or physical memory pressure, that is influenced by
> local protections such as memory.low and memory.min instead.
>
> Requirements
>
> These changes are based on the fast recursive memcg stats merged in
> 5.2-rc1. The patches are against v5.2-rc2-mmots-2019-05-29-20-56-12
> plus the page cache fix in https://lkml.org/lkml/2019/5/24/813.
>
>  include/linux/memcontrol.h |  37 +--
>  include/linux/mmzone.h     |  30 +-
>  include/linux/swap.h       |   2 +-
>  mm/memcontrol.c            |   6 +-
>  mm/page_alloc.c            |   2 +-
>  mm/vmscan.c                | 667 ++++++++++++++++++++++---------------------
>  mm/workingset.c            |  74 +++--
>  7 files changed, 437 insertions(+), 381 deletions(-)
>
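
For reference, this is roughly how I'd imagine such a selftest could
be structured, working backwards from the output quoted above. It is
only a sketch, not the actual script: the cgroup2 mount point, the
file names (workingset-a, stream-b) and the use of fincore(1) from
util-linux to count resident pages are all my assumptions.

#!/bin/bash
# Sketch only: reproduce the A/B scenario from the cover letter.
# Assumes cgroup2 is mounted at /sys/fs/cgroup and that an 18G file
# (/tmp/stream-b) already exists for the streaming reader.

CG=/sys/fs/cgroup
mkdir -p "$CG/A" "$CG/B"

hot_pages() {
	# fincore(1) reports how many pages of the file are resident
	# in the page cache; 50M == 12800 4k pages.
	echo "Hot pages cached: $(fincore -nr -o PAGES /tmp/workingset-a)/12800 workingset-a"
}

echo "Establishing 50M active files in cgroup A..."
echo $$ > "$CG/A/cgroup.procs"
dd if=/dev/zero of=/tmp/workingset-a bs=1M count=50 status=none
# Read the file twice so its pages are referenced repeatedly and get
# promoted to the active LRU list.
cat /tmp/workingset-a > /dev/null
cat /tmp/workingset-a > /dev/null
hot_pages

echo "Linearly scanning through 18G of file data in cgroup B:"
echo $$ > "$CG/B/cgroup.procs"
# One-off streaming read that creates cache pressure but has no
# reuse; it should not be able to push out A's hot pages.
time cat /tmp/stream-b > /dev/null

hot_pages

The number to watch would be the final "Hot pages cached" line:
before the series it drops to ~134/12800 on the 1G machine above,
with the series it should stay at 12800/12800.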