On 03/20/2018 06:29 PM, Michal Hocko wrote:
>> Leave all pgdat->flags manipulations to kswapd. kswapd scans the whole
>> pgdat, so it's reasonable to leave all decisions about node state
>> to kswapd. Also add per-cgroup congestion state to avoid needlessly
>> burning CPU in cgroup reclaim if heavy congestion is observed.
>>
>> Currently there is no need for per-cgroup PGDAT_WRITEBACK and PGDAT_DIRTY
>> bits since they alter only kswapd behavior.
>>
>> The problem could be easily demonstrated by creating heavy congestion
>> in one cgroup:
>>
>>     echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
>>     mkdir -p /sys/fs/cgroup/congester
>>     echo 512M > /sys/fs/cgroup/congester/memory.max
>>     echo $$ > /sys/fs/cgroup/congester/cgroup.procs
>>     /* generate a lot of dirty data on slow HDD */
>>     while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
>>     ....
>>     while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
>>
>> and some job in another cgroup:
>>
>>     mkdir /sys/fs/cgroup/victim
>>     echo 128M > /sys/fs/cgroup/victim/memory.max
>>
>>     # time cat /dev/sda > /dev/null
>>     real    10m15.054s
>>     user    0m0.487s
>>     sys     1m8.505s
>>
>> According to the tracepoint in wait_iff_congested(), the 'cat' spent 50%
>> of the time sleeping there.
>>
>> With the patch, cat doesn't waste time anymore:
>>
>>     # time cat /dev/sda > /dev/null
>>     real    5m32.911s
>>     user    0m0.411s
>>     sys     0m56.664s
>>
>> Signed-off-by: Andrey Ryabinin <aryabinin@xxxxxxxxxxxxx>
>> ---
>>  include/linux/backing-dev.h |  2 +-
>>  include/linux/memcontrol.h  |  2 ++
>>  mm/backing-dev.c            | 19 ++++------
>>  mm/vmscan.c                 | 84 ++++++++++++++++++++++++++++++++-------------
>>  4 files changed, 70 insertions(+), 37 deletions(-)
>
> This patch seems overly complicated. Why don't you simply reduce the whole
> pgdat_flags handling to global_reclaim()?
>

In that case cgroup2 reclaim wouldn't have any way of throttling if the
cgroup is full of congested dirty pages.