On Fri, Jan 22, 2021 at 10:43 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote: > > This reverts commit 536d3bf261a2fc3b05b3e91e7eef7383443015cf, as it > can cause writers to memory.high to get stuck in the kernel forever, > performing page reclaim and consuming excessive amounts of CPU cycles. > > Before the patch, a write to memory.high would first put the new limit > in place for the workload, and then reclaim the requested delta. After > the patch, the kernel tries to reclaim the delta before putting the > new limit into place, in order to not overwhelm the workload with a > sudden, large excess over the limit. However, if reclaim is actively > racing with new allocations from the uncurbed workload, it can keep > the write() working inside the kernel indefinitely. > > This is causing problems in Facebook production. A privileged > system-level daemon that adjusts memory.high for various workloads > running on a host can get unexpectedly stuck in the kernel and > essentially turn into a sort of involuntary kswapd for one of the > workloads. We've observed that daemon busy-spin in a write() for > minutes at a time, neglecting its other duties on the system, and > expending privileged system resources on behalf of a workload. > > To remedy this, we have first considered changing the reclaim logic to > break out after a couple of loops - whether the workload has converged > to the new limit or not - and bound the write() call this way. > However, the root cause that inspired the sequence change in the first > place has been fixed through other means, and so a revert back to the > proven limit-setting sequence, also used by memory.max, is preferable. > > The sequence was changed to avoid extreme latencies in the workload > when the limit was lowered: the sudden, large excess created by the > limit lowering would erroneously trigger the penalty sleeping code > that is meant to throttle excessive growth from below. Allocating > threads could end up sleeping long after the write() had already > reclaimed the delta for which they were being punished. > > However, erroneous throttling also caused problems in other scenarios > at around the same time. This resulted in commit b3ff92916af3 ("mm, > memcg: reclaim more aggressively before high allocator throttling"), > included in the same release as the offending commit. When allocating > threads now encounter large excess caused by a racing write() to > memory.high, instead of entering punitive sleeps, they will simply be > tasked with helping reclaim down the excess, and will be held no > longer than it takes to accomplish that. This is in line with regular > limit enforcement - i.e. if the workload allocates up against or over > an otherwise unchanged limit from below. > > With the patch breaking userspace, and the root cause addressed by > other means already, revert it again. > > Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high") > Cc: <stable@xxxxxxxxxxxxxxx> # 5.8+ > Reported-by: Tejun Heo <tj@xxxxxxxxxx> > Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx> Reviewed-by: Shakeel Butt <shakeelb@xxxxxxxxxx>