On Thu, Jan 11, 2024 at 5:29 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote: > > While investigating hosts with high cgroup memory pressures, Tejun > found culprit zombie tasks that had were holding on to a lot of > memory, had SIGKILL pending, but were stuck in memory.high reclaim. > > In the past, we used to always force-charge allocations from tasks > that were exiting in order to accelerate them dying and freeing up > their rss. This changed for memory.max in a4ebf1b6ca1e ("memcg: > prohibit unconditional exceeding the limit of dying tasks"); it noted > that this can cause (userspace inducable) containment failures, so it > added a mandatory reclaim and OOM kill cycle before forcing charges. > At the time, memory.high enforcement was handled in the userspace > return path, which isn't reached by dying tasks, and so memory.high > was still never enforced by dying tasks. > > When c9afe31ec443 ("memcg: synchronously enforce memory.high for large > overcharges") added synchronous reclaim for memory.high, it added > unconditional memory.high enforcement for dying tasks as well. The > callstack shows that this path is where the zombie is stuck in. > > We need to accelerate dying tasks getting past memory.high, but we > cannot do it quite the same way as we do for memory.max: memory.max is > enforced strictly, and tasks aren't allowed to move past it without > FIRST reclaiming and OOM killing if necessary. This ensures very small > levels of excess. With memory.high, though, enforcement happens lazily > after the charge, and OOM killing is never triggered. A lot of > concurrent threads could have pushed, or could actively be pushing, > the cgroup into excess. The dying task will enter reclaim on every > allocation attempt, with little hope of restoring balance. > > To fix this, skip synchronous memory.high enforcement on dying tasks > altogether again. Update memory.high path documentation while at it. > > Fixes: c9afe31ec443 ("memcg: synchronously enforce memory.high for large overcharges") > Reported-by: Tejun Heo <tj@xxxxxxxxxx> > Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx> LGTM with a couple of nits below: Reviewed-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx> > --- > mm/memcontrol.c | 24 +++++++++++++++++++++--- > 1 file changed, 21 insertions(+), 3 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 73692cd8c142..aca879995022 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2603,8 +2603,9 @@ static unsigned long calculate_high_delay(struct mem_cgroup *memcg, > } > > /* > - * Scheduled by try_charge() to be executed from the userland return path > - * and reclaims memory over the high limit. > + * Reclaims memory over the high limit. Called directly from > + * try_charge() when possible, but also scheduled to be called from > + * the userland return path where reclaim is always able to block. > */ nit: The term "scheduled" here is deceptive imo, it makes me think of queue_work() and friends, when it is directly called from resume_user_mode_work(). Can we change the terminology to "called from the userland return path" or directly reference resume_user_mode_work() instead? Same applies to the added comment below in try_charge_memcg(). nit: "when possible" is not entirely accurate, it makes it seem like we call mem_cgroup_handle_over_high() whenever we can (which means gfpflags_allow_blocking() imo). We actually choose not to call it in some situations, and this patch is adding one such situation. So perhaps "when possible and desirable" or just "when appropriate". > void mem_cgroup_handle_over_high(gfp_t gfp_mask) > { > @@ -2673,6 +2674,9 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask) > } > > /* > + * Reclaim didn't manage to push usage below the limit, slow > + * this allocating task down. > + * > * If we exit early, we're guaranteed to die (since > * schedule_timeout_killable sets TASK_KILLABLE). This means we don't > * need to account for any ill-begotten jiffies to pay them off later. > @@ -2867,8 +2871,22 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, > } > } while ((memcg = parent_mem_cgroup(memcg))); > > + /* > + * Reclaim is scheduled for the userland return path already, > + * but also attempt synchronous reclaim to avoid excessive > + * overrun while the task is still inside the kernel. If this > + * is successful, the return path will see it when it rechecks > + * the overage, and simply bail out. > + * > + * Skip if the task is already dying, though. Unlike > + * memory.max, memory.high enforcement isn't as strict, and > + * there is no OOM killer involved, which means the excess > + * could already be much bigger (and still growing) than it > + * could for memory.max; the dying task could get stuck in > + * fruitless reclaim for a long time, which isn't desirable. > + */ > if (current->memcg_nr_pages_over_high > MEMCG_CHARGE_BATCH && > - !(current->flags & PF_MEMALLOC) && > + !(current->flags & PF_MEMALLOC) && !task_is_dying() && > gfpflags_allow_blocking(gfp_mask)) { > mem_cgroup_handle_over_high(gfp_mask); > } > -- > 2.43.0 > >