Re: [PATCH v2] memcg: allow exiting tasks to write back data to swap

Balbir Singh <balbirs@xxxxxxxxxx> · Fri, 13 Dec 2024 13:54:10 +1100

On 12/13/24 07:00, Rik van Riel wrote:
> On Thu, 12 Dec 2024 18:31:57 +0000
> Roman Gushchin <roman.gushchin@xxxxxxxxx> wrote:
> 
>> Is it about a single task or groups of tasks or the entire cgroup?
>> If former, why it's a problem? A tight memcg limit can slow things down
>> in general and I don't see why we should treat the exit() path differently.
>>
> I think the exit path does need to be treated a little differently,
> since this exit may be the only way such a cgroup can free up memory.
> 
>> If it's about the entire cgroup and we have essentially a deadlock,
>> I feel like we need to look into the oom reaper side.
> 
> You mean something like the below?
> 
> I have not tested it yet, because we don't have any stuck
> cgroups right now among the workloads that I'm monitoring.
> 
> ---8<---
> 
> From c0e545fd45bd3ee24cd79b3d3e9b375e968ef460 Mon Sep 17 00:00:00 2001
> From: Rik van Riel <riel@xxxxxxxxxxx>
> Date: Thu, 12 Dec 2024 14:50:49 -0500
> Subject: [PATCH] memcg,oom: speed up reclaim for exiting tasks
> 
> When a memcg reaches its memory limit, and reclaim becomes unavailable
> or slow for some reason, for example only zswap is available, but zswap
> writeback is disabled, it can take a long time for tasks to exit, and
> for the cgroup to get back to normal (or cleaned up).
> 
> Speed up memcg reclaim for exiting tasks by limiting how much work
> reclaim does, and by invoking the OOM reaper if reclaim does not
> free up enough memory to allow the task to make progress.
> 
> Signed-off-by: Rik van Riel <riel@xxxxxxxxxxx>
> ---
>  include/linux/oom.h |  8 ++++++++
>  mm/memcontrol.c     | 11 +++++++++++
>  mm/oom_kill.c       |  6 +-----
>  3 files changed, 20 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 1e0fc6931ce9..b2d9cf936664 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -111,4 +111,12 @@ extern void oom_killer_enable(void);
>  
>  extern struct task_struct *find_lock_task_mm(struct task_struct *p);
>  
> +#ifdef CONFIG_MMU
> +extern void queue_oom_reaper(struct task_struct *tsk);
> +#else
> +static intern void queue_oom_reaper(struct task_struct *tsk)
> +{
> +}
> +#endif
> +
>  #endif /* _INCLUDE_LINUX_OOM_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7b3503d12aaf..21f42758d430 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2231,6 +2231,9 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (!gfpflags_allow_blocking(gfp_mask))
>  		goto nomem;
>  
> +	if (unlikely(current->flags & PF_EXITING))
> +		gfp_mask |= __GFP_NORETRY;
> +

if (task_is_dying())
	gfp_mask |= __GFP_NORETRY

>  	memcg_memory_event(mem_over_limit, MEMCG_MAX);
>  	raised_max_event = true;
>  
> @@ -2284,6 +2287,14 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  		goto retry;
>  	}
>  nomem:
> +	/*
> +	 * We ran out of memory while inside exit. Maybe the OOM
> +	 * reaper can help reduce cgroup memory use and get things
> +	 * moving again?
> +	 */
> +	if (unlikely(current->flags & PF_EXITING))
> +		queue_oom_reaper(current);
> +

I am not sure this is helpful without task_will_free_mem(), the dying
task shouldn't get sent to the OOM killer when we run out of memory.

I did notice that we have heuristics around task_is_dying() and
passed_oom, sounds like the end result of your changes would be to
ignore the heuristics of passed_oom

Balbir Singh.