On Tue 14-07-20 08:08:32, Yafang Shao wrote:
> The commit 7775face2079 ("memcg: killed threads should not invoke memcg OOM
> killer") resolves the problem of different threads in a multi-threaded
> task doing parallel memcg oom, but it doesn't solve the problem of
> different tasks doing parallel memcg oom.
> 
> It may happen that many different tasks in the same memcg are waiting
> on oom_lock at the same time; if one of them has already made progress
> and freed enough available memory, the others don't need to trigger the
> oom killer again. Checking the memcg margin after holding oom_lock
> helps achieve that.

While the changelog makes sense, I believe it can be improved. I would
use something like the following. Feel free to use its parts.

"
Memcg oom killer invocation is synchronized by the global oom_lock.
Tasks are sleeping on the lock while somebody is selecting the victim,
or potentially racing with the oom_reaper which is releasing the
victim's memory. This can result in a pointless oom killer invocation
because a waiter might be racing with the oom_reaper:

P1                      oom_reaper              P2
                        oom_reap_task
mutex_lock(oom_lock)
out_of_memory # no victim because we have one already
                        __oom_reap_task_mm
mutex_unlock(oom_lock)
                                                mutex_lock(oom_lock)
                        set MMF_OOM_SKIP
                                                select_bad_process # finds a new victim

The page allocator prevents this race by trying to allocate again once
the lock can be acquired (in __alloc_pages_may_oom), which acts as a
last minute check. Moreover, the page allocator doesn't block on the
oom_lock at all and simply retries the whole reclaim process.

The memcg oom killer should do the last minute check as well. Call
mem_cgroup_margin to do that. A trylock on the oom_lock could be done
as well but this doesn't seem to be necessary at this stage.
"

> Suggested-by: Michal Hocko <mhocko@xxxxxxxxxx>
> Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>
> Cc: Michal Hocko <mhocko@xxxxxxxxxx>
> Cc: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
> Cc: David Rientjes <rientjes@xxxxxxxxxx>
> ---
>  mm/memcontrol.c | 19 +++++++++++++++++--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 1962232..df141e1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1560,16 +1560,31 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  		.gfp_mask = gfp_mask,
>  		.order = order,
>  	};
> -	bool ret;
> +	bool ret = true;
> 
>  	if (mutex_lock_killable(&oom_lock))
>  		return true;
> +
>  	/*
>  	 * A few threads which were not waiting at mutex_lock_killable() can
>  	 * fail to bail out. Therefore, check again after holding oom_lock.
>  	 */
> -	ret = should_force_charge() || out_of_memory(&oc);
> +	if (should_force_charge())
> +		goto out;
> +
> +	/*
> +	 * Different tasks may be doing parallel oom, so after hold the
> +	 * oom_lock the task should check the memcg margin again to check
> +	 * whether other task has already made progress.
> +	 */
> +	if (mem_cgroup_margin(memcg) >= (1 << order))
> +		goto out;

Is there any reason why you simply haven't done this? (+ your comment
which is helpful).
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 248e6cad0095..2c176825efe3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1561,15 +1561,21 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		.order = order,
 		.chosen_points = LONG_MIN,
 	};
-	bool ret;
+	bool ret = true;
 
-	if (mutex_lock_killable(&oom_lock))
+	if (!mutex_trylock(&oom_lock))
 		return true;
+
+	if (mem_cgroup_margin(memcg) >= (1 << order))
+		goto unlock;
+
 	/*
 	 * A few threads which were not waiting at mutex_lock_killable() can
 	 * fail to bail out. Therefore, check again after holding oom_lock.
 	 */
 	ret = should_force_charge() || out_of_memory(&oc);
+
+unlock:
 	mutex_unlock(&oom_lock);
 	return ret;
 }

> +
> +	ret = out_of_memory(&oc);
> +
> +out:
>  	mutex_unlock(&oom_lock);
> +
>  	return ret;
>  }
> 
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs
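
P.S. To make the "last minute check" pattern under discussion concrete,
here is a minimal, self-contained userspace sketch: several tasks race
into the oom slow path, but waiters that eventually acquire the lock
re-check whether another task has already made progress before invoking
the kill. This is illustrative only, not kernel code; available_pages,
charge_slow_path, and do_oom_kill are hypothetical stand-ins for
mem_cgroup_margin(), mem_cgroup_out_of_memory(), and out_of_memory().

/*
 * Illustrative userspace model of the last-minute-check pattern.
 * Four threads hit the "oom" slow path at once; only the first one to
 * take the lock actually kills, the rest observe the replenished
 * margin and bail out.  Build with: gcc -pthread sketch.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t oom_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_long available_pages;	/* stand-in for the memcg margin */

static bool do_oom_kill(void)
{
	/* Pretend the victim's exit released memory back to the group. */
	atomic_fetch_add(&available_pages, 256);
	puts("oom killer invoked");
	return true;
}

static bool charge_slow_path(long needed)
{
	bool ret = true;

	pthread_mutex_lock(&oom_lock);
	/*
	 * Last minute check: while this task slept on oom_lock, another
	 * task may have already killed a victim and freed enough memory,
	 * so a second kill would be pointless.
	 */
	if (atomic_load(&available_pages) >= needed)
		goto unlock;

	ret = do_oom_kill();
unlock:
	pthread_mutex_unlock(&oom_lock);
	return ret;
}

static void *worker(void *arg)
{
	(void)arg;
	charge_slow_path(64);	/* each thread "needs" 64 pages */
	return NULL;
}

int main(void)
{
	pthread_t t[4];

	for (int i = 0; i < 4; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (int i = 0; i < 4; i++)
		pthread_join(t[i], NULL);
	/* Expected: "oom killer invoked" is printed exactly once. */
	return 0;
}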