Re: [PATCH] mm,oom: Teach lockdep about oom_lock.

On Tue 12-03-19 23:06:33, Tetsuo Handa wrote:
[...]
> From 250bbe28bc3e9946992d960bb90a351a896a543b Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
> Date: Tue, 12 Mar 2019 22:58:41 +0900
> Subject: [PATCH v3] mm,oom: Teach lockdep about oom_lock.
> 
> Since a thread which has succeeded in taking oom_lock must not involve
> blocking memory allocations, teach lockdep to consider that blocking
> memory allocations might wait for oom_lock, at as early a location as
> possible.
>
> Lockdep can't detect the possibility of a deadlock when
> mutex_trylock(&oom_lock) fails, for we assume that somebody else is
> still able to make forward progress. Thus, teach lockdep to treat
> mutex_trylock(&oom_lock) as mutex_lock(&oom_lock).
> 
> Since the OOM killer is disabled when __oom_reap_task_mm() is in progress,
> a thread which is calling __oom_reap_task_mm() must not involve blocking
> memory allocations. Thus, teach lockdep about that.
> 
> This patch should not cause lockdep splats unless there is somebody doing
> dangerous things (e.g. from OOM notifiers, from the OOM reaper).

This is obviously subjective but I find this still really hard to
understand.  What about
"
The OOM killer path, which holds oom_lock, is not allowed to invoke any
blocking allocation because this might lead to a deadlock: any forward
progress is blocked, since __alloc_pages_may_oom always bails out on
the trylock and the current lockdep implementation doesn't recognize
that as a potential deadlock. Teach the lockdep infrastructure about
this dependency and warn as early in the allocation path as possible.

The implementation basically mimics GFP_NOFS/GFP_NOIO lockdep
implementation except that oom_lock dependency map is used directly.

Please note that the oom_reaper guarantees forward progress in case the
oom victim cannot exit on its own, and as such it cannot depend on any
blocking allocation either. Therefore mark its execution as if it were
holding the oom_lock as well.
"
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
> ---
>  include/linux/oom.h | 12 ++++++++++++
>  mm/oom_kill.c       | 28 +++++++++++++++++++++++++++-
>  mm/page_alloc.c     | 16 ++++++++++++++++
>  3 files changed, 55 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index d079920..04aa46b 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -56,6 +56,18 @@ struct oom_control {
>  
>  extern struct mutex oom_lock;
>  
> +static inline void oom_reclaim_acquire(gfp_t gfp_mask)
> +{
> +	if (gfp_mask & __GFP_DIRECT_RECLAIM)
> +		mutex_acquire(&oom_lock.dep_map, 0, 0, _THIS_IP_);
> +}
> +
> +static inline void oom_reclaim_release(gfp_t gfp_mask)
> +{
> +	if (gfp_mask & __GFP_DIRECT_RECLAIM)
> +		mutex_release(&oom_lock.dep_map, 1, _THIS_IP_);
> +}
> +
>  static inline void set_current_oom_origin(void)
>  {
>  	current->signal->oom_flag_origin = true;
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 3a24848..6f53bb6 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -513,6 +513,14 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
>  	 */
>  	set_bit(MMF_UNSTABLE, &mm->flags);
>  
> +	/*
> +	 * Since this function acts as a guarantee of forward progress, the
> +	 * current thread is not allowed to involve (even indirectly via a
> +	 * dependency) a __GFP_DIRECT_RECLAIM && !__GFP_NORETRY allocation from

As already pointed out in the previous email, the less specific about
gfp flags you are the better longterm. I would stick with a "blocking
allocation"

> +	 * this function, for such an allocation will have to wait for this
> +	 * function to complete when __alloc_pages_may_oom() is called.
> +	 */
> +	oom_reclaim_acquire(GFP_KERNEL);
>  	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
>  		if (!can_madv_dontneed_vma(vma))
>  			continue;
> @@ -544,6 +552,7 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
>  			tlb_finish_mmu(&tlb, range.start, range.end);
>  		}
>  	}
> +	oom_reclaim_release(GFP_KERNEL);
>  
>  	return ret;
>  }
> @@ -1120,8 +1129,25 @@ void pagefault_out_of_memory(void)
>  	if (mem_cgroup_oom_synchronize(true))
>  		return;
>  
> -	if (!mutex_trylock(&oom_lock))
> +	if (!mutex_trylock(&oom_lock)) {
> +		/*
> +		 * This corresponds to prepare_alloc_pages(). Lockdep will
> +		 * complain if e.g. an OOM notifier for a global OOM triggered
> +		 * the pagefault OOM path by error.
> +		 */
> +		oom_reclaim_acquire(GFP_KERNEL);
> +		oom_reclaim_release(GFP_KERNEL);
>  		return;
> +	}
> +	/*
> +	 * Teach lockdep that the current thread is not allowed to involve
> +	 * (even indirectly via a dependency) a __GFP_DIRECT_RECLAIM &&
> +	 * !__GFP_NORETRY allocation from this function, for such an
> +	 * allocation will have to wait for the completion of this function
> +	 * when __alloc_pages_may_oom() is called.
> +	 */
> +	oom_reclaim_release(GFP_KERNEL);
> +	oom_reclaim_acquire(GFP_KERNEL);

This part is not really clear to me. Why do you release&acquire when
mutex_trylock has just acquired the lock? If this is really needed then
it should be explained in the comment.

>  	out_of_memory(&oc);
>  	mutex_unlock(&oom_lock);
>  }

-- 
Michal Hocko
SUSE Labs



