On 2019/03/08 20:58, Michal Hocko wrote:
> OK, that makes sense to me. I cannot judge the implementation because I
> am not really familiar with lockdep machinery. Could you explain how it
> doesn't trigger for all other allocations?

This is the same reason why fs_reclaim_acquire()/fs_reclaim_release() does
not trigger for all other allocations. Any allocation request which might
involve __GFP_FS reclaim passes "struct lockdep_map __fs_reclaim_map", and
lockdep records it. (A simplified sketch of this pattern is included after
the patch below, for reference.)

> Also why it is not sufficient to add the lockdep annotation prior to the
> trylock in __alloc_pages_may_oom?

This is the same reason why fs_reclaim_acquire()/fs_reclaim_release() is
called from prepare_alloc_pages(). If an allocation request which might
involve __GFP_FS reclaim succeeded before actually calling
__perform_reclaim(), we fail to pass "struct lockdep_map __fs_reclaim_map"
(which makes it difficult to check whether there is a possibility of
deadlock). Likewise, if an allocation request which might call
__alloc_pages_may_oom() succeeded before actually calling
__alloc_pages_may_oom(), we fail to pass oom_lock.lockdep_map (which makes
it difficult to check whether there is a possibility of deadlock).

Strictly speaking, there is the

  if (tsk_is_oom_victim(current) &&
      (alloc_flags == ALLOC_OOM || (gfp_mask & __GFP_NOMEMALLOC)))
          goto nopage;

case where failing to hold oom_lock at __alloc_pages_may_oom() does not
cause a problem. But I think that we should not check tsk_is_oom_victim()
at prepare_alloc_pages().

> It would be also great to pull it out of the code flow and hide it
> behind a helper static inline. Something like
> lockdep_track_oom_alloc_reentrant or a like.

OK. Here is the v2 patch.

>From ec8d0accf15b4566c065ca8c63a4e1185f0a0c78 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
Date: Sat, 9 Mar 2019 09:55:08 +0900
Subject: [PATCH v2] mm,oom: Teach lockdep about oom_lock.

Since a thread which has succeeded in holding oom_lock must not involve
blocking memory allocations, teach lockdep to consider that blocking
memory allocations might wait for oom_lock at as early a location as
possible, and teach lockdep to consider that oom_lock is held by
mutex_lock() rather than by mutex_trylock().

Also, since the OOM killer is disabled until the OOM reaper or exit_mmap()
sets MMF_OOM_SKIP, teach lockdep to consider that oom_lock is held when
__oom_reap_task_mm() is called.

This patch should not cause lockdep splats unless there is somebody doing
dangerous things (e.g. from OOM notifiers, from the OOM reaper).
Signed-off-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
---
 include/linux/oom.h | 16 ++++++++++++++++
 mm/oom_kill.c       |  9 ++++++++-
 mm/page_alloc.c     |  5 +++++
 3 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index d079920..8544c23 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -56,6 +56,22 @@ struct oom_control {
 
 extern struct mutex oom_lock;
 
+static inline void oom_reclaim_acquire(gfp_t gfp_mask, unsigned int order)
+{
+	if ((gfp_mask & __GFP_DIRECT_RECLAIM) && !(gfp_mask & __GFP_NORETRY) &&
+	    (!(gfp_mask & __GFP_RETRY_MAYFAIL) ||
+	     order <= PAGE_ALLOC_COSTLY_ORDER))
+		mutex_acquire(&oom_lock.dep_map, 0, 0, _THIS_IP_);
+}
+
+static inline void oom_reclaim_release(gfp_t gfp_mask, unsigned int order)
+{
+	if ((gfp_mask & __GFP_DIRECT_RECLAIM) && !(gfp_mask & __GFP_NORETRY) &&
+	    (!(gfp_mask & __GFP_RETRY_MAYFAIL) ||
+	     order <= PAGE_ALLOC_COSTLY_ORDER))
+		mutex_release(&oom_lock.dep_map, 1, _THIS_IP_);
+}
+
 static inline void set_current_oom_origin(void)
 {
 	current->signal->oom_flag_origin = true;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3a24848..11be7da 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -513,6 +513,7 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
 	 */
 	set_bit(MMF_UNSTABLE, &mm->flags);
 
+	oom_reclaim_acquire(GFP_KERNEL, 0);
 	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
 		if (!can_madv_dontneed_vma(vma))
 			continue;
@@ -544,6 +545,7 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
 			tlb_finish_mmu(&tlb, range.start, range.end);
 		}
 	}
+	oom_reclaim_release(GFP_KERNEL, 0);
 
 	return ret;
 }
@@ -1120,8 +1122,13 @@ void pagefault_out_of_memory(void)
 	if (mem_cgroup_oom_synchronize(true))
 		return;
 
-	if (!mutex_trylock(&oom_lock))
+	if (!mutex_trylock(&oom_lock)) {
+		oom_reclaim_acquire(GFP_KERNEL, 0);
+		oom_reclaim_release(GFP_KERNEL, 0);
 		return;
+	}
+	oom_reclaim_release(GFP_KERNEL, 0);
+	oom_reclaim_acquire(GFP_KERNEL, 0);
 	out_of_memory(&oc);
 	mutex_unlock(&oom_lock);
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6d0fa5b..e8853a19 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3793,6 +3793,8 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 		schedule_timeout_uninterruptible(1);
 		return NULL;
 	}
+	oom_reclaim_release(gfp_mask, order);
+	oom_reclaim_acquire(gfp_mask, order);
 
 	/*
 	 * Go through the zonelist yet one more time, keep very high watermark
@@ -4651,6 +4653,9 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 	fs_reclaim_acquire(gfp_mask);
 	fs_reclaim_release(gfp_mask);
 
+	oom_reclaim_acquire(gfp_mask, order);
+	oom_reclaim_release(gfp_mask, order);
+
 	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
 
 	if (should_fail_alloc_page(gfp_mask, order))
-- 
1.8.3.1
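
For reference, the mechanism relied on in the explanation above is the same
one the existing fs_reclaim annotation uses: a dedicated lockdep map is
acquired and released around every allocation that could enter __GFP_FS
direct reclaim, so lockdep records the dependency even when reclaim does not
actually run. A simplified sketch of that pattern (abbreviated from
mm/page_alloc.c; the real check has a few more conditions, e.g.
__GFP_NOLOCKDEP handling) looks like this:

  #include <linux/lockdep.h>
  #include <linux/gfp.h>
  #include <linux/sched.h>

  static struct lockdep_map __fs_reclaim_map =
  	STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map);

  /* Only allocations which may enter __GFP_FS direct reclaim are recorded. */
  static bool need_fs_reclaim(gfp_t gfp_mask)
  {
  	return (gfp_mask & __GFP_DIRECT_RECLAIM) && (gfp_mask & __GFP_FS) &&
  	       !(current->flags & PF_MEMALLOC);
  }

  void fs_reclaim_acquire(gfp_t gfp_mask)
  {
  	/* Tell lockdep "this allocation might take the fs_reclaim 'lock'". */
  	if (need_fs_reclaim(gfp_mask))
  		lock_map_acquire(&__fs_reclaim_map);
  }

  void fs_reclaim_release(gfp_t gfp_mask)
  {
  	if (need_fs_reclaim(gfp_mask))
  		lock_map_release(&__fs_reclaim_map);
  }

oom_reclaim_acquire()/oom_reclaim_release() in the patch apply the same idea
to oom_lock.dep_map, with the gfp/order check selecting only those
allocations that can reach __alloc_pages_may_oom().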