On Wed 27-02-19 19:39:19, Tetsuo Handa wrote:
> On 2019/02/27 18:21, Michal Hocko wrote:
> > On Wed 27-02-19 12:43:51, Tetsuo Handa wrote:
> >> I noticed that when a kdump kernel triggers the OOM killer because a too
> >> small value was given to crashkernel= parameter, the OOM reaper tends to
> >> fail to reclaim memory from OOM victims because they are in dup_mm() from
> >> copy_mm() from copy_process() with mmap_sem held for write.
> >
> > I would presume that a page table allocation would fail for the oom
> > victim as soon as the oom memory reserves get depleted and then
> > copy_page_range would bail out and release the lock. That being
> > said, the oom_reaper might bail out before then but does sprinkling
> > fatal_signal_pending checks into copy_*_range really help reliably?
> >
>
> Yes, I think so. The OOM victim was just sleeping at might_sleep_if()
> rather than continue allocations until ALLOC_OOM allocation fails.
> Maybe the kdump kernel enables only one CPU somehow contributed that
> the OOM reaper gave up before ALLOC_OOM allocation fails. But if the OOM
> victim in a normal kernel had huge memory mapping where p?d_alloc() is
> called for so many times, and kernel frequently prevented the OOM victim
> from continuing ALLOC_OOM allocations, it might not be rare cases (I
> don't have a huge machine for testing intensive p?d_alloc() loop) to
> hit this problem.

We cannot do anything about the preemption so that is moot. ALLOC_OOM
reserve is limited so the failure should happen sooner or later. But I
would be OK to check for fatal_signal_pending once per pmd or so if that
helps and it doesn't add a noticeable overhead.

> Technically, it would be possible to use a per task_struct flag
> which allows __alloc_pages_nodemask() to check early and bail out:
>
>   down_write(&current->mm->mmap_sem);
>   current->no_oom_alloc = 1;
>   while (...) {
>       p?d_alloc();
>   }
>   current->no_oom_alloc = 0;
>   up_write(&current->mm->mmap_sem);

Looks like a hack to me. We already do have __GFP_NOMEMALLOC,
__GFP_MEMALLOC and PF_MEMALLOC and you want yet another way to control
access to reserves. This is a mess. If anything then PF_NOMEMALLOC would
be a better fit but the flag space is quite tight already. Besides that
is this really worth doing when the caller can bail out?
--
Michal Hocko
SUSE Labs
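
For illustration only, the "check for fatal_signal_pending once per pmd"
idea discussed above could be modelled in userspace roughly as follows.
This is a sketch under stated assumptions, not kernel code: struct
task_model, model_fatal_signal_pending() and model_copy_pmd_range() are
hypothetical stand-ins invented for this example, and the loop does not
reflect how the real copy_*_range helpers are structured.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the forking task (not task_struct). */
struct task_model {
	bool fatal_signal;   /* set once a SIGKILL-like signal is pending */
	size_t pmds_copied;  /* how many pmd entries were copied so far */
};

/* Hypothetical analogue of fatal_signal_pending(); only reads a flag. */
bool model_fatal_signal_pending(const struct task_model *t)
{
	return t->fatal_signal;
}

/*
 * Copy nr_pmds entries, checking for a pending fatal signal once per
 * entry. kill_at is the index at which the simulated OOM kill arrives;
 * a value >= nr_pmds means the task is never killed.
 * Returns 0 on completion or -EINTR when the copy bails out early, so
 * the caller could drop its lock and let the victim exit.
 */
int model_copy_pmd_range(struct task_model *t, size_t nr_pmds, size_t kill_at)
{
	for (size_t i = 0; i < nr_pmds; i++) {
		if (i == kill_at)
			t->fatal_signal = true;  /* simulated OOM kill */

		if (model_fatal_signal_pending(t))
			return -EINTR;           /* bail out before copying */

		t->pmds_copied++;                /* stands in for the real copy */
	}
	return 0;
}
```

In this model the per-entry cost is a single flag test, and a bail-out
stops the copy at the entry where the signal became pending.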