On 2018/11/15 20:36, Michal Hocko wrote:
> On Thu 15-11-18 18:54:15, Tetsuo Handa wrote:
> > On 2018/11/14 19:16, Michal Hocko wrote:
> > > On Wed 14-11-18 18:46:13, Tetsuo Handa wrote:
> > > [...]
> > > > There is always an invisible lock called "scheduling priority". You can't
> > > > leave the MMF_OOM_SKIP to the exit path. Your approach is not ready for
> > > > handling the worst case.
> > >
> > > And that problem is all over the memory reclaim. You can get starved
> > > to death and block other resources. And the memory reclaim is not the
> > > only one.
> >
> > I think that it is a manner for kernel developers that no thread keeps
> > consuming CPU resources forever. In the kernel world, doing
> >
> >   while (1);
> >
> > is not permitted. Likewise, doing
> >
> >   for (i = 0; i < very_large_value; i++)
> >       do_something_which_does_not_yield_CPU_to_others();
>
> There is nothing like that proposed in this series.
>
> > has to be avoided, in order to avoid lockup problems. We are required to
> > yield CPU to others when we are waiting for somebody else to make progress.
> > It is the page allocator who is refusing to yield CPU to those who need CPU.
>
> And we do that in the reclaim path.

No we don't. Please explain why holding oom_lock versus not holding oom_lock
in the patch shown below makes a difference (i.e. why holding oom_lock can
avoid lockups while not holding it fails to avoid lockups).

----------
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3137,50 +3137,53 @@ void exit_mmap(struct mm_struct *mm)
 	/*
 	 * oom_reaper cannot race with the page tables teardown but we
 	 * want to make sure that the exit path can take over the full
 	 * tear down when it is safe to do so
 	 */
 	if (oom) {
 		extern void my_setpriority(void);
 		down_write(&mm->mmap_sem);
 		__unlink_vmas(vma);
+		mutex_lock(&oom_lock);
 		/*
 		 * the exit path is guaranteed to finish the memory tear down
 		 * without any unbound blocking at this stage so make it clear
 		 * to the oom_reaper
 		 */
 		mm->mmap = NULL;
 		up_write(&mm->mmap_sem);
 		my_setpriority();
 		__free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	} else {
 		free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	}
 	tlb_finish_mmu(&tlb, 0, -1);

 	/*
+	 * Now that the full address space is torn down, make sure the
+	 * OOM killer skips over this task
+	 */
+	if (oom) {
+		set_bit(MMF_OOM_SKIP, &mm->flags);
+		mutex_unlock(&oom_lock);
+	}
+
+	/*
 	 * Walk the list again, actually closing and freeing it,
 	 * with preemption enabled, without holding any MM locks.
 	 */
 	while (vma) {
 		if (vma->vm_flags & VM_ACCOUNT)
 			nr_accounted += vma_pages(vma);
 		vma = remove_vma(vma);
 	}
 	vm_unacct_memory(nr_accounted);
-
-	/*
-	 * Now that the full address space is torn down, make sure the
-	 * OOM killer skips over this task
-	 */
-	if (oom)
-		set_bit(MMF_OOM_SKIP, &mm->flags);
 }

 /* Insert vm structure into process list sorted by address
  * and into the inode's i_mmap tree.  If vm_file is non-NULL
  * then i_mmap_rwsem is taken here.
  */
 int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
 {
 	struct vm_area_struct *prev;
----------
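For reference, this is roughly what the allocator side does when it cannot
take oom_lock (a condensed excerpt of __alloc_pages_may_oom() in
mm/page_alloc.c; the exact code differs between kernel versions):

----------
	/*
	 * Acquire the oom lock. If that fails, somebody else is
	 * making progress for us; sleep for a jiffy and let the
	 * caller retry the allocation instead of invoking the OOM
	 * killer again.
	 */
	if (!mutex_trylock(&oom_lock)) {
		*did_some_progress = 1;
		schedule_timeout_uninterruptible(1);
		return NULL;
	}
----------

While exit_mmap() holds oom_lock across __free_pgtables(), allocating threads
which reach this point go to sleep rather than spinning in the reclaim path,
and that is what gives even an idle-priority victim the CPU time needed to
finish the tear down.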
> > Since the OOM reaper kernel thread "has normal priority" and "can run on any
> > CPU", the possibility of failing to run is lower than an OOM victim thread
> > which "has idle priority" and "can run on only limited CPU". You are trying
> > to add a dependency on such thread, and I'm saying that adding a dependency
> > on such thread increases possibility of lockup.
>
> Sigh. No, this is not the case. All this patch series does is that we
> hand over to the exiting task once it doesn't block on any locks
> anymore. If the thread is low priority then it is quite likely that the
> oom reaper is done by the time the victim even reaches the exit path.

Not true. A thread executing exit_mmap() can change its priority at any
moment. Also, the OOM reaper kernel thread can fail to complete OOM reaping
before exit_mmap() sets mm->mmap = NULL, due to e.g. a
down_read_trylock(&mm->mmap_sem) failure, being busy reaping other OOM
victims in different OOM domains, or being preempted because of scheduling
priority. You will notice it by trying both

----------
 	 */
 	if (mm->mmap)
 		set_bit(MMF_OOM_SKIP, &mm->flags);
+	else
+		pr_info("Handed over %u to exit path.\n", task_pid_nr(tsk));

 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
----------

and

----------
 	 */
 	if (mm->mmap)
 		set_bit(MMF_OOM_SKIP, &mm->flags);
+	else if (!test_bit(MMF_OOM_SKIP, &mm->flags))
+		pr_info("Handed over %u to exit path.\n", task_pid_nr(tsk));

 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
----------

and comparing how frequently the "Handed over " message is printed.

> > Yes, even the OOM reaper kernel thread might fail to run if all CPUs were
> > busy with realtime threads waiting for the OOM reaper kernel thread to make
> > progress. In that case, we had better stop relying on asynchronous memory
> > reclaim, and switch to direct OOM reaping by allocating threads.
> >
> > But what I demonstrated is that
> >
> >   /*
> >    * the exit path is guaranteed to finish the memory tear down
> >    * without any unbound blocking at this stage so make it clear
> >    * to the oom_reaper
> >    */
> >
> > becomes a lie even when only one CPU was busy with realtime threads waiting
> > for an idle thread to make progress. If the page allocator stops telling a
> > lie that "an OOM victim is making progress on behalf of me", we can avoid
> > the lockup.
>
> OK, I stopped reading right here. This discussion is pointless. Once you
> busy loop all CPUs you are screwed.

Look at the log files. I made only 1 CPU (out of 8 CPUs) busy.

> Are you going to blame a filesystem
> that no progress can be made if a code path holding an important lock
> is preempted by high priority stuff and no further progress can be
> made?

I don't blame that case, because it is doing something which is not a
kernel bug.

> This is just ridiculous. What you are arguing here is not fixable
> with the current upstream kernel.

I do blame the memory allocation case, because it is doing something which
is a kernel bug, and one that can be avoided if we stop telling a lie. No
future kernel is fixable as long as we keep telling that lie.

> Even your so beloved timeout based
> solution doesn't cope with that because oom reaper can be preempted for
> unbound amount of time.

Yes, and that's the reason I suggest direct OOM reaping.

> Your argument just doesn't make much sense in
> the context of the current kernel. Full stop.

There will be no full stop at all. What I'm saying is "don't rely on a busy
polling loop". The reclaim path becomes a no-op for a thread executing
__free_pgtables() while a memcg OOM is waiting for that thread to complete
__free_pgtables() and set MMF_OOM_SKIP.
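To make the suggestion concrete, here is a rough sketch of what I mean by
direct OOM reaping. It is illustrative only: direct_oom_reap() is a
hypothetical helper, and oom_reap_task_mm() is the existing reaping function
in mm/oom_kill.c (currently static there), used here just to convey the
direction:

----------
/*
 * Hypothetical sketch: the thread which hit OOM reaps the victim's
 * memory itself, using its own CPU time, instead of busy-looping in
 * the reclaim path while waiting for the oom_reaper kernel thread
 * (which might be starved of CPU time) to set MMF_OOM_SKIP.
 */
static void direct_oom_reap(struct task_struct *victim)
{
	struct mm_struct *mm = victim->signal->oom_mm;

	if (mm && !test_bit(MMF_OOM_SKIP, &mm->flags))
		oom_reap_task_mm(victim, mm);
}
----------

The point is not this exact code. The point is that progress then depends
only on CPU time which the allocating thread itself already has, not on some
other thread winning the scheduling race.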