On Thu 17-03-22 21:36:21, Nico Pache wrote: > The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which can > be targeted by the oom reaper. This mapping is used to store the futex > robust list; the kernel does not keep a copy of the robust list and instead > references a userspace address to maintain the robustness during a process > death. A race can occur between exit_mm and the oom reaper that allows > the oom reaper to free the memory of the futex robust list before the > exit path has handled the futex death: > > CPU1 CPU2 > ------------------------------------------------------------------------ > page_fault > out_of_memory > do_exit "signal" > wake_oom_reaper > oom_reaper > oom_reap_task_mm (invalidates mm) > exit_mm > exit_mm_release > futex_exit_release > futex_cleanup > exit_robust_list > get_user (EFAULT- can't access memory) I still think it is useful to be explicit about the consequences of the EFAULT here. Did you want to mention that a failing get_user in this path would result in a hang because nobody is woken up when the current holder of the lock terminates. > While in the oom reaper thread, we must handle the futex cleanup without > sleeping. To achieve this, add the boolean `try` to futex_exit_begin(). > This will control weather or not we use a trylock. Also introduce > try_futex_exit_release() which will utilize the trylock version of the > futex_cleanup_begin(). Also call kthread_use_mm in this context to assure > the get_user call in futex_cleanup() does not return with EFAULT. This alone is not sufficient. get_user can sleep in the #PF handler path (e.g. by waiting for swap in). Or is there any guarantee that the page is never swapped out? If we cannot rule out #PF then this is not a viable way to address the problem I am afraid. [...] > @@ -587,6 +588,18 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) > goto out_unlock; > } > > + /* We can't reap a process holding a robust_list; the pthread > + * struct is allocated in userspace using PRIVATE | ANONYMOUS > + * memory which when reaped before futex_cleanup() can leave > + * the waiting process stuck. Try to perform the futex_cleanup, > + * and if unsuccessful, skip the OOM reaping. > + */ > + if (task_has_robust_list(tsk) && !try_futex_exit_release(tsk)) { > + trace_skip_task_reaping(tsk->pid); > + pr_info("oom_reaper: skipping task as it contains a robust list"); > + goto out_finish; > + } > + > trace_start_task_reaping(tsk->pid); > > /* failed to reap part of the address space. Try again later */ Please also note that this all is done after mmap_lock has been already taken so a page fault could deadlock on the mmap_lock. The more I am thinking about this the more I am getting convinced that we should rather approach this differently and skip over vmas which can be holding the list. Have you considered this option? -- Michal Hocko SUSE Labs