On Thu, Apr 14 2022 at 10:40, Nico Pache wrote: > The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which can > be targeted by the oom reaper. This mapping is used to store the futex > robust list head; the kernel does not keep a copy of the robust list and > instead references a userspace address to maintain the robustness during > a process death. A race can occur between exit_mm and the oom reaper that > allows the oom reaper to free the memory of the futex robust list before > the exit path has handled the futex death: > > CPU1 CPU2 > ------------------------------------------------------------------------ > page_fault > do_exit "signal" > wake_oom_reaper > oom_reaper > oom_reap_task_mm (invalidates mm) > exit_mm > exit_mm_release > futex_exit_release > futex_cleanup > exit_robust_list > get_user (EFAULT- can't access memory) > > If the get_user EFAULT's, the kernel will be unable to recover the > waiters on the robust_list, leaving userspace mutexes hung indefinitely. > > Delay the OOM reaper, allowing more time for the exit path to perform > the futex cleanup. > > Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer > > [1] https://elixir.bootlin.com/glibc/latest/source/nptl/allocatestack.c#L370 A link to the original discussion about this would be more useful than a code reference which is stale tomorrow. The above explanation is good enough to describe the problem. > > +/* > + * Give the OOM victim time to exit naturally before invoking the oom_reaping. > + * The timers timeout is arbitrary... the longer it is, the longer the worst > + * case scenario for the OOM can take. If it is too small, the oom_reaper can > + * get in the way and release resources needed by the process exit path. > + * e.g. The futex robust list can sit in Anon|Private memory that gets reaped > + * before the exit path is able to wake the futex waiters. > + */ > +#define OOM_REAPER_DELAY (2*HZ) > +static void queue_oom_reaper(struct task_struct *tsk) Bah. Did you run out of newlines? Glueing that define between the comment and the function is unreadable. Other than that. Acked-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>