On 1/17/22 11:05, Waiman Long wrote:
> On 1/17/22 03:52, Michal Hocko wrote:
>> On Fri 14-01-22 13:01:35, Nico Pache wrote:
>>> In the case that two or more processes share a futex located within
>>> a shared mmaped region, such as a process that shares a lock between
>>> itself and child processes, we have observed that when a process
>>> holding the lock is oom killed, at least one waiter is never alerted
>>> to this new development and simply continues to wait.
>>>
>>> This is visible via pthreads by checking the __owner field of the
>>> pthread_mutex_t structure within a waiting process, perhaps with gdb.
>>>
>>> We identify reproduction of this issue by checking a waiting process
>>> of a test program, viewing the contents of the pthread_mutex_t and
>>> taking note of the value in the owner field, and then checking dmesg
>>> to see whether the owner has already been killed.
>> I believe we really need to find out why the original holder of the
>> futex is not woken up to release the lock when exiting.
>
> For a robust futex lock holder or waiter that is to be killed, it is
> not the responsibility of the task itself to wake up and release the
> lock. It is the kernel that recognizes that the task is holding or
> waiting for the robust futex and cleans things up.
>
>
>>> As mentioned by Michal in his patchset introducing the oom reaper,
>>> commit aac453635549 ("mm, oom: introduce oom reaper"), the purpose of
>>> the oom reaper is to try and free memory more quickly; however, in
>>> the case that a robust futex is being used, we want to avoid
>>> utilizing the concurrent oom reaper. This is due to a race that can
>>> occur between the SIGKILL handling of the robust futex and the oom
>>> reaper freeing the memory needed to maintain the robust list.
>> The OOM reaper is only unmapping private memory. It doesn't touch
>> shared mappings. So how could it interfere?
>>
> The futex itself may be in shared memory; however, the robust list
> entry can be in private memory. So when the robust list is being
> scanned in this case, we can be in a use-after-free situation.

I believe this is true. The userspace allocation for the pthread occurs
as a private mapping (see the sketch below):
https://elixir.bootlin.com/glibc/latest/source/nptl/allocatestack.c#L368

>>> In the case that the oom victim is utilizing a robust futex, and the
>>> SIGKILL has not yet handled the futex death, tsk->robust_list should
>>> be non-NULL. This issue can be tricky to reproduce, but with the
>>> modifications of this patch, we have been unable to reproduce it.
>> We really need a deeper analysis of the underlying problem because
>> right now I do not really see why the oom reaper should interfere
>> with a shared futex.

As I said above, the robust list processing can involve private memory.

Hmm, wait -- that reply was Longman's; restoring the quote level:

>>> Add a check that tsk->robust_list is non-NULL in wake_oom_reaper()
>>> to return early and prevent waking the oom reaper.
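To make that shared/private split concrete, here is a minimal userspace
sketch (illustrative only -- glibc performs the equivalent registration
internally when it sets up a thread, which is what the allocatestack.c
link above shows). The futex word can sit in a MAP_SHARED mapping, but
the robust_list_head that the kernel walks at exit time is plain
private memory of the registering task:

#include <linux/futex.h>	/* struct robust_list_head */
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Robust list head in ordinary (private) memory, analogous to the
 * pthread descriptor in glibc. An empty list points back at itself.
 */
static struct robust_list_head head = {
	.list		 = { .next = &head.list },
	.futex_offset	 = 0,
	.list_op_pending = NULL,
};

int main(void)
{
	/* The futex word itself can live in a shared mapping ... */
	int *futex_word = mmap(NULL, sizeof(int),
			       PROT_READ | PROT_WRITE,
			       MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	/*
	 * ... but the list the kernel walks in exit_robust_list() does
	 * not: this call points tsk->robust_list at this task's private
	 * memory, which is what the oom reaper may free underneath it.
	 */
	syscall(SYS_set_robust_list, &head, sizeof(head));

	(void)futex_word;	/* a real user would lock through it */
	return 0;
}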
>>>
>>> Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer
>>>
>>> Co-developed-by: Joel Savitz <jsavitz@xxxxxxxxxx>
>>> Signed-off-by: Joel Savitz <jsavitz@xxxxxxxxxx>
>>> Signed-off-by: Nico Pache <npache@xxxxxxxxxx>
>>> ---
>>>  mm/oom_kill.c | 15 +++++++++++++++
>>>  1 file changed, 15 insertions(+)
>>>
>>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>>> index 1ddabefcfb5a..3cdaac9c7de5 100644
>>> --- a/mm/oom_kill.c
>>> +++ b/mm/oom_kill.c
>>> @@ -667,6 +667,21 @@ static void wake_oom_reaper(struct task_struct *tsk)
>>>  	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
>>>  		return;
>>> +#ifdef CONFIG_FUTEX
>>> +	/*
>>> +	 * If the ooming task's SIGKILL has not finished handling the
>>> +	 * robust futex it is not correct to reap the mm concurrently.
>>> +	 * Do not wake the oom reaper when the task still contains a
>>> +	 * robust list.
>>> +	 */
>>> +	if (tsk->robust_list)
>>> +		return;
>>> +#ifdef CONFIG_COMPAT
>>> +	if (tsk->compat_robust_list)
>>> +		return;
>>> +#endif
>>> +#endif
>> If this turns out to be really needed, which I do not really see at
>> the moment, then this is not the right way to handle this situation.
>> The oom victim could get stuck and the oom killer wouldn't be able to
>> move forward. If anything the victim would need to get MMF_OOM_SKIP
>> set.

I will try this, but I don't immediately see any difference between
returning early here and setting the bit, putting the task on the
oom_reaper_list, and then skipping it based on the flag. Do you mind
explaining how this could lead to the oom killer getting stuck?

Cheers,
-- Nico

> There can be other ways to do that, but letting the normal kill signal
> processing finish its job and properly invoke futex_cleanup() is
> certainly one possible solution.
>
> Cheers,
> Longman
>
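For reference, the behavior a waiter should see when the lock holder is
killed looks like this with pthreads -- a minimal sketch of the working
case, not the linked reproducer, with deliberately crude sleep()-based
timing. In the race described above, the parent's final
pthread_mutex_lock() simply never returns:

#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	/* Process-shared, robust mutex in a shared mapping. */
	pthread_mutex_t *m = mmap(NULL, sizeof(*m),
				  PROT_READ | PROT_WRITE,
				  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	pthread_mutexattr_t attr;

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
	pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
	pthread_mutex_init(m, &attr);

	if (fork() == 0) {		/* child: take the lock and die */
		pthread_mutex_lock(m);
		kill(getpid(), SIGKILL);	/* stand-in for the oom kill */
	}

	sleep(1);	/* crude: give the child time to grab the lock */

	/*
	 * With a working robust list walk at exit, the kernel tags the
	 * futex with FUTEX_OWNER_DIED and this lock attempt returns
	 * EOWNERDEAD. In the reported race, the waiter blocks forever.
	 */
	if (pthread_mutex_lock(m) == EOWNERDEAD) {
		printf("owner died, recovering the lock\n");
		pthread_mutex_consistent(m);
	}

	pthread_mutex_unlock(m);
	wait(NULL);
	return 0;
}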