Re: [PATCH 1/1] [v4.9.y] coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping

Ajay Kaher <akaher@xxxxxxxxxx> · Thu, 6 Jun 2019 10:25:23 +0000

On 22/05/19, 5:37 PM, "Jason Gunthorpe" <jgg@xxxxxxxxxxxx> wrote:

>On Thu, May 23, 2019 at 01:07:30AM +0530, Ajay Kaher wrote:
>> From: Andrea Arcangeli <aarcange@xxxxxxxxxx>
>> 
>> commit 1eb719f09f7e319e79f6abf2b9e8c0dcc1c477b5 upstream.
>
> This is not the right commit id..
In next version of this patch I will add the correct commit i.e. 04f5866e41fb70690e28397487d8bd8eea7d712a 
>
>> The core dumping code has always run without holding the mmap_sem for
>> writing, despite that is the only way to ensure that the entire vma
>> layout will not change from under it.  Only using some signal
>> serialization on the processes belonging to the mm is not nearly enough.
>> This was pointed out earlier.  For example in Hugh's post from Jul 2017:
>> 
>>   https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.kernel.org%2Fr%2Falpine.LSU.2.11.1707191716030.2055%40eggly.anvils&amp;data=02%7C01%7Cakaher%40vmware.com%7Cb42dd5dedf954b7def5908d6deae17d9%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636941236650210488&amp;sdata=c47vXAvgRV1GKWdBffBxj0I8UgJ7voQXIA7UPEUNq54%3D&amp;reserved=0
>> 
>>   "Not strictly relevant here, but a related note: I was very surprised
>>     to discover, only quite recently, how handle_mm_fault() may be called
>>     without down_read(mmap_sem) - when core dumping. That seems a
>>     misguided optimization to me, which would also be nice to correct"
>> 
>> In particular because the growsdown and growsup can move the
>> vm_start/vm_end the various loops the core dump does around the vma will
>> not be consistent if page faults can happen concurrently.
>> 
>> Pretty much all users calling mmget_not_zero()/get_task_mm() and then
>> taking the mmap_sem had the potential to introduce unexpected side
>> effects in the core dumping code.
>> 
>> Adding mmap_sem for writing around the ->core_dump invocation is a
>> viable long term fix, but it requires removing all copy user and page
>> faults and to replace them with get_dump_page() for all binary formats
>> which is not suitable as a short term fix.
>> 
>> For the time being this solution manually covers the places that can
>> confuse the core dump either by altering the vma layout or the vma flags
>> while it runs.  Once ->core_dump runs under mmap_sem for writing the
>> function mmget_still_valid() can be dropped.
>> 
>> Allowing mmap_sem protected sections to run in parallel with the
>> coredump provides some minor parallelism advantage to the swapoff code
>> (which seems to be safe enough by never mangling any vma field and can
>> keep doing swapins in parallel to the core dumping) and to some other
>> corner case.
>> 
>> In order to facilitate the backporting I added "Fixes: 86039bd3b4e6"
>> however the side effect of this same race condition in /proc/pid/mem
>> should be reproducible since before 2.6.12-rc2 so I couldn't add any
>> other "Fixes:" because there's no hash beyond the git genesis commit.
>> 
>> Because find_extend_vma() is the only location outside of the process
>> context that could modify the "mm" structures under mmap_sem for
>> reading, by adding the mmget_still_valid() check to it, all other cases
>> that take the mmap_sem for reading don't need the new check after
>> mmget_not_zero()/get_task_mm().  The expand_stack() in page fault
>> context also doesn't need the new check, because all tasks under core
>> dumping are frozen.
>> 
>> Link: https://nam04.safelinks.protection.outlook.com/?url=http%3A%2F%2Flkml.kernel.org%2Fr%2F20190325224949.11068-1-aarcange%40redhat.com&amp;data=02%7C01%7Cakaher%40vmware.com%7Cb42dd5dedf954b7def5908d6deae17d9%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636941236650220488&amp;sdata=nJ7a8%2FLkSU4eT%2FioUqf8tj2ewzJKcyrg2R0ZjGWEZzA%3D&amp;reserved=0
>> Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization")
>> Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>
>> Reported-by: Jann Horn <jannh@xxxxxxxxxx>
>> Suggested-by: Oleg Nesterov <oleg@xxxxxxxxxx>
>> Acked-by: Peter Xu <peterx@xxxxxxxxxx>
>> Reviewed-by: Mike Rapoport <rppt@xxxxxxxxxxxxx>
>> Reviewed-by: Oleg Nesterov <oleg@xxxxxxxxxx>
>> Reviewed-by: Jann Horn <jannh@xxxxxxxxxx>
>> Acked-by: Jason Gunthorpe <jgg@xxxxxxxxxxxx>
>> Acked-by: Michal Hocko <mhocko@xxxxxxxx>
>> Cc: <stable@xxxxxxxxxxxxxxx>
>> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
>> Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
>> Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
>> [Ajay: Dropped infiniband changes to get patch to apply on 4.9.y]
>
> I think in this kernel the mm handling code for IB is in three
> different drivers, it probably needs to be fixed too?

Thanks Jason for pointing this.

Crossed checked the locking of mmap_sem in IB driver code of 4.9 to 4.14 with >5.0
and found it requires to handle at following locations of 4.9 and 4.14:
mlx4_ib_disassociate_ucontext() of drivers/infiniband/hw/mlx5/main.c:
mlx5_ib_disassociate_ucontext() of drivers/infiniband/hw/mlx4/main.c

To fix at above location, would you like me to modify the original patch or submit in another patch.

Ajay

> Jason