On 22/05/19, 5:37 PM, "Jason Gunthorpe" <jgg@xxxxxxxxxxxx> wrote: >On Thu, May 23, 2019 at 01:07:30AM +0530, Ajay Kaher wrote: >> From: Andrea Arcangeli <aarcange@xxxxxxxxxx> >> >> commit 1eb719f09f7e319e79f6abf2b9e8c0dcc1c477b5 upstream. > > This is not the right commit id.. In next version of this patch I will add the correct commit i.e. 04f5866e41fb70690e28397487d8bd8eea7d712a > >> The core dumping code has always run without holding the mmap_sem for >> writing, despite that is the only way to ensure that the entire vma >> layout will not change from under it. Only using some signal >> serialization on the processes belonging to the mm is not nearly enough. >> This was pointed out earlier. For example in Hugh's post from Jul 2017: >> >> https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.kernel.org%2Fr%2Falpine.LSU.2.11.1707191716030.2055%40eggly.anvils&data=02%7C01%7Cakaher%40vmware.com%7Cb42dd5dedf954b7def5908d6deae17d9%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636941236650210488&sdata=c47vXAvgRV1GKWdBffBxj0I8UgJ7voQXIA7UPEUNq54%3D&reserved=0 >> >> "Not strictly relevant here, but a related note: I was very surprised >> to discover, only quite recently, how handle_mm_fault() may be called >> without down_read(mmap_sem) - when core dumping. That seems a >> misguided optimization to me, which would also be nice to correct" >> >> In particular because the growsdown and growsup can move the >> vm_start/vm_end the various loops the core dump does around the vma will >> not be consistent if page faults can happen concurrently. >> >> Pretty much all users calling mmget_not_zero()/get_task_mm() and then >> taking the mmap_sem had the potential to introduce unexpected side >> effects in the core dumping code. >> >> Adding mmap_sem for writing around the ->core_dump invocation is a >> viable long term fix, but it requires removing all copy user and page >> faults and to replace them with get_dump_page() for all binary formats >> which is not suitable as a short term fix. >> >> For the time being this solution manually covers the places that can >> confuse the core dump either by altering the vma layout or the vma flags >> while it runs. Once ->core_dump runs under mmap_sem for writing the >> function mmget_still_valid() can be dropped. >> >> Allowing mmap_sem protected sections to run in parallel with the >> coredump provides some minor parallelism advantage to the swapoff code >> (which seems to be safe enough by never mangling any vma field and can >> keep doing swapins in parallel to the core dumping) and to some other >> corner case. >> >> In order to facilitate the backporting I added "Fixes: 86039bd3b4e6" >> however the side effect of this same race condition in /proc/pid/mem >> should be reproducible since before 2.6.12-rc2 so I couldn't add any >> other "Fixes:" because there's no hash beyond the git genesis commit. >> >> Because find_extend_vma() is the only location outside of the process >> context that could modify the "mm" structures under mmap_sem for >> reading, by adding the mmget_still_valid() check to it, all other cases >> that take the mmap_sem for reading don't need the new check after >> mmget_not_zero()/get_task_mm(). The expand_stack() in page fault >> context also doesn't need the new check, because all tasks under core >> dumping are frozen. >> >> Link: https://nam04.safelinks.protection.outlook.com/?url=http%3A%2F%2Flkml.kernel.org%2Fr%2F20190325224949.11068-1-aarcange%40redhat.com&data=02%7C01%7Cakaher%40vmware.com%7Cb42dd5dedf954b7def5908d6deae17d9%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636941236650220488&sdata=nJ7a8%2FLkSU4eT%2FioUqf8tj2ewzJKcyrg2R0ZjGWEZzA%3D&reserved=0 >> Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization") >> Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx> >> Reported-by: Jann Horn <jannh@xxxxxxxxxx> >> Suggested-by: Oleg Nesterov <oleg@xxxxxxxxxx> >> Acked-by: Peter Xu <peterx@xxxxxxxxxx> >> Reviewed-by: Mike Rapoport <rppt@xxxxxxxxxxxxx> >> Reviewed-by: Oleg Nesterov <oleg@xxxxxxxxxx> >> Reviewed-by: Jann Horn <jannh@xxxxxxxxxx> >> Acked-by: Jason Gunthorpe <jgg@xxxxxxxxxxxx> >> Acked-by: Michal Hocko <mhocko@xxxxxxxx> >> Cc: <stable@xxxxxxxxxxxxxxx> >> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> >> Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> >> Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx> >> [Ajay: Dropped infiniband changes to get patch to apply on 4.9.y] > > I think in this kernel the mm handling code for IB is in three > different drivers, it probably needs to be fixed too? Thanks Jason for pointing this. Crossed checked the locking of mmap_sem in IB driver code of 4.9 to 4.14 with >5.0 and found it requires to handle at following locations of 4.9 and 4.14: mlx4_ib_disassociate_ucontext() of drivers/infiniband/hw/mlx5/main.c: mlx5_ib_disassociate_ucontext() of drivers/infiniband/hw/mlx4/main.c To fix at above location, would you like me to modify the original patch or submit in another patch. Ajay > Jason