On Thu, May 23, 2019 at 01:07:30AM +0530, Ajay Kaher wrote: > From: Andrea Arcangeli <aarcange@xxxxxxxxxx> > > commit 1eb719f09f7e319e79f6abf2b9e8c0dcc1c477b5 upstream. This is not the right commit id.. > The core dumping code has always run without holding the mmap_sem for > writing, despite that is the only way to ensure that the entire vma > layout will not change from under it. Only using some signal > serialization on the processes belonging to the mm is not nearly enough. > This was pointed out earlier. For example in Hugh's post from Jul 2017: > > https://lkml.kernel.org/r/alpine.LSU.2.11.1707191716030.2055@eggly.anvils > > "Not strictly relevant here, but a related note: I was very surprised > to discover, only quite recently, how handle_mm_fault() may be called > without down_read(mmap_sem) - when core dumping. That seems a > misguided optimization to me, which would also be nice to correct" > > In particular because the growsdown and growsup can move the > vm_start/vm_end the various loops the core dump does around the vma will > not be consistent if page faults can happen concurrently. > > Pretty much all users calling mmget_not_zero()/get_task_mm() and then > taking the mmap_sem had the potential to introduce unexpected side > effects in the core dumping code. > > Adding mmap_sem for writing around the ->core_dump invocation is a > viable long term fix, but it requires removing all copy user and page > faults and to replace them with get_dump_page() for all binary formats > which is not suitable as a short term fix. > > For the time being this solution manually covers the places that can > confuse the core dump either by altering the vma layout or the vma flags > while it runs. Once ->core_dump runs under mmap_sem for writing the > function mmget_still_valid() can be dropped. > > Allowing mmap_sem protected sections to run in parallel with the > coredump provides some minor parallelism advantage to the swapoff code > (which seems to be safe enough by never mangling any vma field and can > keep doing swapins in parallel to the core dumping) and to some other > corner case. > > In order to facilitate the backporting I added "Fixes: 86039bd3b4e6" > however the side effect of this same race condition in /proc/pid/mem > should be reproducible since before 2.6.12-rc2 so I couldn't add any > other "Fixes:" because there's no hash beyond the git genesis commit. > > Because find_extend_vma() is the only location outside of the process > context that could modify the "mm" structures under mmap_sem for > reading, by adding the mmget_still_valid() check to it, all other cases > that take the mmap_sem for reading don't need the new check after > mmget_not_zero()/get_task_mm(). The expand_stack() in page fault > context also doesn't need the new check, because all tasks under core > dumping are frozen. > > Link: http://lkml.kernel.org/r/20190325224949.11068-1-aarcange@xxxxxxxxxx > Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization") > Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx> > Reported-by: Jann Horn <jannh@xxxxxxxxxx> > Suggested-by: Oleg Nesterov <oleg@xxxxxxxxxx> > Acked-by: Peter Xu <peterx@xxxxxxxxxx> > Reviewed-by: Mike Rapoport <rppt@xxxxxxxxxxxxx> > Reviewed-by: Oleg Nesterov <oleg@xxxxxxxxxx> > Reviewed-by: Jann Horn <jannh@xxxxxxxxxx> > Acked-by: Jason Gunthorpe <jgg@xxxxxxxxxxxx> > Acked-by: Michal Hocko <mhocko@xxxxxxxx> > Cc: <stable@xxxxxxxxxxxxxxx> > Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> > Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> > Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx> > [Ajay: Dropped infiniband changes to get patch to apply on 4.9.y] I think in this kernel the mm handling code for IB is in three different drivers, it probably needs to be fixed too? Jason