Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110

Andrea Arcangeli <aarcange@xxxxxxxxxx> · Mon, 17 Oct 2011 01:54:42 +0200

On Thu, Oct 13, 2011 at 04:30:09PM -0700, Hugh Dickins wrote:
> mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
> and pagetable locks, were good enough before page migration (with its
> requirement that every migration entry be found) came in; and enough
> while migration always held mmap_sem.  But not enough nowadays, when
> there's memory hotremove and compaction: anon_vma lock is also needed,
> to make sure a migration entry is not dodging around behind our back.

For things like migrate and split_huge_page, the anon_vma layer must
guarantee the page is reachable by rmap walk at all times regardless
if it's at the old or new address.

This shall be guaranteed by the copy_vma called by move_vma well
before move_page_tables/move_ptes can run.

copy_vma obviously takes the anon_vma lock to insert the new "dst" vma
into the anon_vma chains structures (vma_link does that). That before
any pte can be moved.

Because we keep two vmas mapped on both src and dst range, with
different vma->vm_pgoff that is valid for the page (the page doesn't
change its page->index) the page should always find _all_ its pte at
any given time.

There may be other variables at play like the order of insertion in
the anon_vma chain matches our direction of copy and removal of the
old pte. But I think the double locking of the PT lock should make the
order in the anon_vma chain absolutely irrelevant (the rmap_walk
obviously takes the PT lock too), and furthermore likely the
anon_vma_chain insertion is favorable (the dst vma is inserted last
and checked last). But it shouldn't matter.

Another thing could be the copy_vma vma_merge branch succeeding
(returning not NULL) but I doubt we risk to fall into that one. For
the rmap_walk to be always working on both the src and dst
vma->vma_pgoff the pgoff must be different so we can't possibly be ok
if there's just 1 vma covering the whole range. I exclude this could
be the case because the pgoff passed to copy_vma is different than the
vma->vm_pgoff given to copy_vma, so vma_merge can't possibly succeed.

Yet another point to investigate is the point where we teardown the
old vma and we leave the new vma generated by copy_vma
established. That's apparently taken care of by do_munmap in move_vma
so that shall be safe too as munmap is safe in the first place.

Overall I don't think this patch is needed and it seems a noop.

> It appears that Mel's a8bef8ff6ea1 "mm: migration: avoid race between
> shift_arg_pages() and rmap_walk() during migration by not migrating
> temporary stacks" was actually a workaround for this in the special
> common case of exec's use of move_pagetables(); and we should probably
> now remove that VM_STACK_INCOMPLETE_SETUP stuff as a separate cleanup.

I don't think this patch can help with that, the problem of execve vs
rmap_walk is that there's 1 single vma existing for src and dst
virtual ranges while execve runs move_page_tables. So there is no
possible way that rmap_walk will be guaranteed to find _all_ ptes
mapping a page if there's just one vma mapping either the src or dst
range while move_page_table runs. No addition of locking whatsoever
can fix that bug because we miss a vma (well modulo locking that
prevents rmap_walk to run at all, until we're finished with execve,
which is more or less what VM_STACK_INCOMPLETE_SETUP does...).

The only way is to fix this is prevent migrate (or any other rmap_walk
user that requires 100% reliability from the rmap layer, for example
swap doesn't require 100% reliability and can still run and gracefully
fail at finding the pte) while we're moving pagetables in execve. And
that's what Mel's above mentioned patch does.

The other way to fix that bug that I implemented was to do copy_vma in
execve, so that we still have both src and dst ranges of
move_page_tables covered by 2 (not 1) vma, each with the proper
vma->vm_pgoff, so my approach fixed that bug as well (but requires a
vma allocation in execve so it was dropped in favor of Mel's patch
which is totally fine with as both approaches fixes the bug equally
well, even if now we've to deal with this special case of sometime
rmap_walk having false negatives if the vma_flags is set, and the
important thing is that after VM_STACK_INCOMPLETE_SETUP has been
cleared it won't ever be set again for the whole lifetime of the vma).

I may be missing something, I did a short review so far, just so the
patch doesn't get merged if not needed. I mean I think it needs a bit
more looks on it... The fact the i_mmap_mutex was taken but the
anon_vma lock was not taken (while in every other place they both are
needed) certainly makes the patch look correct, but that's just a
misleading coincidence I think.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>