On Thu, Oct 13, 2011 at 04:30:09PM -0700, Hugh Dickins wrote: > mremap's down_write of mmap_sem, together with i_mmap_mutex/lock, > and pagetable locks, were good enough before page migration (with its > requirement that every migration entry be found) came in; and enough > while migration always held mmap_sem. But not enough nowadays, when > there's memory hotremove and compaction: anon_vma lock is also needed, > to make sure a migration entry is not dodging around behind our back. For things like migrate and split_huge_page, the anon_vma layer must guarantee the page is reachable by rmap walk at all times regardless if it's at the old or new address. This shall be guaranteed by the copy_vma called by move_vma well before move_page_tables/move_ptes can run. copy_vma obviously takes the anon_vma lock to insert the new "dst" vma into the anon_vma chains structures (vma_link does that). That before any pte can be moved. Because we keep two vmas mapped on both src and dst range, with different vma->vm_pgoff that is valid for the page (the page doesn't change its page->index) the page should always find _all_ its pte at any given time. There may be other variables at play like the order of insertion in the anon_vma chain matches our direction of copy and removal of the old pte. But I think the double locking of the PT lock should make the order in the anon_vma chain absolutely irrelevant (the rmap_walk obviously takes the PT lock too), and furthermore likely the anon_vma_chain insertion is favorable (the dst vma is inserted last and checked last). But it shouldn't matter. Another thing could be the copy_vma vma_merge branch succeeding (returning not NULL) but I doubt we risk to fall into that one. For the rmap_walk to be always working on both the src and dst vma->vma_pgoff the pgoff must be different so we can't possibly be ok if there's just 1 vma covering the whole range. I exclude this could be the case because the pgoff passed to copy_vma is different than the vma->vm_pgoff given to copy_vma, so vma_merge can't possibly succeed. Yet another point to investigate is the point where we teardown the old vma and we leave the new vma generated by copy_vma established. That's apparently taken care of by do_munmap in move_vma so that shall be safe too as munmap is safe in the first place. Overall I don't think this patch is needed and it seems a noop. > It appears that Mel's a8bef8ff6ea1 "mm: migration: avoid race between > shift_arg_pages() and rmap_walk() during migration by not migrating > temporary stacks" was actually a workaround for this in the special > common case of exec's use of move_pagetables(); and we should probably > now remove that VM_STACK_INCOMPLETE_SETUP stuff as a separate cleanup. I don't think this patch can help with that, the problem of execve vs rmap_walk is that there's 1 single vma existing for src and dst virtual ranges while execve runs move_page_tables. So there is no possible way that rmap_walk will be guaranteed to find _all_ ptes mapping a page if there's just one vma mapping either the src or dst range while move_page_table runs. No addition of locking whatsoever can fix that bug because we miss a vma (well modulo locking that prevents rmap_walk to run at all, until we're finished with execve, which is more or less what VM_STACK_INCOMPLETE_SETUP does...). The only way is to fix this is prevent migrate (or any other rmap_walk user that requires 100% reliability from the rmap layer, for example swap doesn't require 100% reliability and can still run and gracefully fail at finding the pte) while we're moving pagetables in execve. And that's what Mel's above mentioned patch does. The other way to fix that bug that I implemented was to do copy_vma in execve, so that we still have both src and dst ranges of move_page_tables covered by 2 (not 1) vma, each with the proper vma->vm_pgoff, so my approach fixed that bug as well (but requires a vma allocation in execve so it was dropped in favor of Mel's patch which is totally fine with as both approaches fixes the bug equally well, even if now we've to deal with this special case of sometime rmap_walk having false negatives if the vma_flags is set, and the important thing is that after VM_STACK_INCOMPLETE_SETUP has been cleared it won't ever be set again for the whole lifetime of the vma). I may be missing something, I did a short review so far, just so the patch doesn't get merged if not needed. I mean I think it needs a bit more looks on it... The fact the i_mmap_mutex was taken but the anon_vma lock was not taken (while in every other place they both are needed) certainly makes the patch look correct, but that's just a misleading coincidence I think. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>