On Mon, Nov 07, 2011 at 04:28:08PM +0000, Mel Gorman wrote:
> Note that I didn't suddenly turn that ack into a nack though :)
>
> 1) A small comment on why we need to call anon_vma_moveto_tail in the
>    error path would be nice

I can add that.

> 2) It is unfortunate that we need the faulted_in_anon_vma just
>    for a VM_BUG_ON check that only exists for CONFIG_DEBUG_VM
>    but not earth shattering

It should be optimized away at build time. I thought it was better not
to leave that path without a VM_BUG_ON. It should be a slow path in the
first place (probably we should even mark it unlikely), and it's obscure
enough that I think a check will clarify things.

In the common case (i.e. some pte faulted in), a vma_merge on self, if
it succeeded, couldn't possibly be safe, because the vma->vm_pgoff vs
page->index linearity couldn't be valid for the same vma and the same
page at two different virtual addresses. So checking for it I think is
sane, especially given that at some point it was mentioned we could
optimize the check away altogether; it's a bit of an obscure path that
the VM_BUG_ON I think will help document (and verify).

> What I said was taking the anon_vma lock may be slower but it was
> generally easier to understand. I'm happy with the new patch too
> particularly as it keeps the "ordering game" consistent for fork
> and mremap but I previously missed move_page_tables in the error
> path so was worried if there was something else I managed to miss
> particularly in light of the "Allocating a new vma, copy first and
> merge later" direction.

I liked that direction a lot. I thought with that we could stick to
exactly the same behavior as fork and not need to reorder stuff. But the
error path is still in the way, and we'd have to undo the move in place
without tearing down the vmas. Plus it would have required writing more
code, and the allocation path wouldn't necessarily have been faster than
a reordering if the list is not huge.

> I'm also perfectly happy with my human meat brain and do not expect
> to replace it with an alien's. 8-)

On a totally different but related topic: unmap_mapping_range_tree walks
the prio tree the same way try_to_unmap_file walks it, and if truncate
can truncate "dst" before "src", then supposedly try_to_unmap_file could
miss a migration entry copied into the "child" ptep while fork runs too...
But I think there is no risk there, because we don't establish migration
ptes there, we just unmap the pagecache, so worst case we'll abort the
migration if the race triggers and we'll retry later.

But I wonder what happens if truncate runs against fork: if truncate can
drop ptes from dst before src (like the mremap comment says), we could
still end up with some pte mapped to the file in the ptes of the child,
even if the pte was correctly truncated in the parent...

Overall I think fork/mremap vs fully_reliable_rmap_walk/truncate aren't
fundamentally different in this respect. If we rely on ordering for anon
pages in fork, it's not adding too much mess to also rely on ordering
for mremap. If we take the i_mmap_mutex in mremap because we can't
enforce an order in the prio tree, then we need the i_mmap_mutex in fork
too (and that's missing). But nothing prevents us from using a lock in
mremap and ordering in fork. I think the decision should be based more
on performance expectations. So we could add the ordering to mremap
(patch posted) and add the i_mmap_mutex to fork, or we add the anon_vma
lock in both mremap and fork, and the i_mmap_mutex to fork.
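For reference, the self-merge check from point 2) above looks roughly
like this in the posted copy_vma() change (a trimmed sketch of the
posted patch, not the full function; vma_start here is the source vma's
original vm_start, saved before the merge attempt):

	new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
			    vma->anon_vma, vma->vm_file, pgoff,
			    vma_policy(vma));
	if (new_vma) {
		/*
		 * The source vma may have been merged into new_vma:
		 * the only way vma_merge can succeed "against self"
		 * during mremap is if nothing was faulted in yet and
		 * we were allowed to reset vm_pgoff to the destination
		 * address, because the vm_pgoff vs page->index
		 * linearity cannot hold for the same page at two
		 * different virtual addresses.
		 */
		if (unlikely(vma_start >= new_vma->vm_start &&
			     vma_start < new_vma->vm_end))
			VM_BUG_ON(faulted_in_anon_vma);
		else
			/* keep dst after src for the rmap walk */
			anon_vma_moveto_tail(new_vma);
	}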
Also note, if we find a way to enforce orderings in the prio tree (not
sure if it's possible; apparently it's already using list_add_tail,
so...), then we could also remove the i_mmap_mutex from mremap and fork.

Keeping the anon and file cases separated is better though. I think the
patch I posted should close the race and be ok, pending your requested
changes. If you think taking the lock is faster it's fine with me, but I
think taking the anon_vma lock once per VMA (plus the anon_vma_chain
list walk) and reducing the per-pagetable locking overhead is better.
Ideally the anon_vma_chain lists won't be long anyway, and if they are
long and lots of processes do mremap at the same time it should still
work better. The anon_vma root lock is not such a small lock to take,
and it's better not to take it repeatedly. I also recall Andi's patches
trying to avoid doing lock/unlock in a tight loop: taking the lock and
doing some work while holding it is likely better than bouncing it at
high frequency across CPUs for each pmd.
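To make the "lock once per VMA" point concrete, this is roughly what the
posted anon_vma_moveto_tail() does: one walk of dst's anon_vma_chain
with the root lock taken once for the whole walk, instead of bouncing a
lock per pmd inside move_ptes() (a sketch following the posted patch;
lock_anon_vma_root()/unlock_anon_vma_root() are the existing rmap.c
helpers that take the shared root lock at most once per walk):

	/*
	 * Put the anon_vma_chain links of dst at the tail of their
	 * anon_vmas' vma lists, so that an rmap walk always visits the
	 * src vma before the dst vma during mremap.
	 */
	void anon_vma_moveto_tail(struct vm_area_struct *dst)
	{
		struct anon_vma_chain *pavc;
		struct anon_vma *root = NULL;

		list_for_each_entry_reverse(pavc, &dst->anon_vma_chain,
					    same_vma) {
			struct anon_vma *anon_vma = pavc->anon_vma;

			VM_BUG_ON(pavc->vma != dst);
			root = lock_anon_vma_root(root, anon_vma);
			list_move_tail(&pavc->same_anon_vma,
				       &anon_vma->head);
		}
		unlock_anon_vma_root(root);
	}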