Hi Kirill, On Mon, Apr 14, 2014 at 05:42:18PM +0300, Kirill A. Shutemov wrote: > I've spent few day trying to understand rmap code. And now I think my > patch is wrong. > > I actually don't see where walk order requirement comes from. It seems all > operations (insert, remove, foreach) on anon_vma is serialized with > anon_vma->root->rwsem. Andrea, could you explain this for me? It's true the locking protects and freezes the view of all anon_vma structures associated with the page, but that only guarantees you not to miss the vma. Not missing the vma is not enough. You can still miss a pte during the rmap_walk if the order is wrong, because the pte/pmds are still moving freely under the vmas (absent of the PT lock and the mmap_sem). The problem are all MM operations that copies or move a page mapping from a source to destination vma (fork and mremap). They take the anon_vma lock, insert the destination vma with the proper anon_vma chains and then they _drop_ the anon vma lock, and only later they start moving ptes and pmds around (by taking the proper PT locks). anon_vma -> src_vma -> dst_vma If the order is like above (guaranteed before the interval tree was introduced), if the rmap_walk of split_huge_page and migrate encounters the source pte/pmd and split/unmap it before it gets copied, then the copy or move will retain the processed state (regular pte instead of trans_huge_pmd for split_huge_page or migration pte for migrate). If instead the pmd/pte was already copied by the time the src_vma is scanned, then it will encounter the copy to process in the dst_vma too. The rmap_walk can't miss a pte/pmd if the anon_vma chain is walked in insertion order (i.e. older vma first). anon_vma -> dst_vma -> src_vma If the anon_vma walk order is reversed vs the insertion order, things falls apart because you will scan dst_vma in split_huge_page while it still empty, find nothing, then the trans_huge_pmd is moved or copied from src_vma to dst_vma by the MM code only holding PT lock and mmap_sem for writing (we cannot hold those across the whole duration of split_huge_page and migrate). So if the rmap_walk order is not right, the rmap_walk can miss the contents of the dst_vma that was still empty at the time it was processed. If the interval tree walk order cannot be fixed without screwing with the computation complexity of the structure, a more black and white fix could be to add a anon_vma templist to scan in O(N) after the interval tree has been scanned, where you add newly inserted vmas. The templist shall then be flushed back to the interval tree only after the pte/pmd mangling of the MM operation is completed. That requires identifying the closure of the critical section for those problematic MM operations. The main drawback is actually having to take the anon_vma lock twice, the second time for the flush to the interval tree. Looping like in your previous patch would be much simpler if it could be made reliable, but it looked like it wouldn't close the bug entirely because any concurrent unmap operation could lead to false negative hiding the pmd/pte walk miss (by decreasing page->mapcount under us). Comments? Thanks, Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>