On Tue, May 24, 2016 at 12:49:42AM +0300, Kirill A. Shutemov wrote: > That's what we do now and that's not enough. > > We would need to serialize against pmd_lock() during normal page-fault > path (and other pte manipulation), which we don't do now if pmd points to > page table. Yes, mmap_sem for writing while converting the pmd to a pmd_trans_huge() in khugepaged, is so that the pagetable walk doesn't require the pmd_lock after holding the mmap_sem for reading if the pmd is found !pmd_trans_unstable, i.e. if the pmd points to a pte. This way the non-THP pte walk retains the identical cost it has with THP not compiled into the kernel (even when THP is enabled). khugepaged already starts by doing work with the mmap_sem for reading, then while holding the mmap_sem for reading if khugepaged_scan_pmd() finds a candidate pmd to collapse into a pmd_trans_huge(), it calls collapse_huge_page which at some point releases the mmap_sem for reading (before the THP memory allocation) and takes it again for writing if the allocation succeeded and we can go ahead with the atomic THP collapse (under mmap_sem for writing and under the anon_vma lock for writing too to serialize against split_huge_page which can be called on the physical page and doesn't hold any mmap_sem but just finds the pagetables through a rmap walk). The atomic part is all non blocking. The swapin loop can run under the mmap_sem for reading if it does the proper check to revalidate the vma, but it should move above the below comment. /* * Prevent all access to pagetables with the exception of * gup_fast later hanlded by the ptep_clear_flush and the VM * handled by the anon_vma lock + PG_lock. */ down_write(&mm->mmap_sem); Which is more or less what the last patch was doing except by keeping the comment above the swapin stage, it made the comment wrong, as the comment was then followed by a down_read. Aside from the comment being wrong (which is not a kernel crashing issue), the real problem was lack of revalidates after releasing the mmap_sem and this revalidate attempt is also not correct: + vma = find_vma(mm, address); + /* vma is no longer available, don't continue to swapin */ + if (vma != vma_orig) + return false; Because the mmap_sem was temporarily dropped, the vma may have been freed and reallocated at the same address, but it may be a completely different vma with different vm_start/end values or it may not be anonymous or mremap may have altered the vm_start/end too or the "mm" may have exited in the meanwhile. collapse_huge_page already shows how to correctly to revalidate the vma after dropping the mmap_sem temporarily: down_write(&mm->mmap_sem); if (unlikely(khugepaged_test_exit(mm))) { result = SCAN_ANY_PROCESS; goto out; } vma = find_vma(mm, address); if (!vma) { result = SCAN_VMA_NULL; goto out; } hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK; hend = vma->vm_end & HPAGE_PMD_MASK; if (address < hstart || address + HPAGE_PMD_SIZE > hend) { result = SCAN_ADDRESS_RANGE; goto out; } if (!hugepage_vma_check(vma)) { result = SCAN_VMA_CHECK; goto out; } All checks above are needed for a correct revalidate, otherwise the above code could also have been replaced by a vma != vma_orig. If we move this swapin stage before the comment and after the THP allocation succeeded, and we do enough revalidates correctly (which are currently missing or incorrect, and find_vma is not enough for a revalidate, find_vma only says there's some random vma with vm_end > address), it should then work ok under only the mmap_sem for reading. Overall the last patch goes in the right direction just it needs to do all revalidates right and to move the swapin stage a bit more up to avoid invalidating the comment I think. Thanks, Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>