Nadav Amit <nadav.amit@xxxxxxxxx> wrote:

> Jerome Glisse <jglisse@xxxxxxxxxx> wrote:
>
>> On Tue, Aug 29, 2017 at 07:46:07PM -0700, Nadav Amit wrote:
>>> Jérôme Glisse <jglisse@xxxxxxxxxx> wrote:
>>>
>>>> Replacing all mmu_notifier_invalidate_page() by mmu_notifier_invalidate_range()
>>>> and making sure it is bracketed by calls to mmu_notifier_invalidate_range_start/
>>>> end.
>>>>
>>>> Note that because we cannot presume the pmd value or pte value we have to
>>>> assume the worst and unconditionally report an invalidation as happening.
>>>>
>>>> Signed-off-by: Jérôme Glisse <jglisse@xxxxxxxxxx>
>>>> Cc: Dan Williams <dan.j.williams@xxxxxxxxx>
>>>> Cc: Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx>
>>>> Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
>>>> Cc: Bernhard Held <berny156@xxxxxx>
>>>> Cc: Adam Borowski <kilobyte@xxxxxxxxxx>
>>>> Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
>>>> Cc: Radim Krčmář <rkrcmar@xxxxxxxxxx>
>>>> Cc: Wanpeng Li <kernellwp@xxxxxxxxx>
>>>> Cc: Paolo Bonzini <pbonzini@xxxxxxxxxx>
>>>> Cc: Takashi Iwai <tiwai@xxxxxxx>
>>>> Cc: Nadav Amit <nadav.amit@xxxxxxxxx>
>>>> Cc: Mike Galbraith <efault@xxxxxx>
>>>> Cc: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>
>>>> Cc: axie <axie@xxxxxxx>
>>>> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
>>>> ---
>>>>  mm/rmap.c | 44 +++++++++++++++++++++++++++++++++++++++++---
>>>>  1 file changed, 41 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index c8993c63eb25..da97ed525088 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -887,11 +887,21 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>>>>  		.address = address,
>>>>  		.flags = PVMW_SYNC,
>>>>  	};
>>>> +	unsigned long start = address, end;
>>>>  	int *cleaned = arg;
>>>>
>>>> +	/*
>>>> +	 * We have to assume the worse case ie pmd for invalidation. Note that
>>>> +	 * the page can not be free from this function.
>>>> +	 */
>>>> +	end = min(vma->vm_end, (start & PMD_MASK) + PMD_SIZE);
>>>> +	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
>>>> +
>>>>  	while (page_vma_mapped_walk(&pvmw)) {
>>>> +		unsigned long cstart, cend;
>>>>  		int ret = 0;
>>>> -		address = pvmw.address;
>>>> +
>>>> +		cstart = address = pvmw.address;
>>>>  		if (pvmw.pte) {
>>>>  			pte_t entry;
>>>>  			pte_t *pte = pvmw.pte;
>>>> @@ -904,6 +914,7 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>>>>  			entry = pte_wrprotect(entry);
>>>>  			entry = pte_mkclean(entry);
>>>>  			set_pte_at(vma->vm_mm, address, pte, entry);
>>>> +			cend = cstart + PAGE_SIZE;
>>>>  			ret = 1;
>>>>  		} else {
>>>>  #ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
>>>> @@ -918,6 +929,8 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>>>>  			entry = pmd_wrprotect(entry);
>>>>  			entry = pmd_mkclean(entry);
>>>>  			set_pmd_at(vma->vm_mm, address, pmd, entry);
>>>> +			cstart &= PMD_MASK;
>>>> +			cend = cstart + PMD_SIZE;
>>>>  			ret = 1;
>>>>  #else
>>>>  			/* unexpected pmd-mapped page? */
>>>> @@ -926,11 +939,13 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>>>>  		}
>>>>
>>>>  		if (ret) {
>>>> -			mmu_notifier_invalidate_page(vma->vm_mm, address);
>>>> +			mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
>>>>  			(*cleaned)++;
>>>>  		}
>>>>  	}
>>>>
>>>> +	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
>>>> +
>>>>  	return true;
>>>>  }
>>>>
>>>> @@ -1324,6 +1339,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>>  	pte_t pteval;
>>>>  	struct page *subpage;
>>>>  	bool ret = true;
>>>> +	unsigned long start = address, end;
>>>>  	enum ttu_flags flags = (enum ttu_flags)arg;
>>>>
>>>>  	/* munlock has nothing to gain from examining un-locked vmas */
>>>> @@ -1335,6 +1351,14 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>>  					flags & TTU_MIGRATION, page);
>>>>  	}
>>>>
>>>> +	/*
>>>> +	 * We have to assume the worse case ie pmd for invalidation. Note that
>>>> +	 * the page can not be free in this function as call of try_to_unmap()
>>>> +	 * must hold a reference on the page.
>>>> +	 */
>>>> +	end = min(vma->vm_end, (start & PMD_MASK) + PMD_SIZE);
>>>> +	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
>>>> +
>>>>  	while (page_vma_mapped_walk(&pvmw)) {
>>>>  		/*
>>>>  		 * If the page is mlock()d, we cannot swap it out.
>>>> @@ -1408,6 +1432,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>>  			set_huge_swap_pte_at(mm, address,
>>>>  					     pvmw.pte, pteval,
>>>>  					     vma_mmu_pagesize(vma));
>>>> +			mmu_notifier_invalidate_range(mm, address,
>>>> +				     address + vma_mmu_pagesize(vma));
>>>
>>> I don’t think that the notifier should be called after the PTE is set, but
>>> rather after the PTE is cleared, PTE permissions are demoted (e.g., RW->RO),
>>> or access/dirty bits are cleared. [There is an exception: if the PFN in the
>>> PTE is changed without clearing the PTE first, but that does not apply here,
>>> and there is a different notifier for this case.]
>>>
>>> Therefore, IIUC, try_to_unmap_one() should only call
>>> mmu_notifier_invalidate_range() after ptep_get_and_clear() and
>>> ptep_clear_flush() are called. All the other calls to
>>> mmu_notifier_invalidate_range() in this function can be removed.
>>
>> Yes, it would simplify the patch. I was trying to optimize for the case
>> where we restore the pte to its original value after ptep_clear_flush()
>> or ptep_get_and_clear(), as in that case there is no need to invalidate
>> any secondary page table, but that's an overkill optimization that just
>> adds too much complexity.
>
> Interesting. Actually, prior to your changes, it seems that the break
> statements would skip mmu_notifier_invalidate_page() when the PTE is not
> changed. So why not just change mmu_notifier_invalidate_page() into
> mmu_notifier_invalidate_range()?

Sorry - I noticed it was actually wrong before (as you noted), at least in
one case:

	(unlikely(PageSwapBacked(page) != PageSwapCache(page)))

Regards,
Nadav
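
For reference, here is a minimal sketch of the ordering being argued for
above: bracket the walk with mmu_notifier_invalidate_range_start()/end() and
call mmu_notifier_invalidate_range() only once the CPU PTE has actually been
cleared by ptep_clear_flush(). This is only an illustration of the idea, not
the actual mm/rmap.c code; the helper name and its arguments are made up for
the example.

#include <linux/mm.h>
#include <linux/mmu_notifier.h>

/*
 * Illustrative sketch, not real kernel code: invalidate a single PTE
 * mapping and notify secondary MMUs in the order discussed above
 * (clear the CPU PTE first, notify afterwards).
 */
static void invalidate_one_pte_sketch(struct vm_area_struct *vma,
				      unsigned long address, pte_t *ptep)
{
	struct mm_struct *mm = vma->vm_mm;
	unsigned long start = address;
	unsigned long end = address + PAGE_SIZE;
	pte_t pteval;

	/* Announce that an invalidation of [start, end) is about to happen. */
	mmu_notifier_invalidate_range_start(mm, start, end);

	/* Clear the CPU page table entry and flush the TLB for it. */
	pteval = ptep_clear_flush(vma, address, ptep);

	/* Preserve dirty state gathered from the old PTE. */
	if (pte_dirty(pteval))
		set_page_dirty(pte_page(pteval));

	/*
	 * Only now, after the CPU PTE is really gone, ask secondary MMUs
	 * (KVM, IOMMU, ...) to drop their mapping of this range.
	 */
	mmu_notifier_invalidate_range(mm, start, end);

	mmu_notifier_invalidate_range_end(mm, start, end);
}

This matches the point above that, in try_to_unmap_one(), only the
mmu_notifier_invalidate_range() calls following ptep_get_and_clear() and
ptep_clear_flush() would be needed, not the ones after a PTE is set.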