Re: [PATCH] mm,do_huge_pmd_numa_page: remove unnecessary TLB flushing code

Sean Christopherson <seanjc@xxxxxxxxxx> · Wed, 21 Jul 2021 15:41:05 +0000

On Tue, Jul 20, 2021, Yang Shi wrote:
> On Tue, Jul 20, 2021 at 2:04 PM Zi Yan <ziy@xxxxxxxxxx> wrote:
> >
> > On 20 Jul 2021, at 16:53, Yang Shi wrote:
> >
> > > On Tue, Jul 20, 2021 at 7:25 AM Christian Borntraeger
> > > <borntraeger@xxxxxxxxxx> wrote:
> > >>> -     if (mm_tlb_flush_pending(vma->vm_mm)) {
> > >>> -             flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
> > >>> -             /*
> > >>> -              * change_huge_pmd() released the pmd lock before
> > >>> -              * invalidating the secondary MMUs sharing the primary
> > >>> -              * MMU pagetables (with ->invalidate_range()). The
> > >>> -              * mmu_notifier_invalidate_range_end() (which
> > >>> -              * internally calls ->invalidate_range()) in
> > >>> -              * change_pmd_range() will run after us, so we can't
> > >>> -              * rely on it here and we need an explicit invalidate.
> > >>> -              */
> > >>> -             mmu_notifier_invalidate_range(vma->vm_mm, haddr,
> > >>> -                                           haddr + HPAGE_PMD_SIZE);
> > >>> -     }
> > >>> CC Paolo/KVM list so we also remove the mmu notifier here. Do we need those
> > >> now in migrate_pages? I am not an expert in that code, but I cant find
> > >> an equivalent mmu_notifier in migrate_misplaced_pages.
> > >> I might be totally wrong, just something that I noticed.
> > >
> > > Do you mean the missed mmu notifier invalidate for the THP migration
> > > case? Yes, I noticed that too. But I'm not sure whether it is intended
> > > or just missed.
> >
> > From my understand of mmu_notifier document, mmu_notifier_invalidate_range()
> > is needed only if the PTE is updated to point to a new page or the page pointed
> > by the PTE is freed. Page migration does not fall into either case.

The "new page" part of

  a page table entry is updated to point to a new page

is referring to a different physical page, i.e. a different pfn, not a different
struct page.  do_huge_pmd_numa_page() is moving a THP between nodes, thus it's
changing the backing pfn and needs to invalidate secondary MMUs at some point.

> > In addition, in migrate_pages(), more specifically try_to_migrate_one(),
> > there is a pair of mmu_notifier_invalidate_range_start() and
> > mmu_notifier_invalidate_range_end() around the PTE manipulation code, which should
> > be sufficient to notify secondary TLBs (including KVM) about the PTE change
> > for page migration. Correct me if I am wrong.
> 
> Thanks, I think you are correct. By looking into commit 7066f0f933a1
> ("mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()"),
> the tlb flush and mmu notifier invalidate were needed since the old
> numa fault implementation didn't change PTE to migration entry so it
> may cause data corruption due to the writes from GPU secondary MMU.
> 
> The refactor does use the generic migration code which converts PTE to
> migration entry before copying data to the new page.

That's my understanding as well, based on this blurb from commit 7066f0f933a1.

    The standard PAGE_SIZEd migrate_misplaced_page is less accelerated and
    uses the generic migrate_pages which transitions the pte from
    numa/protnone to a migration entry in try_to_unmap_one() and flushes TLBs
    and all mmu notifiers there before copying the page.

That analysis/justification for removing the invalidate_range() call should be
captured in the changelog.  Confirmation from Andrea would be a nice bonus.