On Tue, Jul 20, 2021, Yang Shi wrote: > On Tue, Jul 20, 2021 at 2:04 PM Zi Yan <ziy@xxxxxxxxxx> wrote: > > > > On 20 Jul 2021, at 16:53, Yang Shi wrote: > > > > > On Tue, Jul 20, 2021 at 7:25 AM Christian Borntraeger > > > <borntraeger@xxxxxxxxxx> wrote: > > >>> - if (mm_tlb_flush_pending(vma->vm_mm)) { > > >>> - flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE); > > >>> - /* > > >>> - * change_huge_pmd() released the pmd lock before > > >>> - * invalidating the secondary MMUs sharing the primary > > >>> - * MMU pagetables (with ->invalidate_range()). The > > >>> - * mmu_notifier_invalidate_range_end() (which > > >>> - * internally calls ->invalidate_range()) in > > >>> - * change_pmd_range() will run after us, so we can't > > >>> - * rely on it here and we need an explicit invalidate. > > >>> - */ > > >>> - mmu_notifier_invalidate_range(vma->vm_mm, haddr, > > >>> - haddr + HPAGE_PMD_SIZE); > > >>> - } > > >>> CC Paolo/KVM list so we also remove the mmu notifier here. Do we need those > > >> now in migrate_pages? I am not an expert in that code, but I cant find > > >> an equivalent mmu_notifier in migrate_misplaced_pages. > > >> I might be totally wrong, just something that I noticed. > > > > > > Do you mean the missed mmu notifier invalidate for the THP migration > > > case? Yes, I noticed that too. But I'm not sure whether it is intended > > > or just missed. > > > > From my understand of mmu_notifier document, mmu_notifier_invalidate_range() > > is needed only if the PTE is updated to point to a new page or the page pointed > > by the PTE is freed. Page migration does not fall into either case. The "new page" part of a page table entry is updated to point to a new page is referring to a different physical page, i.e. a different pfn, not a different struct page. do_huge_pmd_numa_page() is moving a THP between nodes, thus it's changing the backing pfn and needs to invalidate secondary MMUs at some point. > > In addition, in migrate_pages(), more specifically try_to_migrate_one(), > > there is a pair of mmu_notifier_invalidate_range_start() and > > mmu_notifier_invalidate_range_end() around the PTE manipulation code, which should > > be sufficient to notify secondary TLBs (including KVM) about the PTE change > > for page migration. Correct me if I am wrong. > > Thanks, I think you are correct. By looking into commit 7066f0f933a1 > ("mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()"), > the tlb flush and mmu notifier invalidate were needed since the old > numa fault implementation didn't change PTE to migration entry so it > may cause data corruption due to the writes from GPU secondary MMU. > > The refactor does use the generic migration code which converts PTE to > migration entry before copying data to the new page. That's my understanding as well, based on this blurb from commit 7066f0f933a1. The standard PAGE_SIZEd migrate_misplaced_page is less accelerated and uses the generic migrate_pages which transitions the pte from numa/protnone to a migration entry in try_to_unmap_one() and flushes TLBs and all mmu notifiers there before copying the page. That analysis/justification for removing the invalidate_range() call should be captured in the changelog. Confirmation from Andrea would be a nice bonus.