On Mon, Dec 16, 2024 at 9:09 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 16.12.24 17:51, Dev Jain wrote:
> > In contrast to PMD-collapse, we do not need to operate on two levels of pagetable
> > simultaneously. Therefore, downgrade the mmap lock from write to read mode. Still
> > take the anon_vma lock in exclusive mode so as to not waste time in the rmap path,
> > which is anyways going to fail since the PTEs are going to be changed. Under the PTL,
> > copy page contents, clear the PTEs, remove folio pins, and (try to) unmap the
> > old folios. Set the PTEs to the new folio using the set_ptes() API.
> >
> > Signed-off-by: Dev Jain <dev.jain@xxxxxxx>
> > ---
> > Note: I have been trying hard to get rid of the locks in here: we still are
> > taking the PTL around the page copying; dropping the PTL and taking it after
> > the copying should lead to a deadlock, for example:
> >
> > khugepaged                      madvise(MADV_COLD)
> > folio_lock()                    lock(ptl)
> > lock(ptl)                       folio_lock()
> >
> > We can create a locked folio list, altogether drop both the locks, take the PTL,
> > do everything which __collapse_huge_page_isolate() does *except* the isolation and
> > again try locking folios, but then it will reduce efficiency of khugepaged
> > and almost looks like a forced solution :)
> > Please note the following discussion if anyone is interested:
> > https://lore.kernel.org/all/66bb7496-a445-4ad7-8e56-4f2863465c54@xxxxxxx/
> > (Apologies for not CCing the mailing list from the start)
> >
> >  mm/khugepaged.c | 108 ++++++++++++++++++++++++++++++++++++++----------
> >  1 file changed, 87 insertions(+), 21 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 88beebef773e..8040b130e677 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -714,24 +714,28 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> >  					struct vm_area_struct *vma,
> >  					unsigned long address,
> >  					spinlock_t *ptl,
> > -					struct list_head *compound_pagelist)
> > +					struct list_head *compound_pagelist, int order)
> >  {
> >  	struct folio *src, *tmp;
> >  	pte_t *_pte;
> >  	pte_t pteval;
> >
> > -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> > +	for (_pte = pte; _pte < pte + (1UL << order);
> >  	     _pte++, address += PAGE_SIZE) {
> >  		pteval = ptep_get(_pte);
> >  		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> >  			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> >  			if (is_zero_pfn(pte_pfn(pteval))) {
> > -				/*
> > -				 * ptl mostly unnecessary.
> > -				 */
> > -				spin_lock(ptl);
> > -				ptep_clear(vma->vm_mm, address, _pte);
> > -				spin_unlock(ptl);
> > +				if (order == HPAGE_PMD_ORDER) {
> > +					/*
> > +					 * ptl mostly unnecessary.
> > +					 */
> > +					spin_lock(ptl);
> > +					ptep_clear(vma->vm_mm, address, _pte);
> > +					spin_unlock(ptl);
> > +				} else {
> > +					ptep_clear(vma->vm_mm, address, _pte);
> > +				}
> >  				ksm_might_unmap_zero_page(vma->vm_mm, pteval);
> >  			}
> >  		} else {
> > @@ -740,15 +744,20 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> >  			src = page_folio(src_page);
> >  			if (!folio_test_large(src))
> >  				release_pte_folio(src);
> > -			/*
> > -			 * ptl mostly unnecessary, but preempt has to
> > -			 * be disabled to update the per-cpu stats
> > -			 * inside folio_remove_rmap_pte().
> > -			 */
> > -			spin_lock(ptl);
> > -			ptep_clear(vma->vm_mm, address, _pte);
> > -			folio_remove_rmap_pte(src, src_page, vma);
> > -			spin_unlock(ptl);
> > +			if (order == HPAGE_PMD_ORDER) {
> > +				/*
> > +				 * ptl mostly unnecessary, but preempt has to
> > +				 * be disabled to update the per-cpu stats
> > +				 * inside folio_remove_rmap_pte().
> > +				 */
> > +				spin_lock(ptl);
> > +				ptep_clear(vma->vm_mm, address, _pte);
> > +				folio_remove_rmap_pte(src, src_page, vma);
> > +				spin_unlock(ptl);

I think it is ok not to take the ptl here, since preemption is already
disabled at this point by pte_map(); pte_unmap() is only called after
the copy.

> > +			} else {
> > +				ptep_clear(vma->vm_mm, address, _pte);
> > +				folio_remove_rmap_pte(src, src_page, vma);
> > +			}
>
> As I've talked to Nico about this code recently ... :)
>
> Are you clearing the PTE after the copy succeeded? If so, where is the
> TLB flush?
>
> How do you sync against concurrent write access + GUP-fast?
>
> The sequence really must be: (1) clear PTE/PMD + flush TLB (2) check if
> there are unexpected page references (e.g., GUP) if so back off (3)
> copy page content (4) set updated PTE/PMD.

Yeah, IIRC either the PMD is not cleared or tlb_remove_table_sync_one()
is not called here, so a concurrent GUP may change the refcount after
the refcount check.

> To Nico, I suggested doing it simple initially, and still clear the
> high-level PMD entry + flush under mmap write lock, then re-map the PTE
> table after modifying the page table. It's not as efficient, but "harder
> to get wrong".
>
> Maybe that's already happening, but I stumbled over this clearing logic
> in __collapse_huge_page_copy_succeeded(), so I'm curious.
>
> --
> Cheers,
>
> David / dhildenb
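
To make the (1)-(4) ordering concrete, here is a minimal sketch for a
single PTE-mapped small folio. This is illustrative only, not the
patch's actual code path: collapse_copy_one_sketch() and the
expected_refs parameter are made-up stand-ins for the real
isolation/refcount logic, the caller is assumed to hold the PTL and the
folio lock, and rmap/accounting for the new page is omitted. Also note
that on architectures where a TLB flush does not IPI all CPUs, syncing
against GUP-fast may additionally require tlb_remove_table_sync_one(),
as David mentions above.

#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/pgtable.h>

static int collapse_copy_one_sketch(struct vm_area_struct *vma,
				    unsigned long addr, pte_t *ptep,
				    struct folio *src, struct page *dst,
				    int expected_refs)
{
	pte_t orig_pte;

	/*
	 * (1) Clear the PTE and flush the TLB *before* copying, so no
	 *     CPU can keep writing through a stale translation while
	 *     the contents are being copied.
	 */
	orig_pte = ptep_clear_flush(vma, addr, ptep);

	/*
	 * (2) With the mapping gone, anything above the expected
	 *     refcount is an unexpected reference (e.g. a GUP pin):
	 *     back off and restore the old PTE.
	 */
	if (folio_ref_count(src) != expected_refs) {
		set_pte_at(vma->vm_mm, addr, ptep, orig_pte);
		return -EAGAIN;
	}

	/* (3) Only now is it safe to copy the page contents ... */
	copy_user_highpage(dst, folio_page(src, 0), addr, vma);

	/*
	 * (4) ... and to publish a PTE pointing at the new page
	 *     (permission/dirty bits and rmap handling omitted).
	 */
	set_pte_at(vma->vm_mm, addr, ptep, mk_pte(dst, vma->vm_page_prot));
	return 0;
}

The point of the ordering is that step (1) fences off both writers and
new GUP-fast references through this mapping, which is what makes the
refcount check in step (2) and the copy in step (3) meaningful.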