On Mon, Sep 19, 2022 at 11:12 AM Yang Shi <shy828301@xxxxxxxxx> wrote: > > On Mon, Sep 19, 2022 at 8:29 AM Zach O'Keefe <zokeefe@xxxxxxxxxx> wrote: > > > > On Sep 16 13:38, Yang Shi wrote: > > > On Wed, Sep 7, 2022 at 7:45 AM Zach O'Keefe <zokeefe@xxxxxxxxxx> wrote: > > > > > > > > Add support for MADV_COLLAPSE to collapse shmem-backed and file-backed > > > > memory into THPs (requires CONFIG_READ_ONLY_THP_FOR_FS=y). > > > > > > > > On success, the backing memory will be a hugepage. For the memory range > > > > and process provided, the page tables will synchronously have a huge pmd > > > > installed, mapping the THP. Other mappings of the file extent mapped by > > > > the memory range may be added to a set of entries that khugepaged will > > > > later process and attempt to update their page tables to map the THP by a pmd. > > > > > > > > This functionality unlocks two important uses: > > > > > > > > (1) Immediately back executable text by THPs. Current support provided > > > > by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large > > > > system, which might prevent services from serving at their full rated > > > > load after (re)starting. Tricks like mremap(2)'ing text onto > > > > anonymous memory to immediately realize iTLB performance prevent > > > > page sharing and demand paging, both of which increase steady state > > > > memory footprint. Now, we can have the best of both worlds: Peak > > > > upfront performance and lower RAM footprints. > > > > > > > > (2) userfaultfd-based live migration of virtual machines satisfies UFFD > > > > faults by fetching native-sized pages over the network (to avoid > > > > latency of transferring an entire hugepage). However, after guest > > > > memory has been fully copied to the new host, MADV_COLLAPSE can > > > > be used to immediately increase guest performance. > > > > > > > > Since khugepaged is single threaded, this change now introduces the > > > > possibility of collapse contexts racing in the file collapse path. There are > > > > a few important places to consider: > > > > > > > > (1) hpage_collapse_scan_file(), when we xas_pause() and drop RCU. > > > > We could have the memory collapsed out from under us, but > > > > the next xas_for_each() iteration will correctly pick up the > > > > hugepage. The hugepage might not be up to date (insofar as > > > > copying of small page contents might not have completed - the > > > > page still may be locked), but regardless of what small page index > > > > we were iterating over, we'll find the hugepage and identify it > > > > as a suitably aligned compound page of order HPAGE_PMD_ORDER. > > > > > > > > In the khugepaged path, we locklessly check the value of the pmd, > > > > and only add it to the deferred collapse array if we find a pmd > > > > mapping a pte table. This is fine, since other values that could > > > > have raced in right afterwards denote failure, or that the > > > > memory was successfully collapsed, so we don't need further > > > > processing. > > > > > > > > In the madvise path, we'll take mmap_lock() in write to serialize > > > > against page table updates and will know what to do based on the > > > > true value of the pmd: recheck all ptes if we point to a pte table; > > > > directly install the pmd if the pmd has been cleared but the > > > > memory has not yet been faulted; or do nothing at all if we find a huge pmd. > > > > > > > > It's worth putting emphasis here on how we treat the none pmd > > > > here.
If khugepaged has processed this mm's page tables > > > > already, it will have left the pmd cleared (ready for refault by > > > > the process). Depending on the VMA flags and sysfs settings, the > > > > amount of RAM on the machine, and the current load, this could be a > > > > relatively common occurrence - and as such is one we'd like to > > > > handle successfully in MADV_COLLAPSE. When we see the none pmd > > > > in collapse_pte_mapped_thp(), we've locked mmap_lock in write > > > > and checked (a) hugepage_vma_check() to see if the backing > > > > memory is still appropriate, along with VMA sizing and > > > > appropriate hugepage alignment within the file, and (b) we've > > > > found a hugepage head of order HPAGE_PMD_ORDER at the offset > > > > in the file mapped by our hugepage-aligned virtual address. > > > > Even though the common case is likely a race with khugepaged, > > > > given these checks (regardless of how we got here - we could be > > > > operating on a completely different file than originally checked > > > > in hpage_collapse_scan_file() for all we know) it should be safe > > > > to directly make the pmd a huge pmd pointing to this hugepage. > > > > > > > > (2) collapse_file() is mostly serialized on the same file extent by > > > > lock sequence: > > > > > > > > | lock hugepage > > > > | lock mapping->i_pages > > > > | lock 1st page > > > > | unlock mapping->i_pages > > > > | <page checks> > > > > | lock mapping->i_pages > > > > | page_ref_freeze(3) > > > > | xas_store(hugepage) > > > > | unlock mapping->i_pages > > > > | page_ref_unfreeze(1) > > > > | unlock 1st page > > > > V unlock hugepage > > > > > > > > Once a context (which already has its fresh hugepage locked) > > > > locks mapping->i_pages exclusively, it will hold said lock > > > > until it locks the first page, and it will hold that lock until > > > > after the hugepage has been added to the page cache (and > > > > will unlock the hugepage after page table update, though that > > > > isn't important here). > > > > > > > > A racing context that loses the race for mapping->i_pages will > > > > then lose the race to locking the first page. Here - depending > > > > on how far the other racing context has gotten - we might find > > > > the new hugepage (in which case we'll exit cleanly when we > > > > check PageTransCompound()), or we'll find the "old" 1st small > > > > page (in which case we'll exit cleanly when we discover an unexpected > > > > refcount of 2 after isolate_lru_page()). This is assuming we > > > > are able to successfully lock the page we find - in the shmem path, > > > > we could just fail the trylock and exit cleanly anyway. > > > > > > > > The failure path in collapse_file() is similar: once we hold the lock > > > > on the 1st small page, we are serialized against other collapse > > > > contexts. Before the 1st small page is unlocked, we add it > > > > back to the pagecache and unfreeze the refcount appropriately. > > > > Contexts that lost the race to the 1st small page will then find > > > > the same 1st small page with the correct refcount and will be > > > > able to proceed.
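
As a quick illustration of how use case (1) would be driven from userspace, here is a minimal, untested sketch. The binary path, mapping size, and error handling below are made up for illustration only, and MADV_COLLAPSE is assumed to carry the advice value added by the uapi patch earlier in this series:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* assumed value from this series' uapi patch */
#endif

int main(void)
{
	/* Hypothetical large executable; any file-backed, exec mapping
	 * eligible under CONFIG_READ_ONLY_THP_FOR_FS would do. */
	int fd = open("/usr/bin/some-large-binary", O_RDONLY);
	size_t len = 2UL << 20;		/* one PMD-sized extent */
	void *text;

	if (fd < 0)
		return 1;

	text = mmap(NULL, len, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0);
	if (text == MAP_FAILED)
		return 1;

	/*
	 * Synchronously collapse the extent: on success the range is
	 * pmd-mapped immediately, instead of waiting for khugepaged to
	 * eventually find and collapse it.
	 */
	if (madvise(text, len, MADV_COLLAPSE))
		perror("madvise(MADV_COLLAPSE)");

	munmap(text, len);
	close(fd);
	return 0;
}

The same call pattern covers the live migration case (2): once guest memory has been fully populated on the new host, the VMM can walk its mappings and issue MADV_COLLAPSE on each hugepage-aligned extent.
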
> > > > > > > > Signed-off-by: Zach O'Keefe <zokeefe@xxxxxxxxxx> > > > > --- > > > > include/linux/khugepaged.h | 13 +- > > > > include/trace/events/huge_memory.h | 1 + > > > > kernel/events/uprobes.c | 2 +- > > > > mm/khugepaged.c | 238 ++++++++++++++++++++++------- > > > > 4 files changed, 194 insertions(+), 60 deletions(-) > > > > > > > > diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h > > > > index 384f034ae947..70162d707caf 100644 > > > > --- a/include/linux/khugepaged.h > > > > +++ b/include/linux/khugepaged.h > > > > @@ -16,11 +16,13 @@ extern void khugepaged_enter_vma(struct vm_area_struct *vma, > > > > unsigned long vm_flags); > > > > extern void khugepaged_min_free_kbytes_update(void); > > > > #ifdef CONFIG_SHMEM > > > > -extern void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr); > > > > +extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, > > > > + bool install_pmd); > > > > #else > > > > -static inline void collapse_pte_mapped_thp(struct mm_struct *mm, > > > > - unsigned long addr) > > > > +static inline int collapse_pte_mapped_thp(struct mm_struct *mm, > > > > + unsigned long addr, bool install_pmd) > > > > { > > > > + return 0; > > > > } > > > > #endif > > > > > > > > @@ -46,9 +48,10 @@ static inline void khugepaged_enter_vma(struct vm_area_struct *vma, > > > > unsigned long vm_flags) > > > > { > > > > } > > > > -static inline void collapse_pte_mapped_thp(struct mm_struct *mm, > > > > - unsigned long addr) > > > > +static inline int collapse_pte_mapped_thp(struct mm_struct *mm, > > > > + unsigned long addr, bool install_pmd) > > > > { > > > > + return 0; > > > > } > > > > > > > > static inline void khugepaged_min_free_kbytes_update(void) > > > > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h > > > > index fbbb25494d60..df33453b70fc 100644 > > > > --- a/include/trace/events/huge_memory.h > > > > +++ b/include/trace/events/huge_memory.h > > > > @@ -11,6 +11,7 @@ > > > > EM( SCAN_FAIL, "failed") \ > > > > EM( SCAN_SUCCEED, "succeeded") \ > > > > EM( SCAN_PMD_NULL, "pmd_null") \ > > > > + EM( SCAN_PMD_NONE, "pmd_none") \ > > > > EM( SCAN_PMD_MAPPED, "page_pmd_mapped") \ > > > > EM( SCAN_EXCEED_NONE_PTE, "exceed_none_pte") \ > > > > EM( SCAN_EXCEED_SWAP_PTE, "exceed_swap_pte") \ > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > > > > index e0a9b945e7bc..d9e357b7e17c 100644 > > > > --- a/kernel/events/uprobes.c > > > > +++ b/kernel/events/uprobes.c > > > > @@ -555,7 +555,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, > > > > > > > > /* try collapse pmd for compound page */ > > > > if (!ret && orig_page_huge) > > > > - collapse_pte_mapped_thp(mm, vaddr); > > > > + collapse_pte_mapped_thp(mm, vaddr, false); > > > > > > > > return ret; > > > > } > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > > > > index 31ccf49cf279..66457a06b4e7 100644 > > > > --- a/mm/khugepaged.c > > > > +++ b/mm/khugepaged.c > > > > @@ -29,6 +29,7 @@ enum scan_result { > > > > SCAN_FAIL, > > > > SCAN_SUCCEED, > > > > SCAN_PMD_NULL, > > > > + SCAN_PMD_NONE, > > > > SCAN_PMD_MAPPED, > > > > SCAN_EXCEED_NONE_PTE, > > > > SCAN_EXCEED_SWAP_PTE, > > > > @@ -838,6 +839,18 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address, > > > > if (!hugepage_vma_check(vma, vma->vm_flags, false, false, > > > > cc->is_khugepaged)) > > > > return SCAN_VMA_CHECK; > > > > + return SCAN_SUCCEED; > > > > +} > > > > + > > > > +static int 
hugepage_vma_revalidate_anon(struct mm_struct *mm, > > > > Hey Yang, > > > > Thanks for taking the time to review this series - particularly this patch, > > which I found tricky. > > > > > > > > Do we really need a new function for anon vma dedicatedly? Can't we > > > add a parameter to hugepage_vma_revalidate()? > > > > > > > Good point - at some point I think I utilized it more, but you're right that > > it it's overkill now. Have added a "expect_anon" argument to > > hugepage_vma_revalidate(). Thanks for the suggestions. > > > > > > + unsigned long address, > > > > + struct vm_area_struct **vmap, > > > > + struct collapse_control *cc) > > > > +{ > > > > + int ret = hugepage_vma_revalidate(mm, address, vmap, cc); > > > > + > > > > + if (ret != SCAN_SUCCEED) > > > > + return ret; > > > > /* > > > > * Anon VMA expected, the address may be unmapped then > > > > * remapped to file after khugepaged reaquired the mmap_lock. > > > > @@ -845,8 +858,8 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address, > > > > * hugepage_vma_check may return true for qualified file > > > > * vmas. > > > > */ > > > > - if (!vma->anon_vma || !vma_is_anonymous(vma)) > > > > - return SCAN_VMA_CHECK; > > > > + if (!(*vmap)->anon_vma || !vma_is_anonymous(*vmap)) > > > > + return SCAN_PAGE_ANON; > > > > return SCAN_SUCCEED; > > > > } > > > > > > > > @@ -866,8 +879,8 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm, > > > > /* See comments in pmd_none_or_trans_huge_or_clear_bad() */ > > > > barrier(); > > > > #endif > > > > - if (!pmd_present(pmde)) > > > > - return SCAN_PMD_NULL; > > > > + if (pmd_none(pmde)) > > > > + return SCAN_PMD_NONE; > > > > if (pmd_trans_huge(pmde)) > > > > return SCAN_PMD_MAPPED; > > > > if (pmd_bad(pmde)) > > > > @@ -995,7 +1008,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address, > > > > goto out_nolock; > > > > > > > > mmap_read_lock(mm); > > > > - result = hugepage_vma_revalidate(mm, address, &vma, cc); > > > > + result = hugepage_vma_revalidate_anon(mm, address, &vma, cc); > > > > if (result != SCAN_SUCCEED) { > > > > mmap_read_unlock(mm); > > > > goto out_nolock; > > > > @@ -1026,7 +1039,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address, > > > > * handled by the anon_vma lock + PG_lock. > > > > */ > > > > mmap_write_lock(mm); > > > > - result = hugepage_vma_revalidate(mm, address, &vma, cc); > > > > + result = hugepage_vma_revalidate_anon(mm, address, &vma, cc); > > > > if (result != SCAN_SUCCEED) > > > > goto out_up_write; > > > > /* check if the pmd is still valid */ > > > > @@ -1332,13 +1345,44 @@ static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm, > > > > slot = mm_slot_lookup(mm_slots_hash, mm); > > > > mm_slot = mm_slot_entry(slot, struct khugepaged_mm_slot, slot); > > > > if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP)) { > > > > + int i; > > > > + /* > > > > + * Multiple callers may be adding entries here. Do a quick > > > > + * check to see the entry hasn't already been added by someone > > > > + * else. > > > > + */ > > > > + for (i = 0; i < mm_slot->nr_pte_mapped_thp; ++i) > > > > + if (mm_slot->pte_mapped_thp[i] == addr) > > > > + goto out; > > > > > > I don't quite get why we need this. I'm supposed just khugepaged could > > > add the addr to the array and MADV_COLLAPSE just handles pte-mapped > > > hugepage immediately IIRC, right? If so there is actually no change on > > > khugepaged side. 
> > > > > > > So you're right to say that this change isn't needed. The "multi-add" > > sequence is: > > > > (1) khugepaged calls khugepaged_collapse_pte_mapped_thps() for mm_struct A, > > emptying the A's ->pte_mapped_thp[] array. > > (2) MADV_COLLAPSE collapses some file extent with target mm_struct B, and > > retract_page_tables() finds a VMA in mm_struct A mapping the same extent > > (at virtual address X) and adds an entry (for X) into mm_struct A's > > ->pte-mapped_thp[] array. > > (3) khugepaged calls khugepagedge_collapse_scan_file() for mm_struct A at X, > > sees a pte-mapped THP (SCAN_PTE_MAPPED_HUGEPAGE) and adds an entry (for X) > > into mm_struct A's ->pte-mapped_thp[] array. > > > > Which is somewhat contrived/rare - but it can occur. If we don't have this, > > the second time we call collapse_pte_mapped_thp() for the same > > mm_struct/address, we should take the "if (result == SCAN_PMD_MAPPED) {...}" > > branch early and return before grabbing any other locks (we already have > > exclusive mmap_lock). So, perhaps we can drop this check? > > > > > > mm_slot->pte_mapped_thp[mm_slot->nr_pte_mapped_thp++] = addr; > > > > ret = true; > > > > } > > > > +out: > > > > spin_unlock(&khugepaged_mm_lock); > > > > return ret; > > > > } > > > > > > > > +/* hpage must be locked, and mmap_lock must be held in write */ > > > > +static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr, > > > > + pmd_t *pmdp, struct page *hpage) > > > > +{ > > > > + struct vm_fault vmf = { > > > > + .vma = vma, > > > > + .address = addr, > > > > + .flags = 0, > > > > + .pmd = pmdp, > > > > + }; > > > > + > > > > + VM_BUG_ON(!PageTransHuge(hpage)); > > > > + mmap_assert_write_locked(vma->vm_mm); > > > > + > > > > + if (do_set_pmd(&vmf, hpage)) > > > > + return SCAN_FAIL; > > > > + > > > > + get_page(hpage); > > > > + return SCAN_SUCCEED; > > > > +} > > > > + > > > > static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *vma, > > > > unsigned long addr, pmd_t *pmdp) > > > > { > > > > @@ -1360,12 +1404,14 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v > > > > * > > > > * @mm: process address space where collapse happens > > > > * @addr: THP collapse address > > > > + * @install_pmd: If a huge PMD should be installed > > > > * > > > > * This function checks whether all the PTEs in the PMD are pointing to the > > > > * right THP. If so, retract the page table so the THP can refault in with > > > > - * as pmd-mapped. > > > > + * as pmd-mapped. Possibly install a huge PMD mapping the THP. 
> > > > */ > > > > -void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr) > > > > +int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, > > > > + bool install_pmd) > > > > { > > > > unsigned long haddr = addr & HPAGE_PMD_MASK; > > > > struct vm_area_struct *vma = vma_lookup(mm, haddr); > > > > @@ -1380,12 +1426,12 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr) > > > > > > > > /* Fast check before locking page if already PMD-mapped */ > > > > result = find_pmd_or_thp_or_none(mm, haddr, &pmd); > > > > - if (result != SCAN_SUCCEED) > > > > - return; > > > > + if (result == SCAN_PMD_MAPPED) > > > > + return result; > > > > > > > > if (!vma || !vma->vm_file || > > > > !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE)) > > > > - return; > > > > + return SCAN_VMA_CHECK; > > > > > > > > /* > > > > * If we are here, we've succeeded in replacing all the native pages > > > > @@ -1395,24 +1441,39 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr) > > > > * analogously elide sysfs THP settings here. > > > > */ > > > > if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false)) > > > > - return; > > > > + return SCAN_VMA_CHECK; > > > > > > > > /* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */ > > > > if (userfaultfd_wp(vma)) > > > > - return; > > > > + return SCAN_PTE_UFFD_WP; > > > > > > > > hpage = find_lock_page(vma->vm_file->f_mapping, > > > > linear_page_index(vma, haddr)); > > > > if (!hpage) > > > > - return; > > > > + return SCAN_PAGE_NULL; > > > > > > > > - if (!PageHead(hpage)) > > > > + if (!PageHead(hpage)) { > > > > + result = SCAN_FAIL; > > > > > > I don't think you could trust this must be a HPAGE_PMD_ORDER hugepage > > > anymore since the vma might point to a different file, so a different > > > page cache. And the current kernel does support arbitrary order of > > > large foios for page cache. [...] > > > > Good catch! Yes, I think we need to double check HPAGE_PMD_ORDER here, > > and that applies equally to khugepaged as well. > > BTW, it should be better to have a separate patch to fix this issue as > a prerequisite of this series. Yes, good idea. Will send that patch out hopefully by EOD. Best, Zach > > > > > [...] The below pte traverse may remove rmap for > > > the wrong page IIUC. Khugepaged should experience the same problem as > > > well. > > > > > > > Just to confirm, you mean this is only a danger if we don't check the compound > > order, correct? I.e. if compound_order < HPAGE_PMD_ORDER we'll iterate over > > ptes that map something other than our compound page and erroneously adjust rmap > > for wrong pages. So, adding a check for compound_order == HPAGE_PMD_ORDER above > > alleviates this possibility. > > > > > > goto drop_hpage; > > > > + } > > > > > > > > - if (find_pmd_or_thp_or_none(mm, haddr, &pmd) != SCAN_SUCCEED) > > > > + result = find_pmd_or_thp_or_none(mm, haddr, &pmd); > > > > + switch (result) { > > > > + case SCAN_SUCCEED: > > > > + break; > > > > + case SCAN_PMD_NONE: > > > > + /* > > > > + * In MADV_COLLAPSE path, possible race with khugepaged where > > > > + * all pte entries have been removed and pmd cleared. If so, > > > > + * skip all the pte checks and just update the pmd mapping. 
> > > > + */ > > > > + goto maybe_install_pmd; > > > > + default: > > > > goto drop_hpage; > > > > + } > > > > > > > > start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl); > > > > + result = SCAN_FAIL; > > > > > > > > /* step 1: check all mapped PTEs are to the right huge page */ > > > > for (i = 0, addr = haddr, pte = start_pte; > > > > @@ -1424,8 +1485,10 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr) > > > > continue; > > > > > > > > /* page swapped out, abort */ > > > > - if (!pte_present(*pte)) > > > > + if (!pte_present(*pte)) { > > > > + result = SCAN_PTE_NON_PRESENT; > > > > goto abort; > > > > + } > > > > > > > > page = vm_normal_page(vma, addr, *pte); > > > > if (WARN_ON_ONCE(page && is_zone_device_page(page))) > > > > @@ -1460,12 +1523,19 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr) > > > > add_mm_counter(vma->vm_mm, mm_counter_file(hpage), -count); > > > > } > > > > > > > > - /* step 4: collapse pmd */ > > > > + /* step 4: remove pte entries */ > > > > > > It also collapses and flushes pmd. > > > > > > > True, will update the comment. > > > > Thanks again for your time, > > Zach > > > > > > collapse_and_free_pmd(mm, vma, haddr, pmd); > > > > + > > > > +maybe_install_pmd: > > > > + /* step 5: install pmd entry */ > > > > + result = install_pmd > > > > + ? set_huge_pmd(vma, haddr, pmd, hpage) > > > > + : SCAN_SUCCEED; > > > > + > > > > drop_hpage: > > > > unlock_page(hpage); > > > > put_page(hpage); > > > > - return; > > > > + return result; > > > > > > > > abort: > > > > pte_unmap_unlock(start_pte, ptl); > > > > @@ -1488,22 +1558,29 @@ static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_sl > > > > goto out; > > > > > > > > for (i = 0; i < mm_slot->nr_pte_mapped_thp; i++) > > > > - collapse_pte_mapped_thp(mm, mm_slot->pte_mapped_thp[i]); > > > > + collapse_pte_mapped_thp(mm, mm_slot->pte_mapped_thp[i], false); > > > > > > > > out: > > > > mm_slot->nr_pte_mapped_thp = 0; > > > > mmap_write_unlock(mm); > > > > } > > > > > > > > -static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff) > > > > +static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff, > > > > + struct mm_struct *target_mm, > > > > + unsigned long target_addr, struct page *hpage, > > > > + struct collapse_control *cc) > > > > { > > > > struct vm_area_struct *vma; > > > > - struct mm_struct *mm; > > > > - unsigned long addr; > > > > - pmd_t *pmd; > > > > + int target_result = SCAN_FAIL; > > > > > > > > i_mmap_lock_write(mapping); > > > > vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { > > > > + int result = SCAN_FAIL; > > > > + struct mm_struct *mm = NULL; > > > > + unsigned long addr = 0; > > > > + pmd_t *pmd; > > > > + bool is_target = false; > > > > + > > > > /* > > > > * Check vma->anon_vma to exclude MAP_PRIVATE mappings that > > > > * got written to. These VMAs are likely not worth investing > > > > @@ -1520,24 +1597,34 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff) > > > > * ptl. It has higher chance to recover THP for the VMA, but > > > > * has higher cost too. 
> > > > */ > > > > - if (vma->anon_vma) > > > > - continue; > > > > + if (vma->anon_vma) { > > > > + result = SCAN_PAGE_ANON; > > > > + goto next; > > > > + } > > > > addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); > > > > - if (addr & ~HPAGE_PMD_MASK) > > > > - continue; > > > > - if (vma->vm_end < addr + HPAGE_PMD_SIZE) > > > > - continue; > > > > + if (addr & ~HPAGE_PMD_MASK || > > > > + vma->vm_end < addr + HPAGE_PMD_SIZE) { > > > > + result = SCAN_VMA_CHECK; > > > > + goto next; > > > > + } > > > > mm = vma->vm_mm; > > > > - if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED) > > > > - continue; > > > > + is_target = mm == target_mm && addr == target_addr; > > > > + result = find_pmd_or_thp_or_none(mm, addr, &pmd); > > > > + if (result != SCAN_SUCCEED) > > > > + goto next; > > > > /* > > > > * We need exclusive mmap_lock to retract page table. > > > > * > > > > * We use trylock due to lock inversion: we need to acquire > > > > * mmap_lock while holding page lock. Fault path does it in > > > > * reverse order. Trylock is a way to avoid deadlock. > > > > + * > > > > + * Also, it's not MADV_COLLAPSE's job to collapse other > > > > + * mappings - let khugepaged take care of them later. > > > > */ > > > > - if (mmap_write_trylock(mm)) { > > > > + result = SCAN_PTE_MAPPED_HUGEPAGE; > > > > + if ((cc->is_khugepaged || is_target) && > > > > + mmap_write_trylock(mm)) { > > > > /* > > > > * When a vma is registered with uffd-wp, we can't > > > > * recycle the pmd pgtable because there can be pte > > > > @@ -1546,22 +1633,45 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff) > > > > * it'll always mapped in small page size for uffd-wp > > > > * registered ranges. > > > > */ > > > > - if (!hpage_collapse_test_exit(mm) && > > > > - !userfaultfd_wp(vma)) > > > > - collapse_and_free_pmd(mm, vma, addr, pmd); > > > > + if (hpage_collapse_test_exit(mm)) { > > > > + result = SCAN_ANY_PROCESS; > > > > + goto unlock_next; > > > > + } > > > > + if (userfaultfd_wp(vma)) { > > > > + result = SCAN_PTE_UFFD_WP; > > > > + goto unlock_next; > > > > + } > > > > + collapse_and_free_pmd(mm, vma, addr, pmd); > > > > + if (!cc->is_khugepaged && is_target) > > > > + result = set_huge_pmd(vma, addr, pmd, hpage); > > > > + else > > > > + result = SCAN_SUCCEED; > > > > + > > > > +unlock_next: > > > > mmap_write_unlock(mm); > > > > - } else { > > > > - /* Try again later */ > > > > + goto next; > > > > + } > > > > + /* > > > > + * Calling context will handle target mm/addr. Otherwise, let > > > > + * khugepaged try again later. > > > > + */ > > > > + if (!is_target) { > > > > khugepaged_add_pte_mapped_thp(mm, addr); > > > > + continue; > > > > } > > > > +next: > > > > + if (is_target) > > > > + target_result = result; > > > > } > > > > i_mmap_unlock_write(mapping); > > > > + return target_result; > > > > } > > > > > > > > /** > > > > * collapse_file - collapse filemap/tmpfs/shmem pages into huge one. 
> > > > * > > > > * @mm: process address space where collapse happens > > > > + * @addr: virtual collapse start address > > > > * @file: file that collapse on > > > > * @start: collapse start address > > > > * @cc: collapse context and scratchpad > > > > @@ -1581,8 +1691,9 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff) > > > > * + restore gaps in the page cache; > > > > * + unlock and free huge page; > > > > */ > > > > -static int collapse_file(struct mm_struct *mm, struct file *file, > > > > - pgoff_t start, struct collapse_control *cc) > > > > +static int collapse_file(struct mm_struct *mm, unsigned long addr, > > > > + struct file *file, pgoff_t start, > > > > + struct collapse_control *cc) > > > > { > > > > struct address_space *mapping = file->f_mapping; > > > > struct page *hpage; > > > > @@ -1890,7 +2001,8 @@ static int collapse_file(struct mm_struct *mm, struct file *file, > > > > /* > > > > * Remove pte page tables, so we can re-fault the page as huge. > > > > */ > > > > - retract_page_tables(mapping, start); > > > > + result = retract_page_tables(mapping, start, mm, addr, hpage, > > > > + cc); > > > > unlock_page(hpage); > > > > hpage = NULL; > > > > } else { > > > > @@ -1946,8 +2058,9 @@ static int collapse_file(struct mm_struct *mm, struct file *file, > > > > return result; > > > > } > > > > > > > > -static int khugepaged_scan_file(struct mm_struct *mm, struct file *file, > > > > - pgoff_t start, struct collapse_control *cc) > > > > +static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr, > > > > + struct file *file, pgoff_t start, > > > > + struct collapse_control *cc) > > > > { > > > > struct page *page = NULL; > > > > struct address_space *mapping = file->f_mapping; > > > > @@ -2035,7 +2148,7 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file, > > > > result = SCAN_EXCEED_NONE_PTE; > > > > count_vm_event(THP_SCAN_EXCEED_NONE_PTE); > > > > } else { > > > > - result = collapse_file(mm, file, start, cc); > > > > + result = collapse_file(mm, addr, file, start, cc); > > > > } > > > > } > > > > > > > > @@ -2043,8 +2156,9 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file, > > > > return result; > > > > } > > > > #else > > > > -static int khugepaged_scan_file(struct mm_struct *mm, struct file *file, > > > > - pgoff_t start, struct collapse_control *cc) > > > > +static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr, > > > > + struct file *file, pgoff_t start, > > > > + struct collapse_control *cc) > > > > { > > > > BUILD_BUG(); > > > > } > > > > @@ -2142,8 +2256,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result, > > > > khugepaged_scan.address); > > > > > > > > mmap_read_unlock(mm); > > > > - *result = khugepaged_scan_file(mm, file, pgoff, > > > > - cc); > > > > + *result = hpage_collapse_scan_file(mm, > > > > + khugepaged_scan.address, > > > > + file, pgoff, cc); > > > > mmap_locked = false; > > > > fput(file); > > > > } else { > > > > @@ -2449,10 +2564,6 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev, > > > > > > > > *prev = vma; > > > > > > > > - /* TODO: Support file/shmem */ > > > > - if (!vma->anon_vma || !vma_is_anonymous(vma)) > > > > - return -EINVAL; > > > > - > > > > if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false)) > > > > return -EINVAL; > > > > > > > > @@ -2483,16 +2594,35 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev, > > > > } > 
> > > mmap_assert_locked(mm); > > > > memset(cc->node_load, 0, sizeof(cc->node_load)); > > > > - result = hpage_collapse_scan_pmd(mm, vma, addr, &mmap_locked, > > > > - cc); > > > > + if (IS_ENABLED(CONFIG_SHMEM) && vma->vm_file) { > > > > + struct file *file = get_file(vma->vm_file); > > > > + pgoff_t pgoff = linear_page_index(vma, addr); > > > > + > > > > + mmap_read_unlock(mm); > > > > + mmap_locked = false; > > > > + result = hpage_collapse_scan_file(mm, addr, file, pgoff, > > > > + cc); > > > > + fput(file); > > > > + } else { > > > > + result = hpage_collapse_scan_pmd(mm, vma, addr, > > > > + &mmap_locked, cc); > > > > + } > > > > if (!mmap_locked) > > > > *prev = NULL; /* Tell caller we dropped mmap_lock */ > > > > > > > > +handle_result: > > > > switch (result) { > > > > case SCAN_SUCCEED: > > > > case SCAN_PMD_MAPPED: > > > > ++thps; > > > > break; > > > > + case SCAN_PTE_MAPPED_HUGEPAGE: > > > > + BUG_ON(mmap_locked); > > > > + BUG_ON(*prev); > > > > + mmap_write_lock(mm); > > > > + result = collapse_pte_mapped_thp(mm, addr, true); > > > > + mmap_write_unlock(mm); > > > > + goto handle_result; > > > > /* Whitelisted set of results where continuing OK */ > > > > case SCAN_PMD_NULL: > > > > case SCAN_PTE_NON_PRESENT: > > > > -- > > > > 2.37.2.789.g6183377224-goog > > > >
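
One last note for anyone exercising the file path: a rough way to confirm from userspace that a huge pmd really was installed is to compare FilePmdMapped in /proc/<pid>/smaps before and after the madvise() call (on kernels that report that field). The helper below is only an untested sketch; the function name and parsing are illustrative, not part of this series:

#include <stdio.h>

/*
 * Sum the FilePmdMapped fields (in kB) reported in /proc/self/smaps.
 * A positive delta across madvise(MADV_COLLAPSE) on a file-backed range
 * suggests the range is now mapped by a huge pmd.
 */
static long file_pmd_mapped_kb(void)
{
	char line[256];
	long kb = 0, total = 0;
	FILE *f = fopen("/proc/self/smaps", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "FilePmdMapped: %ld kB", &kb) == 1)
			total += kb;
	fclose(f);
	return total;
}
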