On Wed, Dec 4, 2024 at 11:36 PM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote: > > On Wed, 4 Dec 2024 19:09:49 +0800 Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx> wrote: > > As a first step, this commit aims to synchronously free the empty PTE > > pages in madvise(MADV_DONTNEED) case. We will detect and free empty PTE > > pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude > > cases other than madvise(MADV_DONTNEED). > > > > Once an empty PTE is detected, we first try to hold the pmd lock within > > the pte lock. If successful, we clear the pmd entry directly (fast path). > > Otherwise, we wait until the pte lock is released, then re-hold the pmd > > and pte locks and loop PTRS_PER_PTE times to check pte_none() to re-detect > > whether the PTE page is empty and free it (slow path). > > "wait until the pte lock is released" sounds nasty. I'm not > immediately seeing the code which does this. PLease provide more > description? It's worded a bit confusingly, but it's fine; a better description might be "if try_get_and_clear_pmd() fails to trylock the PMD lock (against lock order), then later, after we have dropped the PTE lock, try_to_free_pte() takes the PMD and PTE locks in the proper lock order". The "wait until the pte lock is released" part is just supposed to mean that the try_to_free_pte() call is placed after the point where the PTE lock has been dropped (which makes it possible to take the PMD lock). It does not refer to waiting for other threads. > > +void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr, > > + struct mmu_gather *tlb) > > +{ > > + pmd_t pmdval; > > + spinlock_t *pml, *ptl; > > + pte_t *start_pte, *pte; > > + int i; > > + > > + pml = pmd_lock(mm, pmd); > > + start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl); > > + if (!start_pte) > > + goto out_ptl; > > + if (ptl != pml) > > + spin_lock_nested(ptl, SINGLE_DEPTH_NESTING); > > + > > + /* Check if it is empty PTE page */ > > + for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) { > > + if (!pte_none(ptep_get(pte))) > > + goto out_ptl; > > + } > > Are there any worst-case situations in which we'll spend uncceptable > mounts of time running this loop? This loop is just over a single page table, that should be no more expensive than what we already do in other common paths like zap_pte_range().