On 2024/12/5 06:47, Jann Horn wrote:
On Wed, Dec 4, 2024 at 11:36 PM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
On Wed, 4 Dec 2024 19:09:49 +0800 Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx> wrote:
As a first step, this commit aims to synchronously free the empty PTE
pages in madvise(MADV_DONTNEED) case. We will detect and free empty PTE
pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude
cases other than madvise(MADV_DONTNEED).
Once an empty PTE is detected, we first try to hold the pmd lock within
the pte lock. If successful, we clear the pmd entry directly (fast path).
Otherwise, we wait until the pte lock is released, then re-hold the pmd
and pte locks and loop PTRS_PER_PTE times to check pte_none() to re-detect
whether the PTE page is empty and free it (slow path).
"wait until the pte lock is released" sounds nasty. I'm not
immediately seeing the code which does this. PLease provide more
description?
It's worded a bit confusingly, but it's fine; a better description
might be "if try_get_and_clear_pmd() fails to trylock the PMD lock
(against lock order), then later, after we have dropped the PTE lock,
try_to_free_pte() takes the PMD and PTE locks in the proper lock
order".
The "wait until the pte lock is released" part is just supposed to
mean that the try_to_free_pte() call is placed after the point where
the PTE lock has been dropped (which makes it possible to take the PMD
lock). It does not refer to waiting for other threads.
Yes. Thanks for helping to clarify my vague statement!
+void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
+ struct mmu_gather *tlb)
+{
+ pmd_t pmdval;
+ spinlock_t *pml, *ptl;
+ pte_t *start_pte, *pte;
+ int i;
+
+ pml = pmd_lock(mm, pmd);
+ start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
+ if (!start_pte)
+ goto out_ptl;
+ if (ptl != pml)
+ spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+
+ /* Check if it is empty PTE page */
+ for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
+ if (!pte_none(ptep_get(pte)))
+ goto out_ptl;
+ }
Are there any worst-case situations in which we'll spend uncceptable
mounts of time running this loop?
This loop is just over a single page table, that should be no more
expensive than what we already do in other common paths like
zap_pte_range().
Agree.