Note there's a v4 :)

On Mon, Oct 28, 2024 at 10:29:44PM +0200, Jarkko Sakkinen wrote:
> On Wed Oct 23, 2024 at 7:24 PM EEST, Lorenzo Stoakes wrote:
> > The existing generic pagewalk logic permits the walking of page tables,
> > invoking callbacks at individual page table levels via user-provided
> > mm_walk_ops callbacks.
> >
> > This is useful for traversing existing page table entries, but precludes
> > the ability to establish new ones.
> >
> > Existing mechanisms for performing a walk which also installs page table
> > entries if necessary are heavily duplicated throughout the kernel, each
> > with semantic differences from one another and largely unavailable for use
> > elsewhere.
> >
> > Rather than add yet another implementation, we extend the generic pagewalk
> > logic to enable the installation of page table entries by adding a new
> > install_pte() callback in mm_walk_ops. If this is specified, then upon
> > encountering a missing page table entry, we allocate and install a new one
> > and continue the traversal.
> >
> > If a THP huge page is encountered at either the PMD or PUD level we split
> > it only if there is an ops->pte_entry() (or an ops->pmd_entry() at PUD
> > level), otherwise if there is only an ops->install_pte(), we avoid the
> > unnecessary split.
>
> Just for interest: does this mean that the operation is always
> "destructive" (i.e. modifying state) even when install_pte() does not
> do anything, i.e. does the split always happen despite what the callback
> does? Not really an expert on this but this paragraph won't leave me
> alone :-)

Well, no: as I say (perhaps not clearly), if something already exists we
don't even split huge pages. We only do so if you explicitly ask to
examine page table levels below those where a huge page may exist.

The guard page code goes to great lengths to avoid this in all cases and
doesn't split at all.

> >
> > We do not support hugetlb at this stage.
> >
> > If install_pte() returns an error, or an allocation fails during the
> > operation, we abort the operation altogether. It is up to the caller to
> > deal appropriately with partially populated page table ranges.
> >
> > If install_pte() is defined, the semantics of pte_entry() change - this
> > callback is then only invoked if the entry already exists. This is a useful
> > property, as it allows a caller to handle existing PTEs while installing
> > new ones where necessary in the specified range.
> >
> > If install_pte() is not defined, then there is no functional change
> > introduced by this patch, so all existing logic will work precisely as it
> > did before.
> >
> > As we only permit the installation of PTEs where a mapping does not
> > already exist, there is no need for TLB management; however we do invoke
> > update_mmu_cache() for architectures which require manual maintenance of
> > mappings for other CPUs.
> >
> > We explicitly do not allow the existing page walk API to expose this
> > feature as it is dangerous and intended for internal mm use only. Therefore
> > we provide a new walk_page_range_mm() function exposed only to
> > mm/internal.h.
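To make the above concrete, a hypothetical mm-internal caller might look
roughly like the below. To be clear, this is an illustrative sketch only:
zero_fill_install_pte(), zero_fill_ops and zero_fill_range() are made-up
names, not part of this series (the real user is the guard page code):

static int zero_fill_install_pte(unsigned long addr, unsigned long next,
                                 pte_t *ptep, struct mm_walk *walk)
{
        pte_t pte = pfn_pte(my_zero_pfn(addr), walk->vma->vm_page_prot);

        /* Only invoked for holes; hand back the PTE to be installed. */
        *ptep = pte_mkspecial(pte_wrprotect(pte));
        return 0;
}

static const struct mm_walk_ops zero_fill_ops = {
        .install_pte    = zero_fill_install_pte,
        /* Assumes the caller holds the mmap write lock. */
        .walk_lock      = PGWALK_WRLOCK,
};

/* Only reachable from mm-internal code, via mm/internal.h: */
static int zero_fill_range(struct vm_area_struct *vma,
                           unsigned long start, unsigned long end)
{
        return walk_page_range_mm(vma->vm_mm, start, end,
                                  &zero_fill_ops, NULL);
}

Note that since no pte_entry() is set here, an existing THP in the range
is simply skipped rather than split, which is exactly the non-destructive
behaviour discussed above.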
> >
> > Reviewed-by: Jann Horn <jannh@xxxxxxxxxx>
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>
> > ---
> >  include/linux/pagewalk.h |  18 +++-
> >  mm/internal.h            |   6 ++
> >  mm/pagewalk.c            | 227 +++++++++++++++++++++++++++------------
> >  3 files changed, 182 insertions(+), 69 deletions(-)
> >
> > diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
> > index f5eb5a32aeed..9700a29f8afb 100644
> > --- a/include/linux/pagewalk.h
> > +++ b/include/linux/pagewalk.h
> > @@ -25,12 +25,15 @@ enum page_walk_lock {
> >   *			this handler is required to be able to handle
> >   *			pmd_trans_huge() pmds. They may simply choose to
> >   *			split_huge_page() instead of handling it explicitly.
> > - * @pte_entry:		if set, called for each PTE (lowest-level) entry,
> > - *			including empty ones
> > + * @pte_entry:		if set, called for each PTE (lowest-level) entry
> > + *			including empty ones, except if @install_pte is set.
> > + *			If @install_pte is set, @pte_entry is called only for
> > + *			existing PTEs.
> >   * @pte_hole:		if set, called for each hole at all levels,
> >   *			depth is -1 if not known, 0:PGD, 1:P4D, 2:PUD, 3:PMD.
> >   *			Any folded depths (where PTRS_PER_P?D is equal to 1)
> > - *			are skipped.
> > + *			are skipped. If @install_pte is specified, this will
> > + *			not trigger for any populated ranges.
> >   * @hugetlb_entry:	if set, called for each hugetlb entry. This hook
> >   *			function is called with the vma lock held, in order to
> >   *			protect against a concurrent freeing of the pte_t* or
> > @@ -51,6 +54,13 @@ enum page_walk_lock {
> >   * @pre_vma:		if set, called before starting walk on a non-null vma.
> >   * @post_vma:		if set, called after a walk on a non-null vma, provided
> >   *			that @pre_vma and the vma walk succeeded.
> > + * @install_pte:	if set, missing page table entries are installed and
> > + *			thus all levels are always walked in the specified
> > + *			range. This callback is then invoked at the PTE level
> > + *			(having split any THP pages prior), providing the PTE to
> > + *			install. If allocations fail, the walk is aborted. This
> > + *			operation is only available for userland memory. Not
> > + *			usable for hugetlb ranges.
>
> Given that especially walk_page_range_novma() has a bunch of call sites,
> it would not hurt to mention here simply that this is only for mm-internal
> use, without much other explanation.

We explicitly document this in multiple places. A user will very quickly
discover this is not available.

I will adjust this blurb if I need to do a respin.

> >   *
> >   * p?d_entry callbacks are called even if those levels are folded on a
> >   * particular architecture/configuration.
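To spell out the combined semantics with another (again entirely
hypothetical) snippet, reusing zero_fill_install_pte() from my earlier
sketch: when both callbacks are set, install_pte() fills the holes while
pte_entry() sees only entries which already exist; and since a pte_entry()
is present, a THP in the range will be split:

/* Hypothetical: count pre-existing PTEs while populating holes. */
static int count_existing_pte(pte_t *pte, unsigned long addr,
                              unsigned long next, struct mm_walk *walk)
{
        /* With .install_pte set, this fires only for existing PTEs. */
        (*(unsigned long *)walk->private)++;
        return 0;
}

static const struct mm_walk_ops count_and_fill_ops = {
        .pte_entry      = count_existing_pte,
        .install_pte    = zero_fill_install_pte,
        .walk_lock      = PGWALK_WRLOCK,
};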
> > @@ -76,6 +86,8 @@ struct mm_walk_ops {
> >  	int (*pre_vma)(unsigned long start, unsigned long end,
> >  		       struct mm_walk *walk);
> >  	void (*post_vma)(struct mm_walk *walk);
> > +	int (*install_pte)(unsigned long addr, unsigned long next,
> > +			   pte_t *ptep, struct mm_walk *walk);
> >  	enum page_walk_lock walk_lock;
> >  };
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 508f7802dd2b..fb1fb0c984e4 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -12,6 +12,7 @@
> >  #include <linux/mm.h>
> >  #include <linux/mm_inline.h>
> >  #include <linux/pagemap.h>
> > +#include <linux/pagewalk.h>
> >  #include <linux/rmap.h>
> >  #include <linux/swap.h>
> >  #include <linux/swapops.h>
> > @@ -1451,4 +1452,9 @@ static inline void accept_page(struct page *page)
> >  }
> >  #endif /* CONFIG_UNACCEPTED_MEMORY */
> >
> > +/* pagewalk.c */
> > +int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
> > +		unsigned long end, const struct mm_walk_ops *ops,
> > +		void *private);
> > +
> >  #endif	/* __MM_INTERNAL_H */
> > diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> > index 5f9f01532e67..f3cbad384344 100644
> > --- a/mm/pagewalk.c
> > +++ b/mm/pagewalk.c
> > @@ -3,9 +3,14 @@
> >  #include <linux/highmem.h>
> >  #include <linux/sched.h>
> >  #include <linux/hugetlb.h>
> > +#include <linux/mmu_context.h>
> >  #include <linux/swap.h>
> >  #include <linux/swapops.h>
> >
> > +#include <asm/tlbflush.h>
> > +
> > +#include "internal.h"
> > +
> >  /*
> >   * We want to know the real level where a entry is located ignoring any
> >   * folding of levels which may be happening. For example if p4d is folded then
> > @@ -29,9 +34,23 @@ static int walk_pte_range_inner(pte_t *pte, unsigned long addr,
> >  	int err = 0;
> >
> >  	for (;;) {
> > -		err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
> > -		if (err)
> > -			break;
> > +		if (ops->install_pte && pte_none(ptep_get(pte))) {
> > +			pte_t new_pte;
> > +
> > +			err = ops->install_pte(addr, addr + PAGE_SIZE, &new_pte,
> > +					       walk);
> > +			if (err)
> > +				break;
> > +
> > +			set_pte_at(walk->mm, addr, pte, new_pte);
> > +			/* Non-present before, so for arches that need it. */
> > +			if (!WARN_ON_ONCE(walk->no_vma))
> > +				update_mmu_cache(walk->vma, addr, pte);
> > +		} else {
> > +			err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
> > +			if (err)
> > +				break;
> > +		}
> >  		if (addr >= end - PAGE_SIZE)
> >  			break;
> >  		addr += PAGE_SIZE;
> > @@ -89,11 +108,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> >  again:
> >  		next = pmd_addr_end(addr, end);
> >  		if (pmd_none(*pmd)) {
> > -			if (ops->pte_hole)
> > +			if (ops->install_pte)
> > +				err = __pte_alloc(walk->mm, pmd);
> > +			else if (ops->pte_hole)
> >  				err = ops->pte_hole(addr, next, depth, walk);
> >  			if (err)
> >  				break;
> > -			continue;
> > +			if (!ops->install_pte)
> > +				continue;
> >  		}
> >
> >  		walk->action = ACTION_SUBTREE;
> > @@ -109,18 +131,19 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> >
> >  		if (walk->action == ACTION_AGAIN)
> >  			goto again;
> > -
> > -		/*
> > -		 * Check this here so we only break down trans_huge
> > -		 * pages when we _need_ to
> > -		 */
> > -		if ((!walk->vma && (pmd_leaf(*pmd) || !pmd_present(*pmd))) ||
> > -		    walk->action == ACTION_CONTINUE ||
> > -		    !(ops->pte_entry))
> > +		if (walk->action == ACTION_CONTINUE)
> >  			continue;
> > +		if (!ops->install_pte && !ops->pte_entry)
> > +			continue; /* Nothing to do. */
> > +		if (!ops->pte_entry && ops->install_pte &&
> > +		    pmd_present(*pmd) &&
> > +		    (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
> > +			continue; /* Avoid unnecessary split. */
> >
> >  		if (walk->vma)
> >  			split_huge_pmd(walk->vma, pmd, addr);
> > +		else if (pmd_leaf(*pmd) || !pmd_present(*pmd))
> > +			continue; /* Nothing to do. */
> >
> >  		err = walk_pte_range(pmd, addr, next, walk);
> >  		if (err)
> > @@ -148,11 +171,14 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
> >  again:
> >  		next = pud_addr_end(addr, end);
> >  		if (pud_none(*pud)) {
> > -			if (ops->pte_hole)
> > +			if (ops->install_pte)
> > +				err = __pmd_alloc(walk->mm, pud, addr);
> > +			else if (ops->pte_hole)
> >  				err = ops->pte_hole(addr, next, depth, walk);
> >  			if (err)
> >  				break;
> > -			continue;
> > +			if (!ops->install_pte)
> > +				continue;
> >  		}
> >
> >  		walk->action = ACTION_SUBTREE;
> > @@ -164,14 +190,20 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
> >
> >  		if (walk->action == ACTION_AGAIN)
> >  			goto again;
> > -
> > -		if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) ||
> > -		    walk->action == ACTION_CONTINUE ||
> > -		    !(ops->pmd_entry || ops->pte_entry))
> > +		if (walk->action == ACTION_CONTINUE)
> >  			continue;
> > +		if (!ops->install_pte && !ops->pte_entry && !ops->pmd_entry)
> > +			continue; /* Nothing to do. */
> > +		if (!ops->pmd_entry && !ops->pte_entry && ops->install_pte &&
> > +		    pud_present(*pud) &&
> > +		    (pud_trans_huge(*pud) || pud_devmap(*pud)))
> > +			continue; /* Avoid unnecessary split. */
> >
> >  		if (walk->vma)
> >  			split_huge_pud(walk->vma, pud, addr);
> > +		else if (pud_leaf(*pud) || !pud_present(*pud))
> > +			continue; /* Nothing to do. */
> > +
> >  		if (pud_none(*pud))
> >  			goto again;
> >
> > @@ -196,18 +228,22 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
> >  	do {
> >  		next = p4d_addr_end(addr, end);
> >  		if (p4d_none_or_clear_bad(p4d)) {
> > -			if (ops->pte_hole)
> > +			if (ops->install_pte)
> > +				err = __pud_alloc(walk->mm, p4d, addr);
> > +			else if (ops->pte_hole)
> >  				err = ops->pte_hole(addr, next, depth, walk);
> >  			if (err)
> >  				break;
> > -			continue;
> > +			if (!ops->install_pte)
> > +				continue;
> >  		}
> >  		if (ops->p4d_entry) {
> >  			err = ops->p4d_entry(p4d, addr, next, walk);
> >  			if (err)
> >  				break;
> >  		}
> > -		if (ops->pud_entry || ops->pmd_entry || ops->pte_entry)
> > +		if (ops->pud_entry || ops->pmd_entry || ops->pte_entry ||
> > +		    ops->install_pte)
> >  			err = walk_pud_range(p4d, addr, next, walk);
> >  		if (err)
> >  			break;
> > @@ -231,18 +267,22 @@ static int walk_pgd_range(unsigned long addr, unsigned long end,
> >  	do {
> >  		next = pgd_addr_end(addr, end);
> >  		if (pgd_none_or_clear_bad(pgd)) {
> > -			if (ops->pte_hole)
> > +			if (ops->install_pte)
> > +				err = __p4d_alloc(walk->mm, pgd, addr);
> > +			else if (ops->pte_hole)
> >  				err = ops->pte_hole(addr, next, 0, walk);
> >  			if (err)
> >  				break;
> > -			continue;
> > +			if (!ops->install_pte)
> > +				continue;
> >  		}
> >  		if (ops->pgd_entry) {
> >  			err = ops->pgd_entry(pgd, addr, next, walk);
> >  			if (err)
> >  				break;
> >  		}
> > -		if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry || ops->pte_entry)
> > +		if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry ||
> > +		    ops->pte_entry || ops->install_pte)
> >  			err = walk_p4d_range(pgd, addr, next, walk);
> >  		if (err)
> >  			break;
> > @@ -334,6 +374,11 @@ static int __walk_page_range(unsigned long start, unsigned long end,
> >  	int err = 0;
> >  	struct vm_area_struct *vma = walk->vma;
> >  	const struct mm_walk_ops *ops = walk->ops;
> > +	bool is_hugetlb = is_vm_hugetlb_page(vma);
> > +
> > +	/* We do not support hugetlb PTE installation. */
> > +	if (ops->install_pte && is_hugetlb)
> > +		return -EINVAL;
> >
> >  	if (ops->pre_vma) {
> >  		err = ops->pre_vma(start, end, walk);
> > @@ -341,7 +386,7 @@ static int __walk_page_range(unsigned long start, unsigned long end,
> >  		return err;
> >  	}
> >
> > -	if (is_vm_hugetlb_page(vma)) {
> > +	if (is_hugetlb) {
> >  		if (ops->hugetlb_entry)
> >  			err = walk_hugetlb_range(start, end, walk);
> >  	} else
> > @@ -380,47 +425,14 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
> >  #endif
> >  }
> >
> > -/**
> > - * walk_page_range - walk page table with caller specific callbacks
> > - * @mm:		mm_struct representing the target process of page table walk
> > - * @start:	start address of the virtual address range
> > - * @end:	end address of the virtual address range
> > - * @ops:	operation to call during the walk
> > - * @private:	private data for callbacks' usage
> > - *
> > - * Recursively walk the page table tree of the process represented by @mm
> > - * within the virtual address range [@start, @end). During walking, we can do
> > - * some caller-specific works for each entry, by setting up pmd_entry(),
> > - * pte_entry(), and/or hugetlb_entry(). If you don't set up for some of these
> > - * callbacks, the associated entries/pages are just ignored.
> > - * The return values of these callbacks are commonly defined like below:
> > - *
> > - *  - 0  : succeeded to handle the current entry, and if you don't reach the
> > - *         end address yet, continue to walk.
> > - *  - >0 : succeeded to handle the current entry, and return to the caller
> > - *         with caller specific value.
> > - *  - <0 : failed to handle the current entry, and return to the caller
> > - *         with error code.
> > - *
> > - * Before starting to walk page table, some callers want to check whether
> > - * they really want to walk over the current vma, typically by checking
> > - * its vm_flags. walk_page_test() and @ops->test_walk() are used for this
> > - * purpose.
> > - *
> > - * If operations need to be staged before and committed after a vma is walked,
> > - * there are two callbacks, pre_vma() and post_vma(). Note that post_vma(),
> > - * since it is intended to handle commit-type operations, can't return any
> > - * errors.
> > - *
> > - * struct mm_walk keeps current values of some common data like vma and pmd,
> > - * which are useful for the access from callbacks. If you want to pass some
> > - * caller-specific data to callbacks, @private should be helpful.
> > +/*
> > + * See the comment for walk_page_range(), this performs the heavy lifting of the
> > + * operation, only sets no restrictions on how the walk proceeds.
> >   *
> > - * Locking:
> > - *   Callers of walk_page_range() and walk_page_vma() should hold @mm->mmap_lock,
> > - *   because these function traverse vma list and/or access to vma's data.
> > + * We usually restrict the ability to install PTEs, but this functionality is
> > + * available to internal memory management code and provided in mm/internal.h.
> >   */
> > -int walk_page_range(struct mm_struct *mm, unsigned long start,
> > +int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
> >  		unsigned long end, const struct mm_walk_ops *ops,
> >  		void *private)
> >  {
> > @@ -479,6 +491,80 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
> >  	return err;
> >  }
> >
> > +/*
> > + * Determine if the walk operations specified are permitted to be used for a
> > + * page table walk.
> > + *
> > + * This check is performed on all functions which are parameterised by walk
> > + * operations and exposed in include/linux/pagewalk.h.
> > + *
> > + * Internal memory management code can use the walk_page_range_mm() function to
> > + * be able to use all page walking operations.
> > + */
> > +static bool check_ops_valid(const struct mm_walk_ops *ops)
> > +{
> > +	/*
> > +	 * The installation of PTEs is solely under the control of memory
> > +	 * management logic and subject to many subtle locking, security and
> > +	 * cache considerations so we cannot permit other users to do so, and
> > +	 * certainly not for exported symbols.
> > +	 */
> > +	if (ops->install_pte)
> > +		return false;
> > +
> > +	return true;
>
> or "return !ops->install_pte;"
>
> > +}
>
> Alternatively one could consider defining "struct mm_walk_internal_ops",
> which would be only available in internal.h, but I guess there are good
> reasons to do it the way it is.

Yes.

> > +
> > +/**
> > + * walk_page_range - walk page table with caller specific callbacks
> > + * @mm:		mm_struct representing the target process of page table walk
> > + * @start:	start address of the virtual address range
> > + * @end:	end address of the virtual address range
> > + * @ops:	operation to call during the walk
> > + * @private:	private data for callbacks' usage
> > + *
> > + * Recursively walk the page table tree of the process represented by @mm
> > + * within the virtual address range [@start, @end). During walking, we can do
> > + * some caller-specific works for each entry, by setting up pmd_entry(),
> > + * pte_entry(), and/or hugetlb_entry(). If you don't set up for some of these
> > + * callbacks, the associated entries/pages are just ignored.
> > + * The return values of these callbacks are commonly defined like below:
> > + *
> > + *  - 0  : succeeded to handle the current entry, and if you don't reach the
> > + *         end address yet, continue to walk.
> > + *  - >0 : succeeded to handle the current entry, and return to the caller
> > + *         with caller specific value.
> > + *  - <0 : failed to handle the current entry, and return to the caller
> > + *         with error code.
> > + *
> > + * Before starting to walk page table, some callers want to check whether
> > + * they really want to walk over the current vma, typically by checking
> > + * its vm_flags. walk_page_test() and @ops->test_walk() are used for this
> > + * purpose.
> > + *
> > + * If operations need to be staged before and committed after a vma is walked,
> > + * there are two callbacks, pre_vma() and post_vma(). Note that post_vma(),
> > + * since it is intended to handle commit-type operations, can't return any
> > + * errors.
> > + *
> > + * struct mm_walk keeps current values of some common data like vma and pmd,
> > + * which are useful for the access from callbacks. If you want to pass some
> > + * caller-specific data to callbacks, @private should be helpful.
> > + *
> > + * Locking:
> > + *   Callers of walk_page_range() and walk_page_vma() should hold @mm->mmap_lock,
> > + *   because these function traverse vma list and/or access to vma's data.
> > + */
> > +int walk_page_range(struct mm_struct *mm, unsigned long start,
> > +		unsigned long end, const struct mm_walk_ops *ops,
> > +		void *private)
> > +{
> > +	if (!check_ops_valid(ops))
> > +		return -EINVAL;
> > +
> > +	return walk_page_range_mm(mm, start, end, ops, private);
> > +}
> > +
> >  /**
> >   * walk_page_range_novma - walk a range of pagetables not backed by a vma
> >   * @mm:		mm_struct representing the target process of page table walk
> > @@ -494,7 +580,7 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
> >   * walking the kernel pages tables or page tables for firmware.
> >   *
> >   * Note: Be careful to walk the kernel pages tables, the caller may be need to
> > - * take other effective approache (mmap lock may be insufficient) to prevent
> > + * take other effective approaches (mmap lock may be insufficient) to prevent
> >   * the intermediate kernel page tables belonging to the specified address range
> >   * from being freed (e.g. memory hot-remove).
> >   */
> > @@ -513,6 +599,8 @@ int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
> >
> >  	if (start >= end || !walk.mm)
> >  		return -EINVAL;
> > +	if (!check_ops_valid(ops))
> > +		return -EINVAL;
> >
> >  	/*
> >  	 * 1) For walking the user virtual address space:
> > @@ -556,6 +644,8 @@ int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
> >  		return -EINVAL;
> >  	if (start < vma->vm_start || end > vma->vm_end)
> >  		return -EINVAL;
> > +	if (!check_ops_valid(ops))
> > +		return -EINVAL;
> >
> >  	process_mm_walk_lock(walk.mm, ops->walk_lock);
> >  	process_vma_walk_lock(vma, ops->walk_lock);
> > @@ -574,6 +664,8 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
> >
> >  	if (!walk.mm)
> >  		return -EINVAL;
> > +	if (!check_ops_valid(ops))
> > +		return -EINVAL;
> >
> >  	process_mm_walk_lock(walk.mm, ops->walk_lock);
> >  	process_vma_walk_lock(vma, ops->walk_lock);
> > @@ -623,6 +715,9 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
> >  	unsigned long start_addr, end_addr;
> >  	int err = 0;
> >
> > +	if (!check_ops_valid(ops))
> > +		return -EINVAL;
> > +
> >  	lockdep_assert_held(&mapping->i_mmap_rwsem);
> >  	vma_interval_tree_foreach(vma, &mapping->i_mmap, first_index,
> >  			first_index + nr - 1) {
>
> So I took a random patch set in order to learn how to compile the Fedora
> Ark kernel out of an arbitrary upstream tree (mm in this case), hence the
> noise here.
>
> With this goal, which is mainly to be able to do such a thing once or
> twice per release cycle:
>
> Tested-by: Jarkko Sakkinen <jarkko@xxxxxxxxxx>
>
> BR, Jarkko

Thanks!
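P.S. In case it helps anybody following along, the net effect of
check_ops_valid() for callers outside mm/ is simply this (again a
hypothetical snippet, reusing the made-up zero_fill_install_pte() from
above):

/* Illustrative: any walker exposed in include/linux/pagewalk.h now
 * refuses ops which set install_pte. */
static const struct mm_walk_ops bad_ops = {
        .install_pte    = zero_fill_install_pte,
        .walk_lock      = PGWALK_RDLOCK,
};

static int try_from_outside_mm(struct mm_struct *mm,
                               unsigned long start, unsigned long end)
{
        /* Returns -EINVAL: only walk_page_range_mm(), declared in
         * mm/internal.h, accepts ops with install_pte set. */
        return walk_page_range(mm, start, end, &bad_ops, NULL);
}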