On Mon, Oct 21, 2024 at 03:27:55PM +0200, Vlastimil Babka wrote:
> On 10/20/24 18:20, Lorenzo Stoakes wrote:
> > The existing generic pagewalk logic permits the walking of page tables,
> > invoking callbacks at individual page table levels via user-provided
> > mm_walk_ops callbacks.
> >
> > This is useful for traversing existing page table entries, but precludes
> > the ability to establish new ones.
> >
> > Existing mechanisms for performing a walk which also installs page table
> > entries if necessary are heavily duplicated throughout the kernel, each
> > with semantic differences from one another and largely unavailable for
> > use elsewhere.
> >
> > Rather than add yet another implementation, we extend the generic
> > pagewalk logic to enable the installation of page table entries by
> > adding a new install_pte() callback in mm_walk_ops. If this is
> > specified, then upon encountering a missing page table entry, we
> > allocate and install a new one and continue the traversal.
> >
> > If a THP huge page is encountered, we make use of existing logic to
> > split it. Then once we reach the PTE level, we invoke the install_pte()
> > callback which provides a PTE entry to install. We do not support
> > hugetlb at this stage.
> >
> > If this function returns an error, or an allocation fails during the
> > operation, we abort the operation altogether. It is up to the caller to
> > deal appropriately with partially populated page table ranges.
> >
> > If install_pte() is defined, the semantics of pte_entry() change - this
> > callback is then only invoked if the entry already exists. This is a
> > useful property, as it allows a caller to handle existing PTEs while
> > installing new ones where necessary in the specified range.
> >
> > If install_pte() is not defined, then there is no functional difference
> > to this patch, so all existing logic will work precisely as it did
> > before.
> >
> > As we only permit the installation of PTEs where a mapping does not
> > already exist there is no need for TLB management, however we do invoke
> > update_mmu_cache() for architectures which require manual maintenance
> > of mappings for other CPUs.
> >
> > We explicitly do not allow the existing page walk API to expose this
> > feature as it is dangerous and intended for internal mm use only.
> > Therefore we provide a new walk_page_range_mm() function exposed only
> > to mm/internal.h.
> >
> > Reviewed-by: Jann Horn <jannh@xxxxxxxxxx>
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>
>
> <snip>
>
> >  /*
> >   * We want to know the real level where a entry is located ignoring any
> >   * folding of levels which may be happening. For example if p4d is folded then
> > @@ -29,9 +34,23 @@ static int walk_pte_range_inner(pte_t *pte, unsigned long addr,
> >  	int err = 0;
> >
> >  	for (;;) {
> > -		err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
> > -		if (err)
> > -			break;
> > +		if (ops->install_pte && pte_none(ptep_get(pte))) {
> > +			pte_t new_pte;
> > +
> > +			err = ops->install_pte(addr, addr + PAGE_SIZE, &new_pte,
> > +					       walk);
> > +			if (err)
> > +				break;
> > +
> > +			set_pte_at(walk->mm, addr, pte, new_pte);
>
> While the guard pages install ptes unconditionally, maybe some
> install_pte handler implementation would sometimes want to skip, should
> we define an error code that means it's skipped and just continue
> instead of set_pte_at()? Or leave it until such handler appears.

I'm not sure under what circumstances you'd want to skip though precisely?
There's nothing populated, and the user is defining the range in which to
install a PTE if nothing is populated. If they wanted more complicated
handling they could do multiple, smaller calls.

Things are inherently racy with these walks so there's no benefit in doing
everything at once.
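For concreteness, here is a minimal sketch of what a caller of the new
interface might look like. The callback signatures and the mm_walk_ops
layout follow the patch above; the handler bodies, the example_* names and
the pre-allocated page passed via walk->private are invented purely for
illustration:

static int example_pte_entry(pte_t *pte, unsigned long addr,
			     unsigned long next, struct mm_walk *walk)
{
	/* With install_pte set, only invoked for already-present PTEs. */
	return 0;
}

static int example_install_pte(unsigned long addr, unsigned long next,
			       pte_t *ptep, struct mm_walk *walk)
{
	/* Hypothetical: map a page the caller stashed in walk->private. */
	struct page *page = walk->private;

	/* Hand the new entry back; the walker does the set_pte_at(). */
	*ptep = mk_pte(page, walk->vma->vm_page_prot);
	return 0;	/* Any error returned here aborts the walk. */
}

static const struct mm_walk_ops example_ops = {
	.pte_entry	= example_pte_entry,
	.install_pte	= example_install_pte,
	.walk_lock	= PGWALK_WRLOCK,
};

An mm-internal caller would then invoke walk_page_range_mm(mm, start, end,
&example_ops, page), and per the commit message must itself deal with any
partially populated range if the walk errors out part way.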
> > +			/* Non-present before, so for arches that need it. */
> > +			if (!WARN_ON_ONCE(walk->no_vma))
> > +				update_mmu_cache(walk->vma, addr, pte);
> > +		} else {
> > +			err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
> > +			if (err)
> > +				break;
> > +		}
> >  		if (addr >= end - PAGE_SIZE)
> >  			break;
> >  		addr += PAGE_SIZE;
> > @@ -89,11 +108,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> >  again:
> >  		next = pmd_addr_end(addr, end);
> >  		if (pmd_none(*pmd)) {
> > -			if (ops->pte_hole)
> > +			if (ops->install_pte)
> > +				err = __pte_alloc(walk->mm, pmd);
> > +			else if (ops->pte_hole)
> >  				err = ops->pte_hole(addr, next, depth, walk);
> >  			if (err)
> >  				break;
> > -			continue;
> > +			if (!ops->install_pte)
> > +				continue;
> >  		}
> >
> >  		walk->action = ACTION_SUBTREE;
> > @@ -116,7 +138,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> >  		 */
> >  		if ((!walk->vma && (pmd_leaf(*pmd) || !pmd_present(*pmd))) ||
> >  		    walk->action == ACTION_CONTINUE ||
> > -		    !(ops->pte_entry))
> > +		    !(ops->pte_entry || ops->install_pte))
> >  			continue;
>
> BTW, I find it hard to read this condition even before your patch, oh well.

Agreed, this badly needs refactoring, but it felt out of scope for this
change.

> But if I read it correctly, does it mean we're going to split a
> pmd-mapped THP if we have an install_pte handler? But is that really
> necessary if the pmd splitting results in all ptes populated, and thus
> the install_pte handler can't do anything with any pte anyway?

Yes... however nothing else here in the logic has special handling for
transhuge pages AND there is already an interface provided to prevent this
if you want, which we use in commit 3/5 - that is, to provide pud, pmd
walkers that set walk->action = ACTION_CONTINUE if transhuge (sketched at
the end of this mail).

Having said that, it kind of sucks that we are doing a useless split here.
Hmm.

In the pte_entry() case you might very well want to split and do something
with the PTE. With the install you are only interested if it's
non-present...

It's not _completely_ infeasible that a user would want this, but it's
very unlikely.

OK so yeah let's add it and clean up this expression while we're at it,
will fix on respin.

> >  		if (walk->vma)
> > @@ -148,11 +170,14 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
> >  again:
> >  		next = pud_addr_end(addr, end);
> >  		if (pud_none(*pud)) {
> > -			if (ops->pte_hole)
> > +			if (ops->install_pte)
> > +				err = __pmd_alloc(walk->mm, pud, addr);
> > +			else if (ops->pte_hole)
> >  				err = ops->pte_hole(addr, next, depth, walk);
> >  			if (err)
> >  				break;
> > -			continue;
> > +			if (!ops->install_pte)
> > +				continue;
> >  		}
> >
> >  		walk->action = ACTION_SUBTREE;
> > @@ -167,7 +192,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
> >
> >  		if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) ||
> >  		    walk->action == ACTION_CONTINUE ||
> > -		    !(ops->pmd_entry || ops->pte_entry))
> > +		    !(ops->pmd_entry || ops->pte_entry || ops->install_pte))
> >  			continue;
>
> Ditto?
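As for the ACTION_CONTINUE approach referenced above: the actual walkers
live in commit 3/5 and may well differ, but roughly such handlers might
look like the following, skipping huge entries so the walker never reaches
the split:

static int skip_thp_pud_entry(pud_t *pud, unsigned long addr,
			      unsigned long next, struct mm_walk *walk)
{
	/*
	 * A huge PUD is fully populated, so install_pte() would have
	 * nothing to do below it - skip the subtree rather than split.
	 */
	if (pud_trans_huge(pudp_get(pud)))
		walk->action = ACTION_CONTINUE;

	return 0;
}

static int skip_thp_pmd_entry(pmd_t *pmd, unsigned long addr,
			      unsigned long next, struct mm_walk *walk)
{
	/* Likewise for a pmd-mapped THP. */
	if (pmd_trans_huge(pmdp_get(pmd)))
		walk->action = ACTION_CONTINUE;

	return 0;
}

These would simply be wired up as .pud_entry and .pmd_entry in the same
mm_walk_ops that supplies .install_pte.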