Thanks, Ryan, Barry, David!

Best,
Lance

On Tue, Feb 27, 2024 at 6:53 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>
> On 27/02/2024 10:30, David Hildenbrand wrote:
> > On 27.02.24 11:21, Lance Yang wrote:
> >> On Tue, Feb 27, 2024 at 5:14 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >>>
> >>> On 27.02.24 10:07, Ryan Roberts wrote:
> >>>> On 27/02/2024 02:40, Barry Song wrote:
> >>>>> From: Barry Song <v-songbaohua@xxxxxxxx>
> >>>>>
> >>>>> madvise and some others might need folio_pte_batch to check if a range
> >>>>> of PTEs is completely mapped to a large folio with contiguous physical
> >>>>> addresses. Let's export it for others to use.
> >>>>>
> >>>>> Cc: Lance Yang <ioworker0@xxxxxxxxx>
> >>>>> Cc: Ryan Roberts <ryan.roberts@xxxxxxx>
> >>>>> Cc: David Hildenbrand <david@xxxxxxxxxx>
> >>>>> Cc: Yin Fengwei <fengwei.yin@xxxxxxxxx>
> >>>>> Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx>
> >>>>> ---
> >>>>> -v1:
> >>>>>  at least two jobs, madv_free and madv_pageout, depend on it. To avoid
> >>>>>  conflicts and dependencies, after discussing with Lance, we prefer
> >>>>>  that this one land earlier.
> >>>>
> >>>> I think this will also ultimately be useful for mprotect too, though I haven't
> >>>> looked at it properly yet.
> >>>>
> >>>
> >>> Yes, I think we briefly discussed that.
> >>>
> >>>>>
> >>>>>  mm/internal.h | 13 +++++++++++++
> >>>>>  mm/memory.c   | 11 +----------
> >>>>>  2 files changed, 14 insertions(+), 10 deletions(-)
> >>>>>
> >>>>> diff --git a/mm/internal.h b/mm/internal.h
> >>>>> index 13b59d384845..8e2bc304f671 100644
> >>>>> --- a/mm/internal.h
> >>>>> +++ b/mm/internal.h
> >>>>> @@ -83,6 +83,19 @@ static inline void *folio_raw_mapping(struct folio *folio)
> >>>>>  	return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
> >>>>>  }
> >>>>>
> >>>>> +/* Flags for folio_pte_batch(). */
> >>>>> +typedef int __bitwise fpb_t;
> >>>>> +
> >>>>> +/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
> >>>>> +#define FPB_IGNORE_DIRTY		((__force fpb_t)BIT(0))
> >>>>> +
> >>>>> +/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
> >>>>> +#define FPB_IGNORE_SOFT_DIRTY		((__force fpb_t)BIT(1))
> >>>>> +
> >>>>> +extern int folio_pte_batch(struct folio *folio, unsigned long addr,
> >>>>> +		pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
> >>>>> +		bool *any_writable);
> >>>>> +
> >>>>>  void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
> >>>>>  					int nr_throttled);
> >>>>>  static inline void acct_reclaim_writeback(struct folio *folio)
> >>>>> diff --git a/mm/memory.c b/mm/memory.c
> >>>>> index 1c45b6a42a1b..319b3be05e75 100644
> >>>>> --- a/mm/memory.c
> >>>>> +++ b/mm/memory.c
> >>>>> @@ -953,15 +953,6 @@ static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma,
> >>>>>  	set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
> >>>>>  }
> >>>>>
> >>>>> -/* Flags for folio_pte_batch(). */
> >>>>> -typedef int __bitwise fpb_t;
> >>>>> -
> >>>>> -/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
> >>>>> -#define FPB_IGNORE_DIRTY		((__force fpb_t)BIT(0))
> >>>>> -
> >>>>> -/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
> >>>>> -#define FPB_IGNORE_SOFT_DIRTY		((__force fpb_t)BIT(1))
> >>>>> -
> >>>>>  static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> >>>>>  {
> >>>>>  	if (flags & FPB_IGNORE_DIRTY)
> >>>>> @@ -982,7 +973,7 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> >>>>>   * If "any_writable" is set, it will indicate if any other PTE besides the
> >>>>>   * first (given) PTE is writable.
> >>>>>   */
> >>>>
> >>>> David was talking in Lance's patch thread about improving the docs for this
> >>>> function now that it's exported. Might be worth syncing on that.
> >>>
> >>> Here is my take:
> >>>
> >>> Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
> >>> ---
> >>>  mm/memory.c | 22 ++++++++++++++++++----
> >>>  1 file changed, 18 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/mm/memory.c b/mm/memory.c
> >>> index d0b855a1837a8..098356b8805ae 100644
> >>> --- a/mm/memory.c
> >>> +++ b/mm/memory.c
> >>> @@ -971,16 +971,28 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> >>>  	return pte_wrprotect(pte_mkold(pte));
> >>>  }
> >>>
> >>> -/*
> >>> +/**
> >>> + * folio_pte_batch - detect a PTE batch for a large folio
> >>> + * @folio: The large folio to detect a PTE batch for.
> >>> + * @addr: The user virtual address the first page is mapped at.
> >>> + * @start_ptep: Page table pointer for the first entry.
> >>> + * @pte: Page table entry for the first page.
> >>
> >> Nit:
> >>
> >> - * @pte: Page table entry for the first page.
> >> + * @pte: Page table entry for the first page that must be the first subpage of
> >> + *	 the folio excluding arm64 for now.
> >>
> >> IIUC, pte_batch_hint is always 1 excluding arm64 for now.
> >> I'm not sure if this modification will be helpful.
> >
> > IIRC, Ryan made sure that this also works when passing another subpage after
> > cont-pte is set. Otherwise this would already be broken for fork/zap.
> >
> > So I don't think this comment would actually be correct.
>
> Indeed, the spec for the function is exactly the same for arm64 as for other
> arches. It's just that arm64 can accelerate the implementation by skipping
> forward to the next contpte boundary when the current pte is part of a contpte
> block.
>
> There is no requirement for pte (or addr or start_ptep) to point to the first
> subpage of a folio - they can point to any subpage.
>
> pte, addr and start_ptep must all refer to the same entry, but I think that's
> clear from the existing text.
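

For readers unfamiliar with the API being exported, here is a minimal sketch of
how a caller such as the madv_free/madv_pageout work mentioned above might
consume folio_pte_batch(). It is illustrative only: walk_pte_range_batched() is
a hypothetical helper, the PTE table is assumed to be mapped and locked by the
caller, and this is not the actual madvise patch from this series.

/*
 * Hypothetical caller: walk [addr, end) and advance in folio-sized
 * batches wherever a large folio is mapped by consecutive PTEs.
 * Assumes the caller mapped the PTE table and holds the PTE lock.
 */
static void walk_pte_range_batched(struct vm_area_struct *vma, pte_t *ptep,
				   unsigned long addr, unsigned long end)
{
	while (addr < end) {
		pte_t pte = ptep_get(ptep);
		int max_nr = (end - addr) >> PAGE_SHIFT;
		int nr = 1;

		if (pte_present(pte)) {
			struct folio *folio = vm_normal_folio(vma, addr, pte);

			/*
			 * folio_pte_batch() returns how many consecutive PTEs,
			 * up to max_nr, map consecutive pages of this folio;
			 * with these flags, dirty and soft-dirty bits are
			 * ignored when comparing entries.
			 */
			if (folio && folio_test_large(folio))
				nr = folio_pte_batch(folio, addr, ptep, pte,
						max_nr, FPB_IGNORE_DIRTY |
						FPB_IGNORE_SOFT_DIRTY, NULL);

			/* ... operate on the nr-page batch here ... */
		}

		addr += nr * PAGE_SIZE;
		ptep += nr;
	}
}

Passing FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY keeps a batch from being split
just because some subpages were written, which is typically what madvise-style
users want; any_writable may be NULL when the caller does not care whether
later entries are writable.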
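On Lance's point that pte_batch_hint() is always 1 except on arm64: that
matches the generic fallback from Ryan's contpte series, which this function
builds on. Roughly (quoted from memory, so treat as a sketch rather than the
exact source), the fallback in include/linux/pgtable.h looks like this, and
only arm64 overrides it to report the remaining length of a contpte block:

#ifndef pte_batch_hint
/*
 * Generic fallback: without an architecture override there is no hint
 * that neighbouring PTEs form a hardware-coalesced block, so callers
 * like folio_pte_batch() compare entries one at a time. arm64
 * overrides this to skip ahead to the next contpte boundary.
 */
static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
{
	return 1;
}
#endif

As Ryan notes above, this is purely an implementation accelerator; it does not
change the function's contract on any architecture.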