On 12/02/2024 13:43, David Hildenbrand wrote:
> On 02.02.24 09:07, Ryan Roberts wrote:
>> Some architectures (e.g. arm64) can tell from looking at a pte, if some
>> follow-on ptes also map contiguous physical memory with the same pgprot.
>> (for arm64, these are contpte mappings).
>>
>> Take advantage of this knowledge to optimize folio_pte_batch() so that
>> it can skip these ptes when scanning to create a batch. By default, if
>> an arch does not opt-in, folio_pte_batch() returns a compile-time 1, so
>> the changes are optimized out and the behaviour is as before.
>>
>> arm64 will opt-in to providing this hint in the next patch, which will
>> greatly reduce the cost of ptep_get() when scanning a range of contptes.
>>
>> Tested-by: John Hubbard <jhubbard@xxxxxxxxxx>
>> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
>> ---
>>   include/linux/pgtable.h | 18 ++++++++++++++++++
>>   mm/memory.c             | 20 +++++++++++++-------
>>   2 files changed, 31 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index 50f32cccbd92..cba31f177d27 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -212,6 +212,24 @@ static inline int pmd_dirty(pmd_t pmd)
>>   #define arch_flush_lazy_mmu_mode() do {} while (0)
>>   #endif
>> +#ifndef pte_batch_hint
>> +/**
>> + * pte_batch_hint - Number of pages that can be added to batch without scanning.
>> + * @ptep: Page table pointer for the entry.
>> + * @pte: Page table entry.
>> + *
>> + * Some architectures know that a set of contiguous ptes all map the same
>> + * contiguous memory with the same permissions. In this case, it can provide a
>> + * hint to aid pte batching without the core code needing to scan every pte.
>
> I think we might want to document here the expectation regarding
> dirty/accessed bits. folio_pte_batch() will ignore dirty bits only with
> FPB_IGNORE_DIRTY. But especially for arm64, it makes sense to ignore them
> always when batching, because the dirty bit may target any pte part of the
> cont-pte group either way.
>
> Maybe something like:
>
> "
> An architecture implementation may only ignore the PTE accessed and dirty bits.
> Further, it may only ignore the dirty bit if that bit is already not
> maintained with precision per PTE inside the hinted batch, and ptep_get()
> would already have to collect it from various PTEs.
> "

I'm proposing to simplify this to:

"
An architecture implementation may ignore the PTE accessed state. Further,
the dirty state must apply atomically to all the PTEs described by the hint.
"

Which I think more accurately describes the requirement. Shout if you disagree.

>
> I think there are some more details to it, but I'm hoping something along
> the lines above is sufficient.
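
To make that concrete, with the simplified wording folded in I'd expect the
generic version to read roughly as below (sketch only; it assumes the trivial
fallback just returns a compile-time 1, per the commit message, so non-opt-in
arches see no change):

#ifndef pte_batch_hint
/**
 * pte_batch_hint - Number of pages that can be added to batch without scanning.
 * @ptep: Page table pointer for the entry.
 * @pte: Page table entry.
 *
 * Some architectures know that a set of contiguous ptes all map the same
 * contiguous memory with the same permissions. In this case, it can provide a
 * hint to aid pte batching without the core code needing to scan every pte.
 *
 * An architecture implementation may ignore the PTE accessed state. Further,
 * the dirty state must apply atomically to all the PTEs described by the hint.
 */
static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
{
        return 1;
}
#endif
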
>
>
>> +
>>   #ifndef pte_advance_pfn
>>   static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
>>   {
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 65fbe4f886c1..902665b27702 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -988,16 +988,21 @@ static inline int folio_pte_batch(struct folio *folio,
>> unsigned long addr,
>>   {
>>           unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
>>           const pte_t *end_ptep = start_ptep + max_nr;
>> -         pte_t expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, 1),
>> flags);
>> -         pte_t *ptep = start_ptep + 1;
>> +         pte_t expected_pte = __pte_batch_clear_ignored(pte, flags);
>> +         pte_t *ptep = start_ptep;
>>           bool writable;
>> +         int nr;
>>           if (any_writable)
>>                   *any_writable = false;
>>           VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>> -         while (ptep != end_ptep) {
>> +         nr = pte_batch_hint(ptep, pte);
>> +         expected_pte = pte_advance_pfn(expected_pte, nr);
>> +         ptep += nr;
>> +
>
> *Maybe* it's easier to get when initializing expected_pte+ptep only once.
>
> Like:
>
> [...]
> pte_t expected_pte, *ptep;
> [...]
>
> nr = pte_batch_hint(start_ptep, pte);
> expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
> ptep = start_ptep + nr;
>
>> +         while (ptep < end_ptep) {
>>                   pte = ptep_get(ptep);
>>                   if (any_writable)
>>                           writable = !!pte_write(pte);
>> @@ -1011,17 +1016,18 @@ static inline int folio_pte_batch(struct folio *folio,
>> unsigned long addr,
>>                    * corner cases the next PFN might fall into a different
>>                    * folio.
>>                    */
>> -                 if (pte_pfn(pte) == folio_end_pfn)
>> +                 if (pte_pfn(pte) >= folio_end_pfn)
>>                           break;
>>                   if (any_writable)
>>                           *any_writable |= writable;
>> -                 expected_pte = pte_advance_pfn(expected_pte, 1);
>> -                 ptep++;
>> +                 nr = pte_batch_hint(ptep, pte);
>> +                 expected_pte = pte_advance_pfn(expected_pte, nr);
>> +                 ptep += nr;
>>           }
>> -         return ptep - start_ptep;
>> +         return min(ptep - start_ptep, max_nr);
>>   }
>
> Acked-by: David Hildenbrand <david@xxxxxxxxxx>
>
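
Just to check I've understood the suggestion to initialize expected_pte and
ptep only once: the entry to folio_pte_batch() would then look something like
this (untested sketch; the loop body stays exactly as in the patch above):

        unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
        const pte_t *end_ptep = start_ptep + max_nr;
        pte_t expected_pte, *ptep;
        bool writable;
        int nr;

        if (any_writable)
                *any_writable = false;

        VM_WARN_ON_FOLIO(!pte_present(pte), folio);

        /* Consume the first pte's batch hint before entering the scan loop. */
        nr = pte_batch_hint(start_ptep, pte);
        expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
        ptep = start_ptep + nr;

        while (ptep < end_ptep) {
                /* ... unchanged ... */
        }

        return min(ptep - start_ptep, max_nr);

Shout if that's not what you meant.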