Hi Ryan, On Tue, Jul 2, 2024 at 9:51 AM Alexandre Ghiti <alex@xxxxxxxx> wrote: > > Hi Ryan, > > On 24/06/2024 10:00, Ryan Roberts wrote: > > On 28/05/2024 09:07, Alexandre Ghiti wrote: > >> Hi Ryan, > >> > >> On 12/05/2024 19:25, Alexandre Ghiti wrote: > >>> Hi Ryan, > >>> > >>> On Fri, May 10, 2024 at 3:49 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote: > >>>> On 08/05/2024 12:34, Alexandre Ghiti wrote: > >>>>> This patchset intends to merge the contiguous ptes hugetlbfs implementation > >>>>> of arm64 and riscv. > >>>>> > >>>>> Both arm64 and riscv support the use of contiguous ptes to map pages that > >>>>> are larger than the default page table size, respectively called contpte > >>>>> and svnapot. > >>>>> > >>>>> The riscv implementation differs from the arm64's in that the LSBs of the > >>>>> pfn of a svnapot pte are used to store the size of the mapping, allowing > >>>>> for future sizes to be added (for now only 64KB is supported). That's an > >>>>> issue for the core mm code which expects to find the *real* pfn a pte points > >>>>> to. Patch 1 fixes that by always returning svnapot ptes with the real pfn > >>>>> and restores the size of the mapping when it is written to a page table. > >>>>> > >>>>> The following patches are just merges of the 2 different implementations > >>>>> that currently exist in arm64 and riscv which are very similar. It paves > >>>>> the way to the reuse of the recent contpte THP work by Ryan [1] to avoid > >>>>> reimplementing the same in riscv. > >>>> Hi Alexandre, > >>>> > >>>> I've skimmed through this series and the one that moves contpte. I can see there > >>>> is definitely value in sharing the implementation, and the rough shape of things > >>>> seems appropriate. I had some minor concerns about making it harder to implement > >>>> potential future arm64 errata workarounds but on reflection, most of the > >>>> now-shared code is really just wrapping the primitives that are still > >>>> arch-specific. > >>>> > >>>> I'm going to need to spend proper time reviewing it to give detailed feedback, > >>>> but I'll be out on paternity leave for 3 weeks from end of Monday at the latest. > >>> Too bad, I expected to discuss that with you at LSF/MM...But congrats! > >>> Hope your wife is fine :) > >>> > >>>> So realistically I won't be able to do the detailed review until at least the > >>>> first week of June. > > Hi Alexandre, > > > > Sorry for the radio silence. I'm back at work now and have some cycles to review > > this. Did you ever post a new version based on the suggestions below? > > > Unfortunately no, other things happened that took all my attention, sorry. > > > >>>> Some high level thoughts: > >>>> > >>>> - huge_ptep_* functions could be working on different sized huge ptes - arm64 > >>>> supports contpte, pmd, contpmd and pud. Is keeping them in contpte.c > >>>> appropriate? > >>> Hmm indeed, I'll see what I can do. > >> > >> So I took a look at that. It amounts to doing the same as what we do for THP > >> contptes, ie having both contpte-aware and "normal" APIs. Let's take for example > >> huge_ptep_get(), below is what I get. To me it's not that bad, so I'll implement > >> this unless there is strong opposition. > > I'm not sure I've understood what you are going here... see below. > > > >> > >> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h > >> index f8efbc128446..869a9aae6c68 100644 > >> --- a/arch/arm64/include/asm/pgtable.h > >> +++ b/arch/arm64/include/asm/pgtable.h > >> @@ -1715,6 +1715,16 @@ static inline void clear_young_dirty_ptes(struct > >> vm_area_struct *vma, > >> contpte_clear_young_dirty_ptes(vma, addr, ptep, nr, flags); > >> } > >> > >> +static inline pte_t huge_ptep_get(pte_t *ptep) > >> +{ > >> + pte_t orig_pte = __ptep_get(ptep); > >> + > >> + if (!pte_present(orig_pte) || !pte_cont(orig_pte)) > >> + return orig_pte; > >> + > >> + return contpte_huge_ptep_get(ptep); > > A "huge pte" is not the same as a "cont pte". A huge pte is an abstract thing, > > which maybe of a number of different sizes; on arm64 with 4K base pages, 64K, > > 2M, 32M, 1G are supported. The 64K size is implemented using the PTE_CONT bit at > > PTE level. 2M is a single PMD level block, 32M uses PMD_CONT at PMD level and 1G > > is 1 PUD block. So I'm not sure it makes sense to tie this up with "contpte_" > > functions? > > > >> +} > >> + > >> #else /* CONFIG_ARM64_CONTPTE */ > >> > >> #define ptep_get __ptep_get > >> @@ -1736,6 +1746,8 @@ static inline void clear_young_dirty_ptes(struct > >> vm_area_struct *vma, > >> #define ptep_set_access_flags __ptep_set_access_flags > >> #define clear_young_dirty_ptes __clear_young_dirty_ptes > >> > >> +#define huge_ptep_get __ptep_get > > I don't quite understand the logic here. huge ptes are needed for hugetlb so > > their definition needs to be tied to that, not to ARM64_CONTPTE, which is an > > independent feature. > > > >> + > >> #endif /* CONFIG_ARM64_CONTPTE */ > >> > >> #endif /* !__ASSEMBLY__ */ > >> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c > >> index 3f09ac73cce3..aa0ee3f02226 100644 > >> --- a/arch/arm64/mm/hugetlbpage.c > >> +++ b/arch/arm64/mm/hugetlbpage.c > >> @@ -127,28 +127,6 @@ static inline int num_contig_ptes(unsigned long size, > >> size_t *pgsize) > >> return contig_ptes; > >> } > >> > >> -pte_t huge_ptep_get(pte_t *ptep) > >> -{ > >> - int ncontig, i; > >> - size_t pgsize; > >> - pte_t orig_pte = __ptep_get(ptep); > >> - > >> - if (!pte_present(orig_pte) || !pte_cont(orig_pte)) > >> - return orig_pte; > >> - > >> - ncontig = num_contig_ptes(page_size(pte_page(orig_pte)), &pgsize); > >> - for (i = 0; i < ncontig; i++, ptep++) { > >> - pte_t pte = __ptep_get(ptep); > >> - > >> - if (pte_dirty(pte)) > >> - orig_pte = pte_mkdirty(orig_pte); > >> - > >> - if (pte_young(pte)) > >> - orig_pte = pte_mkyoung(orig_pte); > >> - } > >> - return orig_pte; > >> -} > >> - > >> /* > >> * Changing some bits of contiguous entries requires us to follow a > >> * Break-Before-Make approach, breaking the whole contiguous set > >> diff --git a/mm/contpte.c b/mm/contpte.c > >> new file mode 100644 > >> index 000000000000..4e742cf00b6f > >> --- /dev/null > >> +++ b/mm/contpte.c > >> @@ -0,0 +1,17 @@ > >> +pte_t contpte_huge_ptep_get(pte_t *ptep) > >> +{ > >> + int ncontig, i; > >> + size_t pgsize; > >> + > >> + ncontig = num_contig_ptes(page_size(pte_page(orig_pte)), &pgsize); > >> + for (i = 0; i < ncontig; i++, ptep++) { > >> + pte_t pte = __ptep_get(ptep); > >> + > >> + if (pte_dirty(pte)) > >> + orig_pte = pte_mkdirty(orig_pte); > >> + > >> + if (pte_young(pte)) > >> + orig_pte = pte_mkyoung(orig_pte); > >> + } > >> + return orig_pte; > >> +} > > I guess your observation is that contpte_ and hugepte_ code looks similar so it > > shold be grouped? I think if we can get some actual reuse that might make sense, > > but as implemented, this function is completely separate from > > contpte_ptep_get(). I wonder if its simpler just to have contpte.c for contpte_ > > and hugepte_.c for hugepte_ then they can be included in the build independently > > based on arch/core Kconfigs (e.g. CONFIG_HUGETLB_PAGE vs CONFIG_ARM64_CONTPTE). > > > Yes, you're right, this was just rambling :) > > > > > >>>> Perhaps it's better to keep huge_pte and contpte separate? Also, it > >>>> only works on arm64 because we can get away with calling the lower-level pte > >>>> functions even when the huge_pte is actually a contpmd/pmd/pud, because the > >>>> format is the same. That might present challenges to other arches if the format > >>>> is different? > >>> Yes, but I think that if that happens, we could get away with it by > >>> choosing the right function depending on the size of the mapping? > >>> > >>>> - It might be easier to review if the arm64 stuff is first moved (without > >>>> changes) then modified to make it suitable for riscv, then for riscv to be > >>>> hooked up. At the moment I'm trying to follow all 3 parts per-function. > >>> Ok, let me give it a try during your paternity leave! > > Review would certainly be easier with this approach! > > > I'll do my best to do that soon. So I finally have time to rework this and I tested moving all the arm64 functions first and then adapting them to riscv. But I can't modify a function at a time as it fails to build for riscv, since the not-yet-adapted functions are arm64 specific. So I'd have to come with a big monolithic patch that changes everything at once, which I think will not help the review process. So I'll keep the same format and should be back soon with the new version. Thanks, Alex > > Hope everything went well for you. > > Thanks, > > Alex > > > > > > Thanks, > > Ryan > > > >>>> Thanks, > >>>> Ryan > >>> Thanks, > >>> > >>> Alex > >>> > >>>>> This patchset was tested by running the libhugetlbfs testsuite with 64KB > >>>>> and 2MB pages on both architectures (on a 4KB base page size arm64 kernel). > >>>>> > >>>>> [1] > >>>>> https://lore.kernel.org/linux-arm-kernel/20240215103205.2607016-1-ryan.roberts@xxxxxxx/ > >>>>> > >>>>> Changes in v2: > >>>>> - Rebase on top of 6.9-rc3 > >>>>> > >>>>> Alexandre Ghiti (9): > >>>>> riscv: Restore the pfn in a NAPOT pte when manipulated by core mm code > >>>>> riscv: Safely remove huge_pte_offset() when manipulating NAPOT ptes > >>>>> mm: Use common huge_ptep_get() function for riscv/arm64 > >>>>> mm: Use common set_huge_pte_at() function for riscv/arm64 > >>>>> mm: Use common huge_pte_clear() function for riscv/arm64 > >>>>> mm: Use common huge_ptep_get_and_clear() function for riscv/arm64 > >>>>> mm: Use common huge_ptep_set_access_flags() function for riscv/arm64 > >>>>> mm: Use common huge_ptep_set_wrprotect() function for riscv/arm64 > >>>>> mm: Use common huge_ptep_clear_flush() function for riscv/arm64 > >>>>> > >>>>> arch/arm64/Kconfig | 1 + > >>>>> arch/arm64/include/asm/pgtable.h | 56 +++++- > >>>>> arch/arm64/mm/hugetlbpage.c | 291 +--------------------------- > >>>>> arch/riscv/Kconfig | 1 + > >>>>> arch/riscv/include/asm/hugetlb.h | 2 +- > >>>>> arch/riscv/include/asm/pgtable-64.h | 11 ++ > >>>>> arch/riscv/include/asm/pgtable.h | 153 +++++++++++++-- > >>>>> arch/riscv/mm/hugetlbpage.c | 227 ---------------------- > >>>>> arch/riscv/mm/pgtable.c | 6 +- > >>>>> mm/Kconfig | 3 + > >>>>> mm/Makefile | 1 + > >>>>> mm/contpte.c | 272 ++++++++++++++++++++++++++ > >>>>> 12 files changed, 480 insertions(+), 544 deletions(-) > >>>>> create mode 100644 mm/contpte.c > >>>>> > >>> _______________________________________________ > >>> linux-riscv mailing list > >>> linux-riscv@xxxxxxxxxxxxxxxxxxx > >>> http://lists.infradead.org/mailman/listinfo/linux-riscv