On 22/05/2024 at 03:13, Nicholas Piggin wrote:
> On Tue May 21, 2024 at 2:43 AM AEST, Christophe Leroy wrote:
>>
>>
>> On 20/05/2024 at 14:54, Nicholas Piggin wrote:
>>> On Sat May 18, 2024 at 5:00 AM AEST, Christophe Leroy wrote:
>>>> On book3s/64, the only user of hugepd is hash in 4k mode.
>>>>
>>>> All other setups (hash-64, radix-4, radix-64) use leaf PMD/PUD.
>>>>
>>>> Rework hash-4k to use contiguous PMD and PUD instead.
>>>>
>>>> In that setup there are only two huge page sizes: 16M and 16G.
>>>>
>>>> 16M sits at PMD level and 16G at PUD level.
>>>>
>>>> pte_update() doesn't know the page size, so let's use the same trick
>>>> as hpte_need_flush() to get the page size from segment properties.
>>>> That's not the most efficient way, but let's do that until callers of
>>>> pte_update() provide the page size instead of just a huge flag.
>>>>
>>>> Signed-off-by: Christophe Leroy <christophe.leroy@xxxxxxxxxx>
>>>> ---
>>>>  arch/powerpc/include/asm/book3s/64/hash-4k.h  | 15 --------
>>>>  arch/powerpc/include/asm/book3s/64/hash.h     | 38 +++++++++++++++----
>>>>  arch/powerpc/include/asm/book3s/64/hugetlb.h  | 38 -------------------
>>>>  .../include/asm/book3s/64/pgtable-4k.h        | 34 -----------------
>>>>  .../include/asm/book3s/64/pgtable-64k.h       | 20 ----------
>>>>  arch/powerpc/include/asm/hugetlb.h            |  4 ++
>>>>  .../include/asm/nohash/32/hugetlb-8xx.h       |  4 --
>>>>  .../powerpc/include/asm/nohash/hugetlb-e500.h |  4 --
>>>>  arch/powerpc/include/asm/page.h               |  8 ----
>>>>  arch/powerpc/mm/book3s64/hash_utils.c         | 11 ++++--
>>>>  arch/powerpc/mm/book3s64/pgtable.c            | 12 ------
>>>>  arch/powerpc/mm/hugetlbpage.c                 | 19 ----------
>>>>  arch/powerpc/mm/pgtable.c                     |  2 +-
>>>>  arch/powerpc/platforms/Kconfig.cputype        |  1 -
>>>>  14 files changed, 43 insertions(+), 167 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
>>>> index 6472b08fa1b0..c654c376ef8b 100644
>>>> --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
>>>> +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
>>>> @@ -74,21 +74,6 @@
>>>>  #define remap_4k_pfn(vma, addr, pfn, prot)	\
>>>>  	remap_pfn_range((vma), (addr), (pfn), PAGE_SIZE, (prot))
>>>>
>>>> -#ifdef CONFIG_HUGETLB_PAGE
>>>> -static inline int hash__hugepd_ok(hugepd_t hpd)
>>>> -{
>>>> -	unsigned long hpdval = hpd_val(hpd);
>>>> -	/*
>>>> -	 * if it is not a pte and have hugepd shift mask
>>>> -	 * set, then it is a hugepd directory pointer
>>>> -	 */
>>>> -	if (!(hpdval & _PAGE_PTE) && (hpdval & _PAGE_PRESENT) &&
>>>> -	    ((hpdval & HUGEPD_SHIFT_MASK) != 0))
>>>> -		return true;
>>>> -	return false;
>>>> -}
>>>> -#endif
>>>> -
>>>>  /*
>>>>   * 4K PTE format is different from 64K PTE format. Saving the hash_slot is just
>>>>   * a matter of returning the PTE bits that need to be modified. On 64K PTE,
>>>> diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
>>>> index faf3e3b4e4b2..509811ca7695 100644
>>>> --- a/arch/powerpc/include/asm/book3s/64/hash.h
>>>> +++ b/arch/powerpc/include/asm/book3s/64/hash.h
>>>> @@ -4,6 +4,7 @@
>>>>  #ifdef __KERNEL__
>>>>
>>>>  #include <asm/asm-const.h>
>>>> +#include <asm/book3s/64/slice.h>
>>>>
>>>>  /*
>>>>   * Common bits between 4K and 64K pages in a linux-style PTE.
>>>> @@ -161,14 +162,10 @@ extern void hpte_need_flush(struct mm_struct *mm, unsigned long addr,
>>>>  			    pte_t *ptep, unsigned long pte, int huge);
>>>>  unsigned long htab_convert_pte_flags(unsigned long pteflags, unsigned long flags);
>>>>  /* Atomic PTE updates */
>>>> -static inline unsigned long hash__pte_update(struct mm_struct *mm,
>>>> -					     unsigned long addr,
>>>> -					     pte_t *ptep, unsigned long clr,
>>>> -					     unsigned long set,
>>>> -					     int huge)
>>>> +static inline unsigned long hash__pte_update_one(pte_t *ptep, unsigned long clr,
>>>> +						 unsigned long set)
>>>>  {
>>>>  	__be64 old_be, tmp_be;
>>>> -	unsigned long old;
>>>>
>>>>  	__asm__ __volatile__(
>>>>  	"1:	ldarx	%0,0,%3		# pte_update\n\
>>>> @@ -182,11 +179,38 @@ static inline unsigned long hash__pte_update(struct mm_struct *mm,
>>>>  	: "r" (ptep), "r" (cpu_to_be64(clr)), "m" (*ptep),
>>>>  	  "r" (cpu_to_be64(H_PAGE_BUSY)), "r" (cpu_to_be64(set))
>>>>  	: "cc" );
>>>> +
>>>> +	return be64_to_cpu(old_be);
>>>> +}
>>>> +
>>>> +static inline unsigned long hash__pte_update(struct mm_struct *mm,
>>>> +					     unsigned long addr,
>>>> +					     pte_t *ptep, unsigned long clr,
>>>> +					     unsigned long set,
>>>> +					     int huge)
>>>> +{
>>>> +	unsigned long old;
>>>> +
>>>> +	old = hash__pte_update_one(ptep, clr, set);
>>>> +
>>>> +	if (huge && IS_ENABLED(CONFIG_PPC_4K_PAGES)) {
>>>> +		unsigned int psize = get_slice_psize(mm, addr);
>>>> +		int nb, i;
>>>> +
>>>> +		if (psize == MMU_PAGE_16M)
>>>> +			nb = SZ_16M / PMD_SIZE;
>>>> +		else if (psize == MMU_PAGE_16G)
>>>> +			nb = SZ_16G / PUD_SIZE;
>>>> +		else
>>>> +			nb = 1;
>>>> +
>>>> +		for (i = 1; i < nb; i++)
>>>> +			hash__pte_update_one(ptep + i, clr, set);
>>>> +	}
>>>>  	/* huge pages use the old page table lock */
>>>>  	if (!huge)
>>>>  		assert_pte_locked(mm, addr);
>>>>
>>>> -	old = be64_to_cpu(old_be);
>>>>  	if (old & H_PAGE_HASHPTE)
>>>>  		hpte_need_flush(mm, addr, ptep, old, huge);
>>>>
>>>
>>> Nice series, I don't know this hugepd code very well but I'll try.
>>> Why do you have to replicate the PTE entry here? The hash table refill
>>> should always be working on the first PTE of the page, otherwise we
>>> have bigger problems.
>>
>> I don't know how book3s/64 works exactly, but on nohash, when you get a
>> TLB miss exception, the only thing you have is the address; you don't
>> yet know it is a hugepage, so you fetch the PTE as if it were a 4k
>> page, and it is only when you read that PTE that you know it is a
>> hugepage.
>>
>> Ok, on book3s/64 the page size seems to be encoded inside the segment,
>> so maybe it is a bit different, but anyway the TLB miss exception (or
>> DSI?) can happen at any address.
>
> Right.
>
> If you think of the hash page table as a software-loaded TLB (which is
> how Linux kind of thinks of it), then a DSI is a TLB miss. The
> hash_page_x calls find the Linux PTE and load that translation into the
> hash page table.
>
> One of the hard parts is keeping them coherent with low overhead. This
> requires the PTE bits H_PAGE_BUSY, used as a lock, and H_PAGE_HASHPTE,
> which means the entry might be in the hash table. So the Linux PTE and
> the hash PTE have to be 1:1 in general.
>
> There are probably cases where we could get away from 1:1, but I would
> much prefer not to. Maybe read-only access would be okay though. But
> hash_page will have to always operate on the 0th PTE, which I think we
> get via segment size masking, same for any set / update / clear of the
> PTE.
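As an aside, that "0th PTE via segment size masking" idea could look
roughly like the sketch below. This is only a minimal illustration:
hugepage_0th_addr() is a name made up for this discussion, while
get_slice_psize() and mmu_psize_defs[] are the existing powerpc symbols.

/*
 * Hypothetical helper, not actual kernel code: mask the faulting
 * address down to the start of the huge page, using the page size
 * encoded in the slice/segment, so that hash_page and any
 * set/update/clear always operate on the 0th PTE of the contiguous
 * group.
 */
static unsigned long hugepage_0th_addr(struct mm_struct *mm,
				       unsigned long ea)
{
	unsigned int psize = get_slice_psize(mm, ea);
	unsigned long sz = 1UL << mmu_psize_defs[psize].shift;

	return ea & ~(sz - 1);	/* VA of the 0th PTE of the huge page */
}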
>>>
>>> What paths look at the N > 0 PTEs of a contiguous page entry?
>>
>> pte_offset_kernel() or pte_offset_map_lock() will land on any of the
>> contiguous PTEs based on the address handed to pte_index(), as if it
>> were a standard (4k or 64k) page.
>>
>> pte_index() doesn't know it is a hugepage; that's the reason why we
>> need to duplicate the entry.
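To make that point concrete, here is an illustrative sketch
(contig_entry_index() is a made-up name; SZ_16M and PMD_SHIFT are the
real kernel macros). With 4k base pages on book3s/64, PMD_SIZE is 2M, so
a 16M huge page is covered by SZ_16M / PMD_SIZE = 8 consecutive leaf
entries, and the walker selects one purely from address bits:

#include <linux/sizes.h>

/*
 * Index, within the 8-entry contiguous group covering a 16M huge page,
 * of the entry a page table walker lands on. An access at base + 5M
 * yields 2, not 0, which is why the entry has to be replicated across
 * the whole group.
 */
static unsigned int contig_entry_index(unsigned long addr)
{
	return (addr & (SZ_16M - 1)) >> PMD_SHIFT;
}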
> From the mm/ side of things, hugetlb page tables are always walked via
> the huge vma, which knows the page size and could align the address... I
> guess except for fast gup? Which should be read-only. So okay, you do
> need to replicate huge PTEs for fast gup at least. Any others?
>
> There's going to need to be a little more to it. __hash_page_huge sets
> the PTE accessed and dirty for example, so if we allow any PTE readers
> to check the non-0th PTE we would have to do something about that.
>
> How do you deal with dirty/accessed bits for other subarchs?

All nohash variants bail out of the TLB miss handler when accessing a
page which doesn't have the ACCESSED bit, or writing to a page which
doesn't have the DIRTY bit; see commit 2c74e2586bb9 ("powerpc/40x:
Rework 40x PTE access and TLB miss") and the other commits it refers to.

Same for the 603, which is the nohash version of book3s/32; see commits
f8b58c64eaef ("powerpc/603: let's handle PAGE_DIRTY directly") and
84de6ab0e904 ("powerpc/603: don't handle PAGE_ACCESSED in TLB miss
handlers.").

Only the hash version of book3s/32 still updates the PTE in the miss
handler, see
https://elixir.bootlin.com/linux/v6.9/source/arch/powerpc/mm/book3s32/hash_low.S#L146
but there are no hugepages on book3s/32.

> We could just remove the hash_page setting of those bits and just cause
> a fault and require Linux mm to set them. At least for hugepages we
> could probably do that without any real performance worry.
>
> Thanks,
> Nick
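For reference, the nohash behaviour described above (and what removing
the hash_page dirty/accessed update would amount to) boils down to the
pattern below, shown as pseudo-C since the real miss handlers are in
assembly. tlb_miss_can_install() is an invented name, while
_PAGE_ACCESSED and _PAGE_DIRTY are the real PTE bits:

/*
 * Pseudo-C sketch: never fix up ACCESSED or DIRTY in the TLB miss
 * handler; refuse to install the translation and fall through to the
 * page fault path so generic mm code updates the Linux PTE instead.
 */
static bool tlb_miss_can_install(unsigned long pte, bool is_write)
{
	if (!(pte & _PAGE_ACCESSED))
		return false;	/* fault path sets ACCESSED */
	if (is_write && !(pte & _PAGE_DIRTY))
		return false;	/* fault path sets DIRTY */
	return true;
}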