On 06/02/2025 12:55, Ryan Roberts wrote:
> On 06/02/2025 06:15, Anshuman Khandual wrote:
>> On 2/5/25 20:39, Ryan Roberts wrote:
>>> arm64 supports multiple huge_pte sizes. Some of the sizes are covered by
>>> a single pte entry at a particular level (PMD_SIZE, PUD_SIZE), and some
>>> are covered by multiple ptes at a particular level (CONT_PTE_SIZE,
>>> CONT_PMD_SIZE). So the function has to figure out the size from the
>>> huge_pte pointer. This was previously done by walking the pgtable to
>>> determine the level, then using the PTE_CONT bit to determine the number
>>> of ptes.
>>
>> Actually, PTE_CONT gets used to determine whether the entry is a normal
>> (i.e. PMD/PUD based) huge page or a cont PTE/PMD based huge page, just to
>> choose between calling the standard __ptep_get_and_clear() and the
>> specific get_clear_contig(), after determining the number of entries via
>> find_num_contig() by walking the page table.
>>
>> PTE_CONT presence is only used to determine the switch above, not to
>> determine the number of ptes for the mapping as mentioned earlier.
>
> Sorry, I don't really follow your distinction; PTE_CONT is used to decide whether
> we are operating on a single entry (pte_cont()==false) or on multiple entries
> (pte_cont()==true). For the multiple entry case, the level tells you the exact
> number.
>
> I can certainly tidy up this description a bit, but I think we both agree that
> the value of PTE_CONT is one of the inputs into deciding how many entries need
> to be operated on?
>
>>
>> There are two similar functions which determine this:
>>
>> static int find_num_contig(struct mm_struct *mm, unsigned long addr,
>>			   pte_t *ptep, size_t *pgsize)
>> {
>>	pgd_t *pgdp = pgd_offset(mm, addr);
>>	p4d_t *p4dp;
>>	pud_t *pudp;
>>	pmd_t *pmdp;
>>
>>	*pgsize = PAGE_SIZE;
>>	p4dp = p4d_offset(pgdp, addr);
>>	pudp = pud_offset(p4dp, addr);
>>	pmdp = pmd_offset(pudp, addr);
>>	if ((pte_t *)pmdp == ptep) {
>>		*pgsize = PMD_SIZE;
>>		return CONT_PMDS;
>>	}
>>	return CONT_PTES;
>> }
>>
>> find_num_contig() already assumes that the entry is a contig huge page and
>> just finds whether it is a PMD or PTE based one. This always requires first
>> determining that the PTE_CONT bit is set, via pte_cont(), before calling
>> find_num_contig() in each instance.
>
> Agreed.
>
>>
>> But num_contig_ptes() can get the same information without walking the
>> page table and thus without predetermining whether PTE_CONT is set. The
>> size can be derived from the VMA argument when present.
>
> Also agreed. But the VMA is not provided to this function. And because we want
> to use it for kernel space mappings, I think it's a bad idea to pass the VMA.
>
>>
>> static inline int num_contig_ptes(unsigned long size, size_t *pgsize)
>> {
>>	int contig_ptes = 0;
>>
>>	*pgsize = size;
>>
>>	switch (size) {
>> #ifndef __PAGETABLE_PMD_FOLDED
>>	case PUD_SIZE:
>>		if (pud_sect_supported())
>>			contig_ptes = 1;
>>		break;
>> #endif
>>	case PMD_SIZE:
>>		contig_ptes = 1;
>>		break;
>>	case CONT_PMD_SIZE:
>>		*pgsize = PMD_SIZE;
>>		contig_ptes = CONT_PMDS;
>>		break;
>>	case CONT_PTE_SIZE:
>>		*pgsize = PAGE_SIZE;
>>		contig_ptes = CONT_PTES;
>>		break;
>>	}
>>
>>	return contig_ptes;
>> }
>>
>> On a side note, why can't num_contig_ptes() be used all the time and
>> find_num_contig() be dropped? Or am I missing something here.
>
> There are 2 remaining users of find_num_contig() after my series:
> huge_ptep_set_access_flags() and huge_ptep_set_wrprotect(). Both of them can
> only be legitimately called for present ptes (so it's safe to check pte_cont()).
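(As an aside, to make that conversion concrete: a caller that already knows the
mapping size can get the geometry without any table walk or pte inspection.
Rough sketch only, using the standard hugetlb helpers to derive the size from
the vma for the user-space case; this is illustrative, not part of the patch:)

	/*
	 * Illustrative sketch: derive the huge_pte geometry from the
	 * known mapping size instead of from the pte contents or a
	 * page table walk.
	 */
	unsigned long sz = huge_page_size(hstate_vma(vma));
	size_t pgsize;
	int ncontig;

	ncontig = num_contig_ptes(sz, &pgsize);
	/* now operate on ncontig entries of pgsize bytes each, starting at ptep */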
> huge_ptep_set_access_flags() already has the VMA so it would be easy to convert
> to num_contig_ptes(). huge_ptep_set_wrprotect() doesn't have the VMA, but I
> guess you could do the trick where you take the size of the folio that the pte
> points to?
>
> So yes, I think we could drop find_num_contig() and I agree it would be an
> improvement.
>
> But to be honest, grabbing the folio size also feels like a hack to me (we do
> this in other places too). While today the folio size is guaranteed in practice
> to be the same size as the huge pte, I'm not sure there is any spec that
> mandates that?
>
> Perhaps the most robust thing is to just have a PTE_CONT bit for the swap-pte
> so we can tell the size of both present and non-present ptes, then do the table
> walk trick to find the level. Shrug.
>
>>
>>>
>>> But the PTE_CONT bit is only valid when the pte is present. For
>>> non-present pte values (e.g. markers, migration entries), the previous
>>> implementation was therefore erroneously determining the size. There is
>>> at least one known caller in core-mm, move_huge_pte(), which may call
>>> huge_ptep_get_and_clear() for a non-present pte. So we must be robust to
>>> this case. Additionally, the "regular" ptep_get_and_clear() is robust to
>>> being called for non-present ptes, so it makes sense to follow that
>>> behaviour.
>>
>> With the VMA argument and num_contig_ptes(), the dependency on PTE_CONT
>> being set and the entry being mapped might not be required.
>>
>>> Fix this by using the new sz parameter which is now provided to the
>>> function. Additionally, when clearing each pte in a contig range, don't
>>> gather the access and dirty bits if the pte is not present.
>>
>> Makes sense.
>>
>>>
>>> An alternative approach that would not require API changes would be to
>>> store the PTE_CONT bit in a spare bit in the swap entry pte. But it felt
>>> cleaner to follow other APIs' lead and just pass in the size.
>>
>> Right, changing the arguments in the API will help solve this problem.
>>
>>>
>>> While we are at it, add some debug warnings in functions that require
>>> the pte to be present.
>>>
>>> As an aside, PTE_CONT is bit 52, which corresponds to bit 40 in the swap
>>> entry offset field (layout of a non-present pte). Since hugetlb is never
>>> swapped to disk, this field will only be populated for markers, which
>>> always set this bit to 0, and hwpoison swap entries, which set the offset
>>> field to a PFN. So it would only ever be 1 on a 52-bit PA system where
>>> memory in that high half was poisoned (I think!). So in practice, this
>>> bit would almost always be zero for non-present ptes and we would only
>>> clear the first entry if it was actually a contiguous block. That's
>>> probably a less severe symptom than if it was always interpreted as 1
>>> and cleared out potentially-present neighboring PTEs.
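(For reference, the folio-size trick mentioned above would look something like
the below for huge_ptep_set_wrprotect(). Sketch only; it is only valid for a
present pte, and it bakes in exactly the folio-size == huge_pte-size assumption
I'm uncertain about:)

	pte_t pte = __ptep_get(ptep);
	size_t pgsize;
	int ncontig;

	/*
	 * Sketch: infer the huge_pte size from the folio the pte maps.
	 * Requires pte_present(pte), and relies on the folio size
	 * always matching the huge_pte size.
	 */
	ncontig = num_contig_ptes(folio_size(page_folio(pte_page(pte))), &pgsize);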
>>>
>>> Cc: <stable@xxxxxxxxxxxxxxx>
>>> Fixes: 66b3923a1a0f ("arm64: hugetlb: add support for PTE contiguous bit")
>>> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
>>> ---
>>>  arch/arm64/mm/hugetlbpage.c | 54 ++++++++++++++++++++-----------------
>>>  1 file changed, 29 insertions(+), 25 deletions(-)
>>>
>>> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
>>> index 06db4649af91..328eec4bfe55 100644
>>> --- a/arch/arm64/mm/hugetlbpage.c
>>> +++ b/arch/arm64/mm/hugetlbpage.c
>>> @@ -163,24 +163,23 @@ static pte_t get_clear_contig(struct mm_struct *mm,
>>>  				     unsigned long pgsize,
>>>  				     unsigned long ncontig)
>>>  {
>>> -	pte_t orig_pte = __ptep_get(ptep);
>>> -	unsigned long i;
>>> -
>>> -	for (i = 0; i < ncontig; i++, addr += pgsize, ptep++) {
>>> -		pte_t pte = __ptep_get_and_clear(mm, addr, ptep);
>>> -
>>> -		/*
>>> -		 * If HW_AFDBM is enabled, then the HW could turn on
>>> -		 * the dirty or accessed bit for any page in the set,
>>> -		 * so check them all.
>>> -		 */
>>> -		if (pte_dirty(pte))
>>> -			orig_pte = pte_mkdirty(orig_pte);
>>> -
>>> -		if (pte_young(pte))
>>> -			orig_pte = pte_mkyoung(orig_pte);
>>> +	pte_t pte, tmp_pte;
>>> +	bool present;
>>> +
>>> +	pte = __ptep_get_and_clear(mm, addr, ptep);
>>> +	present = pte_present(pte);
>>> +	while (--ncontig) {
>>
>> Although this does the right thing by calling __ptep_get_and_clear() once
>> for non-contig huge pages, I am wondering whether the cont/non-cont
>> separation should be maintained in the caller huge_ptep_get_and_clear(),
>> keeping the current logical bifurcation intact.
>
> To what benefit?
>
>>
>>> +		ptep++;
>>> +		addr += pgsize;
>>> +		tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
>>> +		if (present) {
>>
>> Checking for present entries makes sense here.
>>
>>> +			if (pte_dirty(tmp_pte))
>>> +				pte = pte_mkdirty(pte);
>>> +			if (pte_young(tmp_pte))
>>> +				pte = pte_mkyoung(pte);
>>> +		}
>>>  	}
>>> -	return orig_pte;
>>> +	return pte;
>>>  }
>>>
>>>  static pte_t get_clear_contig_flush(struct mm_struct *mm,
>>> @@ -401,13 +400,8 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
>>>  {
>>>  	int ncontig;
>>>  	size_t pgsize;
>>> -	pte_t orig_pte = __ptep_get(ptep);
>>> -
>>> -	if (!pte_cont(orig_pte))
>>> -		return __ptep_get_and_clear(mm, addr, ptep);
>>> -
>>> -	ncontig = find_num_contig(mm, addr, ptep, &pgsize);
>>>
>>> +	ncontig = num_contig_ptes(sz, &pgsize);
>>
>> __ptep_get_and_clear() can still be called here if 'ncontig' is returned
>> as 0, indicating a normal non-contig huge page, thus keeping
>> get_clear_contig() unchanged to handle just the contig huge pages.
>
> I think you're describing the case where num_contig_ptes() returns 0? The
> intention, from my reading of the function, is that num_contig_ptes() returns
> the number of ptes that need to be operated on (e.g. 1 for a single entry or N
> for a contig block). It will only return 0 if called with an invalid huge size.
> I don't believe it will ever "return 0 indicating a normal non-contig huge page".
>
> Perhaps the right solution is to add a warning if returning 0?
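(i.e. something like the below in huge_ptep_get_and_clear(); sketch only, and
the exact fallback for a bogus size is debatable:)

	ncontig = num_contig_ptes(sz, &pgsize);
	/*
	 * Sketch: a bogus size is a caller bug; warn rather than
	 * operate on the wrong range.
	 */
	if (WARN_ON_ONCE(!ncontig))
		return __pte(0);

	return get_clear_contig(mm, addr, ptep, pgsize, ncontig);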
>
>>
>>>  	return get_clear_contig(mm, addr, ptep, pgsize, ncontig);
>>>  }
>>>
>>> @@ -451,6 +445,8 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
>>>  	pgprot_t hugeprot;
>>>  	pte_t orig_pte;
>>>
>>> +	VM_WARN_ON(!pte_present(pte));
>>> +
>>>  	if (!pte_cont(pte))
>>>  		return __ptep_set_access_flags(vma, addr, ptep, pte, dirty);
>>>
>>> @@ -461,6 +457,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
>>>  		return 0;
>>>
>>>  	orig_pte = get_clear_contig_flush(mm, addr, ptep, pgsize, ncontig);
>>> +	VM_WARN_ON(!pte_present(orig_pte));
>>>
>>>  	/* Make sure we don't lose the dirty or young state */
>>>  	if (pte_dirty(orig_pte))
>>> @@ -485,7 +482,10 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
>>>  	size_t pgsize;
>>>  	pte_t pte;
>>>
>>> -	if (!pte_cont(__ptep_get(ptep))) {
>>> +	pte = __ptep_get(ptep);
>>> +	VM_WARN_ON(!pte_present(pte));
>>> +
>>> +	if (!pte_cont(pte)) {
>>>  		__ptep_set_wrprotect(mm, addr, ptep);
>>>  		return;
>>>  	}
>>> @@ -509,8 +509,12 @@ pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
>>>  	struct mm_struct *mm = vma->vm_mm;
>>>  	size_t pgsize;
>>>  	int ncontig;
>>> +	pte_t pte;
>>>
>>> -	if (!pte_cont(__ptep_get(ptep)))
>>> +	pte = __ptep_get(ptep);
>>> +	VM_WARN_ON(!pte_present(pte));
>>> +
>>> +	if (!pte_cont(pte))
>>>  		return ptep_clear_flush(vma, addr, ptep);
>>>
>>>  	ncontig = find_num_contig(mm, addr, ptep, &pgsize);
>>
>> In all the above instances, should not num_contig_ptes() be called to
>> determine whether a given entry is a non-contig or contig huge page, thus
>> dropping the need for the pte_cont() and pte_present() tests as proposed
>> here?
>
> Yeah, maybe. But as per above, we have options for how to do that. I'm not sure
> which is preferable at the moment. What do you think? Regardless, I think that
> cleanup would be a separate patch (which I'm happy to add for v2). For this bug
> fix, I was trying to do the minimum.

I took another look at this; I concluded that we should switch from
find_num_contig() to num_contig_ptes() for the 2 cases where we have the vma
and can therefore directly get the huge_pte size.

That leaves one callsite in huge_ptep_set_wrprotect(), which doesn't have the
vma. One approach would be to grab the folio out of the pte and use the size of
the folio. That's already done in huge_ptep_get(). But actually I think that's
a very brittle approach, because there is nothing stopping the size of the
folio from changing in future (you could have a folio twice the size mapped by
2 huge_ptes, for example).

So I'm actually proposing to keep find_num_contig() and use it additionally in
huge_ptep_get(). That will mean we end up with 2 callsites for
find_num_contig(), and 0 places that infer the huge_pte size from the folio
size. I think that's much cleaner personally. I'll do this for v2.

Thanks,
Ryan
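P.S. For huge_ptep_get(), the direction I have in mind is roughly the below.
Sketch only, not the actual v2 patch; it just swaps the folio-size lookup for
find_num_contig(), which is safe here because the pte is known to be present
and contiguous by that point:

pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
{
	int ncontig, i;
	size_t pgsize;
	pte_t orig_pte = __ptep_get(ptep);

	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
		return orig_pte;

	/*
	 * Sketch: walk the table for the geometry instead of inferring
	 * it from the folio size.
	 */
	ncontig = find_num_contig(mm, addr, ptep, &pgsize);

	/* Gather dirty/young across the whole contig block, as before. */
	for (i = 0; i < ncontig; i++, ptep++) {
		pte_t pte = __ptep_get(ptep);

		if (pte_dirty(pte))
			orig_pte = pte_mkdirty(orig_pte);

		if (pte_young(pte))
			orig_pte = pte_mkyoung(orig_pte);
	}
	return orig_pte;
}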