On 06/02/2025 12:55, Ryan Roberts wrote:
> On 06/02/2025 06:15, Anshuman Khandual wrote:
>> On 2/5/25 20:39, Ryan Roberts wrote:
>>> arm64 supports multiple huge_pte sizes. Some of the sizes are covered by
>>> a single pte entry at a particular level (PMD_SIZE, PUD_SIZE), and some
>>> are covered by multiple ptes at a particular level (CONT_PTE_SIZE,
>>> CONT_PMD_SIZE). So the function has to figure out the size from the
>>> huge_pte pointer. This was previously done by walking the pgtable to
>>> determine the level, then using the PTE_CONT bit to determine the number
>>> of ptes.
>>
>> Actually, PTE_CONT gets used to determine whether the entry is a normal
>> (i.e. PMD/PUD based) huge page or a cont PTE/PMD based huge page, just to
>> choose between calling the standard __ptep_get_and_clear() and the
>> specific get_clear_contig(), after determining the number of entries via
>> find_num_contig() by walking the page table.
>>
>> PTE_CONT presence is only used to determine the switch above, not to
>> determine the number of ptes for the mapping as mentioned earlier.
>
> Sorry, I don't really follow your distinction; PTE_CONT is used to decide whether
> we are operating on a single entry (pte_cont()==false) or on multiple entries
> (pte_cont()==true). For the multiple entry case, the level tells you the exact
> number.
>
> I can certainly tidy up this description a bit, but I think we both agree that
> the value of PTE_CONT is one of the inputs into deciding how many entries need
> to be operated on?
>
>>
>> There are two similar functions which determine this:
>>
>> static int find_num_contig(struct mm_struct *mm, unsigned long addr,
>>			   pte_t *ptep, size_t *pgsize)
>> {
>>	pgd_t *pgdp = pgd_offset(mm, addr);
>>	p4d_t *p4dp;
>>	pud_t *pudp;
>>	pmd_t *pmdp;
>>
>>	*pgsize = PAGE_SIZE;
>>	p4dp = p4d_offset(pgdp, addr);
>>	pudp = pud_offset(p4dp, addr);
>>	pmdp = pmd_offset(pudp, addr);
>>	if ((pte_t *)pmdp == ptep) {
>>		*pgsize = PMD_SIZE;
>>		return CONT_PMDS;
>>	}
>>	return CONT_PTES;
>> }
>>
>> find_num_contig() already assumes that the entry is a contig huge page and
>> just finds whether it is a PMD or PTE based one. This always requires first
>> determining that the PTE_CONT bit is set, via pte_cont(), before calling
>> find_num_contig() in each instance.
>
> Agreed.
>
>>
>> But num_contig_ptes() can get the same information without walking the
>> page table and thus without predetermining whether PTE_CONT is set. The
>> size can be derived from the VMA argument when present.
>
> Also agreed. But the VMA is not provided to this function. And because we want
> to use it for kernel space mappings, I think it's a bad idea to pass the VMA.
>
>>
>> static inline int num_contig_ptes(unsigned long size, size_t *pgsize)
>> {
>>	int contig_ptes = 0;
>>
>>	*pgsize = size;
>>
>>	switch (size) {
>> #ifndef __PAGETABLE_PMD_FOLDED
>>	case PUD_SIZE:
>>		if (pud_sect_supported())
>>			contig_ptes = 1;
>>		break;
>> #endif
>>	case PMD_SIZE:
>>		contig_ptes = 1;
>>		break;
>>	case CONT_PMD_SIZE:
>>		*pgsize = PMD_SIZE;
>>		contig_ptes = CONT_PMDS;
>>		break;
>>	case CONT_PTE_SIZE:
>>		*pgsize = PAGE_SIZE;
>>		contig_ptes = CONT_PTES;
>>		break;
>>	}
>>
>>	return contig_ptes;
>> }
>>
>> On a side note, why can't num_contig_ptes() be used all the time and
>> find_num_contig() be dropped? Or am I missing something here.
>
> There are 2 remaining users of find_num_contig() after my series:
> huge_ptep_set_access_flags() and huge_ptep_set_wrprotect(). Both of them can
> only be legitimately called for present ptes (so it's safe to check pte_cont()).
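(As an aside, to make that conversion concrete: a caller that already knows the
mapping size can get the geometry without any table walk or pte inspection.
Rough sketch only, using the standard hugetlb helpers to derive the size from
the vma for the user-space case; this is illustrative, not part of the patch:)

	/*
	 * Illustrative sketch: derive the huge_pte geometry from the
	 * known mapping size instead of from the pte contents or a
	 * page table walk.
	 */
	unsigned long sz = huge_page_size(hstate_vma(vma));
	size_t pgsize;
	int ncontig;

	ncontig = num_contig_ptes(sz, &pgsize);
	/* now operate on ncontig entries of pgsize bytes each, starting at ptep */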
> huge_ptep_set_access_flags() already has the VMA so it would be easy to convert
> to num_contig_ptes(). huge_ptep_set_wrprotect() doesn't have the VMA, but I
> guess you could do the trick where you take the size of the folio that the pte
> points to?
>
> So yes, I think we could drop find_num_contig() and I agree it would be an
> improvement.
>
> But to be honest, grabbing the folio size also feels like a hack to me (we do
> this in other places too). While today the folio size is guaranteed in practice
> to be the same size as the huge pte, I'm not sure there is any spec that
> mandates that?
>
> Perhaps the most robust thing is to just have a PTE_CONT bit for the swap-pte
> so we can tell the size of both present and non-present ptes, then do the table
> walk trick to find the level. Shrug.
>
>>
>>>
>>> But the PTE_CONT bit is only valid when the pte is present. For
>>> non-present pte values (e.g. markers, migration entries), the previous
>>> implementation was therefore erroneously determining the size. There is
>>> at least one known caller in core-mm, move_huge_pte(), which may call
>>> huge_ptep_get_and_clear() for a non-present pte. So we must be robust to
>>> this case. Additionally, the "regular" ptep_get_and_clear() is robust to
>>> being called for non-present ptes, so it makes sense to follow that
>>> behaviour.
>>
>> With the VMA argument and num_contig_ptes(), the dependency on PTE_CONT
>> being set and the entry being mapped might not be required.
>>
>>> Fix this by using the new sz parameter which is now provided to the
>>> function. Additionally, when clearing each pte in a contig range, don't
>>> gather the access and dirty bits if the pte is not present.
>>
>> Makes sense.
>>
>>>
>>> An alternative approach that would not require API changes would be to
>>> store the PTE_CONT bit in a spare bit in the swap entry pte. But it felt
>>> cleaner to follow other APIs' lead and just pass in the size.
>>
>> Right, changing the arguments in the API will help solve this problem.
>>
>>>
>>> While we are at it, add some debug warnings in functions that require
>>> the pte to be present.
>>>
>>> As an aside, PTE_CONT is bit 52, which corresponds to bit 40 in the swap
>>> entry offset field (layout of a non-present pte). Since hugetlb is never
>>> swapped to disk, this field will only be populated for markers, which
>>> always set this bit to 0, and hwpoison swap entries, which set the offset
>>> field to a PFN. So it would only ever be 1 on a 52-bit PA system where
>>> memory in that high half was poisoned (I think!). So in practice, this
>>> bit would almost always be zero for non-present ptes and we would only
>>> clear the first entry if it was actually a contiguous block. That's
>>> probably a less severe symptom than if it was always interpreted as 1
>>> and cleared out potentially-present neighboring PTEs.
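(For reference, the folio-size trick mentioned above would look something like
the below for huge_ptep_set_wrprotect(). Sketch only; it is only valid for a
present pte, and it bakes in exactly the folio-size == huge_pte-size assumption
I'm uncertain about:)

	pte_t pte = __ptep_get(ptep);
	size_t pgsize;
	int ncontig;

	/*
	 * Sketch: infer the huge_pte size from the folio the pte maps.
	 * Requires pte_present(pte), and relies on the folio size
	 * always matching the huge_pte size.
	 */
	ncontig = num_contig_ptes(folio_size(page_folio(pte_page(pte))), &pgsize);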
>>>
>>> Cc: <stable@xxxxxxxxxxxxxxx>
>>> Fixes: 66b3923a1a0f ("arm64: hugetlb: add support for PTE contiguous bit")
>>> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
>>> ---
>>>  arch/arm64/mm/hugetlbpage.c | 54 ++++++++++++++++++++-----------------
>>>  1 file changed, 29 insertions(+), 25 deletions(-)
>>>
>>> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
>>> index 06db4649af91..328eec4bfe55 100644
>>> --- a/arch/arm64/mm/hugetlbpage.c
>>> +++ b/arch/arm64/mm/hugetlbpage.c
>>> @@ -163,24 +163,23 @@ static pte_t get_clear_contig(struct mm_struct *mm,
>>>  				     unsigned long pgsize,
>>>  				     unsigned long ncontig)
>>>  {
>>> -	pte_t orig_pte = __ptep_get(ptep);
>>> -	unsigned long i;
>>> -
>>> -	for (i = 0; i < ncontig; i++, addr += pgsize, ptep++) {
>>> -		pte_t pte = __ptep_get_and_clear(mm, addr, ptep);
>>> -
>>> -		/*
>>> -		 * If HW_AFDBM is enabled, then the HW could turn on
>>> -		 * the dirty or accessed bit for any page in the set,
>>> -		 * so check them all.
>>> -		 */
>>> -		if (pte_dirty(pte))
>>> -			orig_pte = pte_mkdirty(orig_pte);
>>> -
>>> -		if (pte_young(pte))
>>> -			orig_pte = pte_mkyoung(orig_pte);
>>> +	pte_t pte, tmp_pte;
>>> +	bool present;
>>> +
>>> +	pte = __ptep_get_and_clear(mm, addr, ptep);
>>> +	present = pte_present(pte);
>>> +	while (--ncontig) {
>>
>> Although this does the right thing by calling __ptep_get_and_clear() once
>> for non-contig huge pages, I am wondering whether the cont/non-cont
>> separation should be maintained in the caller huge_ptep_get_and_clear(),
>> keeping the current logical bifurcation intact.
>
> To what benefit?
>
>>
>>> +		ptep++;
>>> +		addr += pgsize;
>>> +		tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
>>> +		if (present) {
>>
>> Checking for present entries makes sense here.
>>
>>> +			if (pte_dirty(tmp_pte))
>>> +				pte = pte_mkdirty(pte);
>>> +			if (pte_young(tmp_pte))
>>> +				pte = pte_mkyoung(pte);
>>> +		}
>>>  	}
>>> -	return orig_pte;
>>> +	return pte;
>>>  }
>>>
>>>  static pte_t get_clear_contig_flush(struct mm_struct *mm,
>>> @@ -401,13 +400,8 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
>>>  {
>>>  	int ncontig;
>>>  	size_t pgsize;
>>> -	pte_t orig_pte = __ptep_get(ptep);
>>> -
>>> -	if (!pte_cont(orig_pte))
>>> -		return __ptep_get_and_clear(mm, addr, ptep);
>>> -
>>> -	ncontig = find_num_contig(mm, addr, ptep, &pgsize);
>>>
>>> +	ncontig = num_contig_ptes(sz, &pgsize);
>>
>> __ptep_get_and_clear() can still be called here if 'ncontig' is returned
>> as 0, indicating a normal non-contig huge page, thus keeping
>> get_clear_contig() unchanged to handle just the contig huge pages.
>
> I think you're describing the case where num_contig_ptes() returns 0? The
> intention, from my reading of the function, is that num_contig_ptes() returns
> the number of ptes that need to be operated on (e.g. 1 for a single entry or N
> for a contig block). It will only return 0 if called with an invalid huge size.
> I don't believe it will ever "return 0 indicating a normal non-contig huge page".
>
> Perhaps the right solution is to add a warning if returning 0?
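(i.e. something like the below in huge_ptep_get_and_clear(); sketch only, and
the exact fallback for a bogus size is debatable:)

	ncontig = num_contig_ptes(sz, &pgsize);
	/*
	 * Sketch: a bogus size is a caller bug; warn rather than
	 * operate on the wrong range.
	 */
	if (WARN_ON_ONCE(!ncontig))
		return __pte(0);

	return get_clear_contig(mm, addr, ptep, pgsize, ncontig);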
>
>>
>>>  	return get_clear_contig(mm, addr, ptep, pgsize, ncontig);
>>>  }
>>>
>>> @@ -451,6 +445,8 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
>>>  	pgprot_t hugeprot;
>>>  	pte_t orig_pte;
>>>
>>> +	VM_WARN_ON(!pte_present(pte));
>>> +
>>>  	if (!pte_cont(pte))
>>>  		return __ptep_set_access_flags(vma, addr, ptep, pte, dirty);
>>>
>>> @@ -461,6 +457,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
>>>  		return 0;
>>>
>>>  	orig_pte = get_clear_contig_flush(mm, addr, ptep, pgsize, ncontig);
>>> +	VM_WARN_ON(!pte_present(orig_pte));
>>>
>>>  	/* Make sure we don't lose the dirty or young state */
>>>  	if (pte_dirty(orig_pte))
>>> @@ -485,7 +482,10 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
>>>  	size_t pgsize;
>>>  	pte_t pte;
>>>
>>> -	if (!pte_cont(__ptep_get(ptep))) {
>>> +	pte = __ptep_get(ptep);
>>> +	VM_WARN_ON(!pte_present(pte));
>>> +
>>> +	if (!pte_cont(pte)) {
>>>  		__ptep_set_wrprotect(mm, addr, ptep);
>>>  		return;
>>>  	}
>>> @@ -509,8 +509,12 @@ pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
>>>  	struct mm_struct *mm = vma->vm_mm;
>>>  	size_t pgsize;
>>>  	int ncontig;
>>> +	pte_t pte;
>>>
>>> -	if (!pte_cont(__ptep_get(ptep)))
>>> +	pte = __ptep_get(ptep);
>>> +	VM_WARN_ON(!pte_present(pte));
>>> +
>>> +	if (!pte_cont(pte))
>>>  		return ptep_clear_flush(vma, addr, ptep);
>>>
>>>  	ncontig = find_num_contig(mm, addr, ptep, &pgsize);
>>
>> In all the above instances, should not num_contig_ptes() be called to
>> determine whether a given entry is a non-contig or contig huge page, thus
>> dropping the need for the pte_cont() and pte_present() tests as proposed
>> here?
>
> Yeah, maybe. But as per above, we have options for how to do that. I'm not sure
> which is preferable at the moment. What do you think? Regardless, I think that
> cleanup would be a separate patch (which I'm happy to add for v2). For this bug
> fix, I was trying to do the minimum.

I took another look at this; I concluded that we should switch from
find_num_contig() to num_contig_ptes() for the 2 cases where we have the vma
and can therefore directly get the huge_pte size.

That leaves one callsite in huge_ptep_set_wrprotect(), which doesn't have the
vma. One approach would be to grab the folio out of the pte and use the size of
the folio. That's already done in huge_ptep_get(). But actually I think that's
a very brittle approach, because there is nothing stopping the size of the
folio from changing in future (you could have a folio twice the size mapped by
2 huge_ptes, for example).

So I'm actually proposing to keep find_num_contig() and use it additionally in
huge_ptep_get(). That will mean we end up with 2 callsites for
find_num_contig(), and 0 places that infer the huge_pte size from the folio
size. I think that's much cleaner personally. I'll do this for v2.

Thanks,
Ryan
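P.S. For huge_ptep_get(), the direction I have in mind is roughly the below.
Sketch only, not the actual v2 patch; it just swaps the folio-size lookup for
find_num_contig(), which is safe here because the pte is known to be present
and contiguous by that point:

pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
{
	int ncontig, i;
	size_t pgsize;
	pte_t orig_pte = __ptep_get(ptep);

	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
		return orig_pte;

	/*
	 * Sketch: walk the table for the geometry instead of inferring
	 * it from the folio size.
	 */
	ncontig = find_num_contig(mm, addr, ptep, &pgsize);

	/* Gather dirty/young across the whole contig block, as before. */
	for (i = 0; i < ncontig; i++, ptep++) {
		pte_t pte = __ptep_get(ptep);

		if (pte_dirty(pte))
			orig_pte = pte_mkdirty(orig_pte);

		if (pte_young(pte))
			orig_pte = pte_mkyoung(orig_pte);
	}
	return orig_pte;
}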