Re: huge zero page confusion

On 04/07/2024 15:07, Kirill A. Shutemov wrote:
> On Thu, Jul 04, 2024 at 02:39:49PM +0100, Ryan Roberts wrote:
>> On 04/07/2024 12:41, Kirill A. Shutemov wrote:
>>> On Wed, Jul 03, 2024 at 06:37:48PM +0100, Ryan Roberts wrote:
>>>> Hi Kirill, Hugh, Mel,
>>>>
>>>> We recently had a problem reported at [1]: because the aarch64 architecture
>>>> requires that atomic RMW instructions raise a read fault followed by a write
>>>> fault, a huge zero page is faulted in during the read fault; the write fault
>>>> then shatters the huge zero page, installing small zero pages for every PTE
>>>> in the PMD region, except the faulting address, which gets a writable
>>>> private page.
>>>>
>>>> A number of ways to solve that problem were discussed. But it got me wondering
>>>> why we have this behaviour for the huge zero page in general? It seems like
>>>> odd behaviour to me. Surely it would be less effort, and more aligned with the
>>>> app's expectations, to notice the huge zero page in the PMD, remove it, and
>>>> install a THP, as would have been done if pmd_none() had been true? Or, if
>>>> there is a reason to shatter on write, why not do away with the huge zero
>>>> page and save some memory, and just install a PMD's worth of small zero pages
>>>> on fault?
>>>>
>>>> Perhaps replacing the huge zero page with a huge THP on write fault would have
>>>> been a better behaviour at the time, but perhaps changing that behaviour now
>>>> risks a memory bloat regression in some workloads?
>>>
>>> Yeah, I agree that a WP fault on the zero page should give a THP. I think
>>> treating the zero page as a none PMD on write page fault should be safe and
>>> reasonable.
>>
>> So you're not concerned about the potential for memory bloat regressions in
>> apps that are deployed today? I'm a bit nervous about making the change
>> without a bunch of testing...
> 
> No, I am not concerned. It is silly to expect a different result depending on
> which comes first, a read or a write fault to the page. It should be consistent.
> 
> On a related note, I think we need to drop __split_huge_zero_page_pmd(). We
> can just unmap it on split and let the caller populate it as needed, as we do
> for file mappings.
> 
>>> Untested patch is below.
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index 2aa986a5cd1b..04c252303951 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -552,6 +552,11 @@ static inline bool thp_migration_supported(void)
>>>  }
>>>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>>  
>>> +static inline bool pmd_none_or_zero_folio(pmd_t pmd)
>>> +{
>>> +	return pmd_none(pmd) || is_huge_zero_pmd(pmd);
>>> +}
>>> +
>>>  static inline int split_folio_to_list_to_order(struct folio *folio,
>>>  		struct list_head *list, int new_order)
>>>  {
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 89932fd0f62e..fdd5236004bc 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -951,7 +951,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>>>  	__folio_mark_uptodate(folio);
>>>  
>>>  	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>>> -	if (unlikely(!pmd_none(*vmf->pmd))) {
>>> +	if (unlikely(!pmd_none_or_zero_folio(*vmf->pmd))) {
>>
>> Hmm, not sure about this; wouldn't we need to "uninstall" the huge zero page
>> somehow? I'm guessing TLB invalidation and a refcount decrement on the zero
>> page (assuming it's refcounted... perhaps it's not).
> 
> We track the huge zero page at the mm level. The refcount is decremented on
> exit(2), not on zero page unmap.
> 
> And there's no need for a TLB flush when going from write-protected to
> writeable. The page walk will re-fill the TLB before delivering the WP fault.

I think that's only the case for x86, and only when the pfn remains the same. As
has just been explained to me by a colleague, x86 will implicitly invalidate the
TLB entry on fault, but only on the local CPU, so that's only helpful when the
PFN isn't changing. In any case, Arm never does the implicit TLB invalidation,
so it always needs to be done explicitly.

By way of example, wp_page_copy() calls ptep_clear_flush() before setting the
new pte entry.
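
(A hedged sketch of that ordering, not the actual wp_page_copy() code; locking,
rmap and accounting are omitted, and replace_pte_sketch() is a hypothetical
helper. The point is that the old entry is cleared and flushed before the new
one is installed.)

#include <linux/mm.h>

/* Sketch only: replace a PTE with one pointing at a different pfn. */
static void replace_pte_sketch(struct vm_area_struct *vma, unsigned long addr,
			       pte_t *ptep, pte_t newpte)
{
	/* Clear the old entry and invalidate it in all TLBs... */
	ptep_clear_flush(vma, addr, ptep);
	/* ...only then is it safe to install the entry for the new pfn. */
	set_pte_at(vma->vm_mm, addr, ptep, newpte);
}
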

> 
> Anyway, testing and careful review are required. I may be rusty here.

Indeed. Thanks for the advice anyway. I'll put this on my todo list to implement
and send out a patch in due course.

And thanks, David; good point about being careful with uffd.

Thanks,
Ryan

> 
>>>  		goto unlock_release;
>>>  	} else {
>>>  		pmd_t entry;
>>> @@ -1536,8 +1536,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
>>>  	vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd);
>>>  	VM_BUG_ON_VMA(!vma->anon_vma, vma);
>>>  
>>> -	if (is_huge_zero_pmd(orig_pmd))
>>> -		goto fallback;
>>> +	VM_BUG_ON(is_huge_zero_pmd(orig_pmd));
>>>  
>>>  	spin_lock(vmf->ptl);
>>>  
>>> @@ -1606,7 +1605,6 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
>>>  unlock_fallback:
>>>  	folio_unlock(folio);
>>>  	spin_unlock(vmf->ptl);
>>> -fallback:
>>>  	__split_huge_pmd(vma, vmf->pmd, vmf->address, false, NULL);
>>>  	return VM_FAULT_FALLBACK;
>>>  }
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 0f47a533014e..cc12deeb0593 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -5488,15 +5488,15 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
>>>  	if (pud_trans_unstable(vmf.pud))
>>>  		goto retry_pud;
>>>  
>>> -	if (pmd_none(*vmf.pmd) &&
>>> +	vmf.orig_pmd = pmdp_get_lockless(vmf.pmd);
>>> +
>>> +	if (pmd_none_or_zero_folio(vmf.orig_pmd) &&
>>>  	    thp_vma_allowable_order(vma, vm_flags,
>>>  				TVA_IN_PF | TVA_ENFORCE_SYSFS, PMD_ORDER)) {
>>>  		ret = create_huge_pmd(&vmf);
>>>  		if (!(ret & VM_FAULT_FALLBACK))
>>>  			return ret;
>>>  	} else {
>>> -		vmf.orig_pmd = pmdp_get_lockless(vmf.pmd);
>>> -
>>>  		if (unlikely(is_swap_pmd(vmf.orig_pmd))) {
>>>  			VM_BUG_ON(thp_migration_supported() &&
>>>  					  !is_pmd_migration_entry(vmf.orig_pmd));
>>
> 
