> On Oct 26, 2021, at 9:47 AM, Nadav Amit <namit@xxxxxxxxxx> wrote:
>
>
>
>> On Oct 26, 2021, at 9:06 AM, Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
>>
>> On 10/21/21 5:21 AM, Nadav Amit wrote:
>>> The first TLB flush is only necessary to prevent the dirty bit (and with
>>> a lesser importance the access bit) from changing while the PTE is
>>> modified. However, this is not necessary, as x86 CPUs set the
>>> dirty bit atomically with an additional check that the PTE is (still)
>>> present. One caveat is Intel's Knights Landing, which has a bug and does
>>> not do so.
>>
>> First, did I miss the check in this patch for X86_BUG_PTE_LEAK? I don't
>> see it anywhere.
>
> No, it is me who missed it. It should have been in pmdp_invalidate_ad():
>
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index 3481b35cb4ec..f14f64cc17b5 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -780,6 +780,30 @@ int pmd_clear_huge(pmd_t *pmd)
>  	return 0;
>  }
>
> +/*
> + * pmdp_invalidate_ad() - prevents the access and dirty bits from being further
> + * updated by the CPU.
> + *
> + * Returns the original PTE.
> + *
> + * During an access to a page, x86 CPUs set the dirty and access bit atomically
> + * with an additional check of the present bit. Therefore, it is possible to
> + * avoid the TLB flush if we change the PTE atomically, as pmdp_establish does.
> + *
> + * We do not make this optimization on certain CPUs that have a bug that
> + * violates this behavior (specifically Knights Landing).
> + */
> +pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
> +			 pmd_t *pmdp)
> +{
> +	pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
> +
> +	if (cpu_feature_enabled(X86_BUG_PTE_LEAK))
> +		flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> +	return old;
> +}
>
>>
>>> -	 * pmdp_invalidate() is required to make sure we don't miss
>>> -	 * dirty/young flags set by hardware.
>>
>> This got me thinking... In here:
>>
>>> https://lore.kernel.org/lkml/20160708001909.FB2443E2@viggo.jf.intel.com/
>>
>> I wrote:
>>
>>> These bits are truly "stray". In the case of the Dirty bit, the
>>> thread associated with the stray set was *not* allowed to write to
>>> the page. This means that we do not have to launder the bit(s); we
>>> can simply ignore them.
>>
>> Is the goal of your proposed patch here to ensure that the dirty bit is
>> not set at *all*? Or, is it to ensure that a dirty bit which we need to
>> *launder* is never set?
>
> At *all*.
>
> Err… I remembered from our previous discussions that the dirty bit cannot
> be set once the R/W bit is cleared atomically. But going back to the SDM,
> I see the (relatively new?) note:
>
> "If software on one logical processor writes to a page while software on
> another logical processor concurrently clears the R/W flag in the
> paging-structure entry that maps the page, execution on some processors may
> result in the entry's dirty flag being set (due to the write on the first
> logical processor) and the entry's R/W flag being clear (due to the update
> to the entry on the second logical processor). This will never occur on a
> processor that supports control-flow enforcement technology (CET)."
>
> So I guess that this optimization can only be enabled when CET is enabled.
>
> :(

I still wonder whether the SDM comment applies to present-bit vs. dirty-bit
atomicity as well.

On AMD's APM I find:

"The processor never sets the Accessed bit or the Dirty bit for a not
present page (P = 0). The ordering of Accessed and Dirty bit updates with
respect to surrounding loads and stores is discussed below."

(The latter comment concerns ordering with respect to WC memory.)

I don't know if I am reading it too creatively...
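For illustration only, here is a minimal sketch of what the CET-gated variant
discussed above might look like: the TLB flush is skipped only when the CPU
provides the atomicity guarantee the SDM ties to CET, and never on parts with
the Knights Landing erratum. X86_FEATURE_CET is a hypothetical placeholder for
whatever feature check would actually be used; this is not the posted patch.

/*
 * Sketch only, not the posted patch: skip the TLB flush only when the CPU
 * guarantees that the dirty bit is set atomically with the R/W and
 * present-bit checks (per the SDM note, guaranteed on CET-capable CPUs),
 * and never on CPUs with the PTE-leak erratum (Knights Landing).
 * X86_FEATURE_CET is a hypothetical placeholder flag.
 */
pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
			 pmd_t *pmdp)
{
	pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));

	/* Flush unless the atomicity guarantee holds and no erratum applies. */
	if (!cpu_feature_enabled(X86_FEATURE_CET) ||
	    cpu_feature_enabled(X86_BUG_PTE_LEAK))
		flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);

	return old;
}

On CPUs without the guarantee this keeps the flush and therefore behaves like
the generic pmdp_invalidate(), so the optimization would only take effect
where the SDM's CET condition is met.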