Re: [PATCH 2/2] arm64/mm: Improve comment in contpte_ptep_get_lockless()

Ryan Roberts <ryan.roberts@xxxxxxx> · Mon, 4 Mar 2024 12:54:23 +0000

On 01/03/2024 18:47, Catalin Marinas wrote:
> On Mon, Feb 26, 2024 at 12:03:21PM +0000, Ryan Roberts wrote:
>> Make clear the atmicity/consistency requirements of the API and how we
>> achieve them.
>>
>> Link: https://lore.kernel.org/linux-mm/Zc-Tqqfksho3BHmU@xxxxxxx/
>> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
>> ---
>>  arch/arm64/mm/contpte.c | 24 ++++++++++++++----------
>>  1 file changed, 14 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> index be0a226c4ff9..1b64b4c3f8bf 100644
>> --- a/arch/arm64/mm/contpte.c
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -183,16 +183,20 @@ EXPORT_SYMBOL_GPL(contpte_ptep_get);
>>  pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
>>  {
>>  	/*
>> -	 * Gather access/dirty bits, which may be populated in any of the ptes
>> -	 * of the contig range. We may not be holding the PTL, so any contiguous
>> -	 * range may be unfolded/modified/refolded under our feet. Therefore we
>> -	 * ensure we read a _consistent_ contpte range by checking that all ptes
>> -	 * in the range are valid and have CONT_PTE set, that all pfns are
>> -	 * contiguous and that all pgprots are the same (ignoring access/dirty).
>> -	 * If we find a pte that is not consistent, then we must be racing with
>> -	 * an update so start again. If the target pte does not have CONT_PTE
>> -	 * set then that is considered consistent on its own because it is not
>> -	 * part of a contpte range.
>> +	 * The ptep_get_lockless() API requires us to read and return *orig_ptep
>> +	 * so that it is self-consistent, without the PTL held, so we may be
>> +	 * racing with other threads modifying the pte. Usually a READ_ONCE()
>> +	 * would suffice, but for the contpte case, we also need to gather the
>> +	 * access and dirty bits from across all ptes in the contiguous block,
>> +	 * and we can't read all of those neighbouring ptes atomically, so any
>> +	 * contiguous range may be unfolded/modified/refolded under our feet.
>> +	 * Therefore we ensure we read a _consistent_ contpte range by checking
>> +	 * that all ptes in the range are valid and have CONT_PTE set, that all
>> +	 * pfns are contiguous and that all pgprots are the same (ignoring
>> +	 * access/dirty). If we find a pte that is not consistent, then we must
>> +	 * be racing with an update so start again. If the target pte does not
>> +	 * have CONT_PTE set then that is considered consistent on its own
>> +	 * because it is not part of a contpte range.
>>  	 */
> 
> I haven't had the time to properly think about this function but,
> depending on what its semantics are, we might not guarantee that, at the
> time of reading a pte, we have the correct dirty state from the other
> ptes in the range.
> 
> Theoretical: let's say we read the first pte in the contig range and
> it's clean but further down there's a dirty one. Another (v)CPU breaks
> the contig range, sets the dirty bit everywhere, there's some
> pte_mkclean for all of them and they are collapsed into a contig range
> again. The function above on the first (v)CPU returns a clean pte when
> it should have actually been dirty at the time of read.

But I think that still conforms to semantics of the function. If you had the
same situation with a non-contpte mapping, the first thread may read the PTE at
any time; when it's dirty, or after its been cleaned by the other thread. It's
inherrently racy.

All that matters is that what is returned is _consistent_; you don't want to be
in a position where the first read of the block is mapping one folio, then by
the time all the access/dirty bits are read, the mapping is actually to a
different folio - that's an example of being inconsistent.

> 
> Throughout the callers of this function, I couldn't find one where it
> matters. So I concluded that they don't need the dirty state. Normally
> the dirty state is passed to the page flags, so not lost after the pte
> has been cleaned.

I agree we can simplify the semantics. But I think its better done in a separate
series (which I previously linked).

What's the bottom line here? Are you ok with this comment as a short term
solution for now, or do you want something more radical (i.e. push to get the
series that does these simplifications reviewed and in time for v6.9).

I still believe the current ptep_get_lockless() implementation is correct. So
given I have a plan to simplify in the long run, I hope we can still get this
series into v6.9 as planned.

Thanks,
Ryan