On 24/11/2023 01:35, Alistair Popple wrote:
> 
> Ryan Roberts <ryan.roberts@xxxxxxx> writes:
> 
>> On 23/11/2023 05:13, Alistair Popple wrote:
>>>
>>> Ryan Roberts <ryan.roberts@xxxxxxx> writes:
>>>
>>>> ptep_get_and_clear_full() adds a 'full' parameter which is not
>>>> present for the fallback ptep_get_and_clear() function. 'full' is
>>>> set to 1 when a full address space teardown is in progress. We use
>>>> this information to optimize arm64_sys_exit_group() by avoiding
>>>> unfolding (and therefore tlbi) contiguous ranges. Instead we just
>>>> clear the PTE but allow all the contiguous neighbours to keep their
>>>> contig bit set, because we know we are about to clear the rest too.
>>>>
>>>> Before this optimization, the cost of arm64_sys_exit_group()
>>>> exploded to 32x what it was before PTE_CONT support was wired up,
>>>> when compiling the kernel. With this optimization in place, we are
>>>> back down to the original cost.
>>>>
>>>> This approach is not perfect though, as for the duration between
>>>> returning from the first call to ptep_get_and_clear_full() and
>>>> making the final call, the contpte block is in an intermediate
>>>> state, where some ptes are cleared and others are still set with
>>>> the PTE_CONT bit. If any other APIs are called for the ptes in the
>>>> contpte block during that time, we have to be very careful. The
>>>> core code currently interleaves calls to ptep_get_and_clear_full()
>>>> with ptep_get() and so ptep_get() must be careful to ignore the
>>>> cleared entries when accumulating the access and dirty bits - the
>>>> same goes for ptep_get_lockless(). The only other calls we might
>>>> reasonably expect are to set markers in the previously cleared
>>>> ptes. (We shouldn't see valid entries being set until after the
>>>> tlbi, at which point we are no longer in the intermediate state).
>>>> Since markers are not valid, this is safe; set_ptes() will see the
>>>> old, invalid entry and will not attempt to unfold. And the new pte
>>>> is also invalid so it won't attempt to fold. We shouldn't see this
>>>> for the 'full' case anyway.
>>>>
>>>> The last remaining issue is returning the access/dirty bits. That
>>>> info could be present in any of the ptes in the contpte block.
>>>> ptep_get() will gather those bits from across the contpte block. We
>>>> don't bother doing that here, because we know that the information
>>>> is used by the core-mm to mark the underlying folio as
>>>> accessed/dirty. And since the same folio must be underpinning the
>>>> whole block (that was a requirement for folding in the first
>>>> place), that information will make it to the folio eventually once
>>>> all the ptes have been cleared. This approach means we don't have
>>>> to play games with accumulating and storing the bits. It does mean
>>>> that any interleaved calls to ptep_get() may lack correct
>>>> access/dirty information if we have already cleared the pte that
>>>> happened to store it. The core code does not rely on this though.
>>>
>>> Does not *currently* rely on this. I can't help but think it is
>>> potentially something that could change in the future though which
>>> would lead to some subtle bugs.
>>
>> Yes, there is a risk, although IMHO it's very small.
>>
>>> Would there be any way of avoiding this? Half-baked thought but
>>> could you for example copy the access/dirty information to the last
>>> (or perhaps first, most likely invalid) PTE?
>>
>> I spent a long time thinking about this and came up with a number of
>> possibilities, none of them ideal. In the end, I went for the
>> simplest one (which works but suffers from the problem that it
>> depends on the way it is called not changing).
> 
> Ok, that answers my underlying question of "has someone thought about
> this and are there any easy solutions". I suspected that was the case
> given the excellent write up though!
> 
>> 1) copy the access/dirty flags into all the remaining uncleared ptes
>> within the contpte block. This is how I did it in v1, although it was
>> racy. I think this could be implemented correctly but it's extremely
>> complex.
>>
>> 2) batch calls from the core-mm (like I did for pte_set_wrprotects())
>> so that we can clear 1 or more full contpte blocks in a single call -
>> the ptes are never in an intermediate state. This is difficult
>> because ptep_get_and_clear_full() returns the pte that was cleared so
>> it's difficult to scale that up to multiple ptes.
>>
>> 3) add ptep_get_no_access_dirty() and redefine the interface to only
>> allow that to be called while ptep_get_and_clear_full() calls are
>> on-going. Then assert in the other functions that
>> ptep_get_and_clear_full() is not on-going when they are called. So we
>> would get a clear sign that usage patterns have changed. But there is
>> no easy place to store that state (other than scanning a contpte
>> block looking for pte_none() amongst pte_valid_cont() entries) and it
>> all felt ugly.
>>
>> 4) The simple approach I ended up taking; I thought it would be best
>> to keep it simple and see if anyone was concerned before doing
>> something more drastic.
>>
>> What do you think? If we really need to solve this, then option 1 is
>> my preferred route, but it would take some time to figure out and
>> reason about a race-free scheme.
> 
> Well I like simple, and I agree the risk is small. But I can't help
> feeling the current situation is too subtle, mainly because it is
> architecture specific and the assumptions are not communicated in
> core-mm code anywhere. But also none of the alternatives seem much
> better.
> 
> However there are only three callers of ptep_get_and_clear_full(), and
> all of these hold the PTL. So if I'm not mistaken that should exclude
> just about all users of ptep_get*() which will take the ptl
> beforehand.

The problem isn't racing threads because, as you say, the PTL is
already serializing all calls except ptep_get_lockless(). And although
there are 3 callers of ptep_get_and_clear_full(), only the caller in
zap_pte_range() ever calls it with full=1, as I recall.

The problem is that the caller in zap_pte_range() does this:

ptl = lock_page_table()
for each pte {
	ptent = ptep_get(pte);
	if (pte_present(ptent)) {
		ptent = ptep_get_and_clear_full(pte);
		if (pte_dirty(ptent))
			...
		if (pte_young(ptent))
			...
	}
}
unlock_page_table(ptl)

It deliberately interleaves calls to ptep_get() and
ptep_get_and_clear_full() under the ptl. So if the loop is iterating
over a contpte block and the HW happens to be storing the access/dirty
info in the first pte entry, then the first time through the loop,
ptep_get() will return the correct access/dirty info, as will
ptep_get_and_clear_full(). The next time through the loop though, the
access/dirty info which was in the previous pte is now cleared, so
ptep_get() and ptep_get_and_clear_full() will return old/clean. It all
works, but is fragile.
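To make the dependency concrete, ptep_get() gathers the access/dirty
bits by scanning the whole contpte block, along these lines (a
simplified sketch only; the function name and alignment helper here are
illustrative, not the exact code):

static pte_t contpte_gather_access_dirty(pte_t *ptep, pte_t orig_pte)
{
	pte_t pte;
	int i;

	/* Align down to the first pte of the contpte block. */
	ptep = PTR_ALIGN_DOWN(ptep, CONT_PTES * sizeof(pte_t));

	for (i = 0; i < CONT_PTES; i++, ptep++) {
		pte = __ptep_get(ptep);

		/*
		 * Cleared entries are neither dirty nor young, so the
		 * intermediate state is naturally ignored here.
		 */
		if (pte_dirty(pte))
			orig_pte = pte_mkdirty(orig_pte);

		if (pte_young(pte))
			orig_pte = pte_mkyoung(orig_pte);
	}

	return orig_pte;
}

Once zap_pte_range() has cleared whichever pte happened to hold the
dirty/young bits, there is nothing left for this loop to find - that is
exactly the window described above.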
> 
> So really that only leaves ptep_get_lockless() that could/should
> interleave right?

Yes, but ptep_get_lockless() is special. Since it is called without the
PTL, it is very careful to ensure that the contpte block is in a
consistent state and it keeps trying until it is. So this will always
return the correct, consistent information.
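The shape of it is roughly this (again just a sketch under the same
naming assumptions as above, not the actual code):

static pte_t contpte_get_lockless_sketch(pte_t *orig_ptep)
{
	pte_t orig_pte, pte;
	unsigned long pfn;
	pte_t *ptep;
	int i;

retry:
	orig_pte = __ptep_get(orig_ptep);

	/* Not (or no longer) part of a contpte block; return it as-is. */
	if (!pte_valid_cont(orig_pte))
		return orig_pte;

	ptep = PTR_ALIGN_DOWN(orig_ptep, CONT_PTES * sizeof(pte_t));
	pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);

	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
		pte = __ptep_get(ptep);

		/*
		 * If any entry is no longer a valid member of the same
		 * contiguous mapping, the block changed under us; start
		 * again so we only ever return a consistent snapshot.
		 */
		if (!pte_valid_cont(pte) || pte_pfn(pte) != pfn)
			goto retry;

		if (pte_dirty(pte))
			orig_pte = pte_mkdirty(orig_pte);

		if (pte_young(pte))
			orig_pte = pte_mkyoung(orig_pte);
	}

	return orig_pte;
}

A racing ptep_get_and_clear_full() therefore just causes a retry (or an
early return once the pte itself has been cleared); lockless callers
never observe the intermediate state.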
> From a quick glance at those users, none look at the young/dirty
> information anyway, so I wonder if we can just assert in the core-mm
> that ptep_get_lockless() does not return young/dirty information and
> clear it in the helpers? That would make things explicit and
> consistent which would address my concern (although I haven't looked
> too closely at the details there).

As per the explanation above, it's not ptep_get_lockless() that is the
problem, so I don't think this helps.

Thanks,
Ryan

> 
>> Thanks,
>> Ryan
> 