Ryan Roberts <ryan.roberts@xxxxxxx> writes:

> On 24/11/2023 01:35, Alistair Popple wrote:
>>
>> Ryan Roberts <ryan.roberts@xxxxxxx> writes:
>>
>>> On 23/11/2023 05:13, Alistair Popple wrote:
>>>>
>>>> Ryan Roberts <ryan.roberts@xxxxxxx> writes:
>>>>
>>>>> ptep_get_and_clear_full() adds a 'full' parameter which is not present
>>>>> for the fallback ptep_get_and_clear() function. 'full' is set to 1 when
>>>>> a full address space teardown is in progress. We use this information
>>>>> to optimize arm64_sys_exit_group() by avoiding unfolding (and therefore
>>>>> tlbi) contiguous ranges. Instead we just clear the PTE but allow all
>>>>> the contiguous neighbours to keep their contig bit set, because we know
>>>>> we are about to clear the rest too.
>>>>>
>>>>> Before this optimization, the cost of arm64_sys_exit_group() exploded
>>>>> to 32x what it was before PTE_CONT support was wired up, when compiling
>>>>> the kernel. With this optimization in place, we are back down to the
>>>>> original cost.
>>>>>
>>>>> This approach is not perfect though, as for the duration between
>>>>> returning from the first call to ptep_get_and_clear_full() and making
>>>>> the final call, the contpte block is in an intermediate state, where
>>>>> some ptes are cleared and others are still set with the PTE_CONT bit.
>>>>> If any other APIs are called for the ptes in the contpte block during
>>>>> that time, we have to be very careful. The core code currently
>>>>> interleaves calls to ptep_get_and_clear_full() with ptep_get() and so
>>>>> ptep_get() must be careful to ignore the cleared entries when
>>>>> accumulating the access and dirty bits - the same goes for
>>>>> ptep_get_lockless(). The only other calls we might reasonably expect
>>>>> are to set markers in the previously cleared ptes. (We shouldn't see
>>>>> valid entries being set until after the tlbi, at which point we are no
>>>>> longer in the intermediate state).
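[The "ignore the cleared entries when accumulating access/dirty" behaviour described above can be illustrated with a small userspace model. The names, bit positions and block size below are invented stand-ins, not the real arm64 definitions.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins, not the real arm64 bit layout. */
#define PTE_VALID (1ULL << 0)
#define PTE_AF    (1ULL << 1)  /* hardware access flag */
#define PTE_DIRTY (1ULL << 2)  /* hardware dirty bit */
#define CONT_PTES 16           /* entries per contpte block */

typedef uint64_t pte_t;

static bool pte_none(pte_t pte)
{
	return pte == 0;
}

/*
 * Model of ptep_get() over a contpte block mid-teardown: accumulate
 * access/dirty from every entry in the block, but skip entries that
 * ptep_get_and_clear_full(..., full=1) has already cleared.
 */
static pte_t model_ptep_get(const pte_t *block, unsigned int idx)
{
	pte_t pte = block[idx];

	for (unsigned int i = 0; i < CONT_PTES; i++) {
		if (pte_none(block[i]))	/* cleared mid-teardown: skip */
			continue;
		pte |= block[i] & (PTE_AF | PTE_DIRTY);
	}
	return pte;
}
```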
>>>>> Since markers are not valid, this is safe; set_ptes() will see
>>>>> the old, invalid entry and will not attempt to unfold. And the new pte
>>>>> is also invalid so it won't attempt to fold. We shouldn't see this for
>>>>> the 'full' case anyway.
>>>>>
>>>>> The last remaining issue is returning the access/dirty bits. That info
>>>>> could be present in any of the ptes in the contpte block. ptep_get()
>>>>> will gather those bits from across the contpte block. We don't bother
>>>>> doing that here, because we know that the information is used by the
>>>>> core-mm to mark the underlying folio as accessed/dirty. And since the
>>>>> same folio must be underpinning the whole block (that was a
>>>>> requirement for folding in the first place), that information will
>>>>> make it to the folio eventually once all the ptes have been cleared.
>>>>> This approach means we don't have to play games with accumulating and
>>>>> storing the bits. It does mean that any interleaved calls to
>>>>> ptep_get() may lack correct access/dirty information if we have
>>>>> already cleared the pte that happened to store it. The core code does
>>>>> not rely on this though.
>>>>
>>>> Does not *currently* rely on this. I can't help but think it is
>>>> potentially something that could change in the future though, which
>>>> would lead to some subtle bugs.
>>>
>>> Yes, there is a risk, although IMHO it's very small.
>>>
>>>> Would there be any way of avoiding this? Half-baked thought, but could
>>>> you for example copy the access/dirty information to the last (or
>>>> perhaps first, most likely invalid) PTE?
>>>
>>> I spent a long time thinking about this and came up with a number of
>>> possibilities, none of them ideal. In the end, I went for the simplest
>>> one (which works but suffers from the problem that it depends on the way
>>> it is called not changing).
>>
>> Ok, that answers my underlying question of "has someone thought about
>> this and are there any easy solutions".
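[The "bits make it to the folio eventually" argument can be checked with a toy model of the teardown loop. This is a userspace sketch: `struct folio_model` and the bit layout are invented, and only the single-threaded behaviour is modelled.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Invented stand-ins for illustration only. */
#define PTE_VALID (1ULL << 0)
#define PTE_DIRTY (1ULL << 2)
#define CONT_PTES 16

typedef uint64_t pte_t;

struct folio_model {
	bool dirty;
};

/* Model of ptep_get_and_clear_full(full=1): clear just this entry and
 * return its old value, without gathering bits from its neighbours. */
static pte_t model_get_and_clear_full(pte_t *block, unsigned int i)
{
	pte_t old = block[i];

	block[i] = 0;
	return old;
}

/* Model of the core-mm teardown loop: whichever single entry the HW
 * happened to store the dirty bit in, the folio has been marked dirty
 * by the time the whole block has been cleared. */
static void model_zap_block(pte_t *block, struct folio_model *folio)
{
	for (unsigned int i = 0; i < CONT_PTES; i++) {
		pte_t old = model_get_and_clear_full(block, i);

		if (old & PTE_DIRTY)
			folio->dirty = true;
	}
}
```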
>> I suspected that was the case
>> given the excellent write-up though!
>>
>>> 1) Copy the access/dirty flags into all the remaining uncleared ptes
>>> within the contpte block. This is how I did it in v1, although it was
>>> racy. I think this could be implemented correctly but it's extremely
>>> complex.
>>>
>>> 2) Batch calls from the core-mm (like I did for pte_set_wrprotects())
>>> so that we can clear 1 or more full contpte blocks in a single call -
>>> the ptes are never in an intermediate state. This is difficult because
>>> ptep_get_and_clear_full() returns the pte that was cleared, so it's
>>> difficult to scale that up to multiple ptes.
>>>
>>> 3) Add ptep_get_no_access_dirty() and redefine the interface to only
>>> allow that to be called while ptep_get_and_clear_full() calls are
>>> on-going. Then assert in the other functions that
>>> ptep_get_and_clear_full() is not on-going when they are called. So we
>>> would get a clear sign that usage patterns have changed. But there is no
>>> easy place to store that state (other than scanning a contpte block
>>> looking for pte_none() amongst pte_valid_cont() entries) and it all felt
>>> ugly.
>>>
>>> 4) The simple approach I ended up taking; I thought it would be best to
>>> keep it simple and see if anyone was concerned before doing something
>>> more drastic.
>>>
>>> What do you think? If we really need to solve this, then option 1 is my
>>> preferred route, but it would take some time to figure out and reason
>>> about a race-free scheme.
>>
>> Well, I like simple, and I agree the risk is small. But I can't help
>> feeling the current situation is too subtle, mainly because it is
>> architecture specific and the assumptions are not communicated in
>> core-mm code anywhere. But also none of the alternatives seem much
>> better.
>>
>> However there are only three callers of ptep_get_and_clear_full(), and
>> all of these hold the PTL.
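[For reference, option 1 can be sketched in a single-threaded userspace model. The bit layout and names are invented, and the model deliberately ignores the concurrency against hardware access/dirty updates that made the v1 version racy - which is exactly the hard part.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Invented stand-ins for illustration only. */
#define PTE_VALID (1ULL << 0)
#define PTE_AF    (1ULL << 1)
#define PTE_DIRTY (1ULL << 2)
#define CONT_PTES 16

typedef uint64_t pte_t;

static bool pte_none(pte_t pte)
{
	return pte == 0;
}

/*
 * Option 1, single-threaded model: on each clear, copy the departing
 * entry's access/dirty flags into every remaining uncleared entry, so
 * later interleaved ptep_get() calls still see them. Doing this
 * race-free against concurrent HW updates is not modelled here.
 */
static pte_t model_get_and_clear_full_opt1(pte_t *block, unsigned int i)
{
	pte_t old = block[i];

	block[i] = 0;
	for (unsigned int j = 0; j < CONT_PTES; j++) {
		if (!pte_none(block[j]))
			block[j] |= old & (PTE_AF | PTE_DIRTY);
	}
	return old;
}
```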
>> So if I'm not mistaken, that should exclude
>> just about all users of ptep_get*(), which will take the PTL beforehand.
>
> The problem isn't racing threads because, as you say, the PTL is already
> serializing all calls except ptep_get_lockless(). And although there are
> 3 callers of ptep_get_and_clear_full(), only the caller in
> zap_pte_range() ever calls it with full=1, as I recall.
>
> The problem is that the caller in zap_pte_range() does this:
>
>   ptl = lock_page_table()
>   for each pte {
>       ptent = ptep_get(pte);
>       if (pte_present(ptent)) {
>           ptent = ptep_get_and_clear_full(ptent);
>           if (pte_dirty(ptent))
>               ...
>           if (pte_young(ptent))
>               ...
>       }
>   }
>   unlock_page_table(ptl)
>
> It deliberately interleaves calls to ptep_get() and
> ptep_get_and_clear_full() under the PTL. So if the loop is iterating
> over a contpte block and the HW happens to be storing the access/dirty
> info in the first pte entry, then the first time through the loop,
> ptep_get() will return the correct access/dirty info, as will
> ptep_get_and_clear_full(). The next time through the loop though, the
> access/dirty info which was in the previous pte is now cleared, so
> ptep_get() and ptep_get_and_clear_full() will return old/clean. It all
> works, but is fragile.

So if ptep_get_lockless() isn't a concern, what made the option posted
in v1 racy (your option 1 above)? Is there something else reading PTEs
or clearing PTE bits without holding the PTL that I'm missing?

>>
>> So really that only leaves ptep_get_lockless() that could/should
>> interleave right?
>
> Yes, but ptep_get_lockless() is special. Since it is called without the
> PTL, it is very careful to ensure that the contpte block is in a
> consistent state and it keeps trying until it is. So this will always
> return the correct consistent information.
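[The fragility in that interleaved loop can be reproduced with a small userspace model of the simple (option 4) semantics - again with an invented bit layout, not the real arm64 code.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Invented stand-ins for illustration only. */
#define PTE_VALID (1ULL << 0)
#define PTE_DIRTY (1ULL << 2)
#define CONT_PTES 16

typedef uint64_t pte_t;

static bool pte_none(pte_t pte)
{
	return pte == 0;
}

/* ptep_get() model: gather dirty from the uncleared entries. */
static pte_t model_ptep_get(const pte_t *block, unsigned int idx)
{
	pte_t pte = block[idx];

	for (unsigned int i = 0; i < CONT_PTES; i++) {
		if (!pte_none(block[i]))
			pte |= block[i] & PTE_DIRTY;
	}
	return pte;
}

/* ptep_get_and_clear_full(full=1) model: per-entry clear only, no
 * accumulation into the remaining entries. */
static pte_t model_get_and_clear_full(pte_t *block, unsigned int i)
{
	pte_t old = block[i];

	block[i] = 0;
	return old;
}
```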
>
>> From a quick glance at those users, none look at the
>> young/dirty information anyway, so I wonder if we can just assert in the
>> core-mm that ptep_get_lockless() does not return young/dirty information
>> and clear it in the helpers? That would make things explicit and
>> consistent, which would address my concern (although I haven't looked
>> too closely at the details there).
>
> As per the explanation above, it's not ptep_get_lockless() that is the
> problem, so I don't think this helps.
>
> Thanks,
> Ryan
>
>>
>>> Thanks,
>>> Ryan
>>