On Mon, Feb 26, 2024 at 12:03:21PM +0000, Ryan Roberts wrote:
> Make clear the atomicity/consistency requirements of the API and how we
> achieve them.
>
> Link: https://lore.kernel.org/linux-mm/Zc-Tqqfksho3BHmU@xxxxxxx/
> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
> ---
>  arch/arm64/mm/contpte.c | 24 ++++++++++++++----------
>  1 file changed, 14 insertions(+), 10 deletions(-)
>
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index be0a226c4ff9..1b64b4c3f8bf 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -183,16 +183,20 @@ EXPORT_SYMBOL_GPL(contpte_ptep_get);
>  pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
>  {
>  	/*
> -	 * Gather access/dirty bits, which may be populated in any of the ptes
> -	 * of the contig range. We may not be holding the PTL, so any contiguous
> -	 * range may be unfolded/modified/refolded under our feet. Therefore we
> -	 * ensure we read a _consistent_ contpte range by checking that all ptes
> -	 * in the range are valid and have CONT_PTE set, that all pfns are
> -	 * contiguous and that all pgprots are the same (ignoring access/dirty).
> -	 * If we find a pte that is not consistent, then we must be racing with
> -	 * an update so start again. If the target pte does not have CONT_PTE
> -	 * set then that is considered consistent on its own because it is not
> -	 * part of a contpte range.
> +	 * The ptep_get_lockless() API requires us to read and return *orig_ptep
> +	 * so that it is self-consistent, without the PTL held, so we may be
> +	 * racing with other threads modifying the pte. Usually a READ_ONCE()
> +	 * would suffice, but for the contpte case, we also need to gather the
> +	 * access and dirty bits from across all ptes in the contiguous block,
> +	 * and we can't read all of those neighbouring ptes atomically, so any
> +	 * contiguous range may be unfolded/modified/refolded under our feet.
> +	 * Therefore we ensure we read a _consistent_ contpte range by checking
> +	 * that all ptes in the range are valid and have CONT_PTE set, that all
> +	 * pfns are contiguous and that all pgprots are the same (ignoring
> +	 * access/dirty). If we find a pte that is not consistent, then we must
> +	 * be racing with an update so start again. If the target pte does not
> +	 * have CONT_PTE set then that is considered consistent on its own
> +	 * because it is not part of a contpte range.
>  	 */

I haven't had the time to properly think about this function but,
depending on what its semantics are, we might not guarantee that, at the
time of reading a pte, we have the correct dirty state from the other
ptes in the range.

A theoretical scenario: let's say we read the first pte in the contig
range and it's clean, but further down there's a dirty one. Another
(v)CPU breaks the contig range, sets the dirty bit everywhere,
pte_mkclean() is then applied to all of them and they are collapsed into
a contig range again. The function above, running on the first (v)CPU,
returns a clean pte when it should actually have been dirty at the time
of the read.

Throughout the callers of this function, I couldn't find one where it
matters, so I concluded that they don't need the dirty state. Normally
the dirty state is passed to the page flags, so it is not lost after the
pte has been cleaned.

-- 
Catalin
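
To make the consistency check and the theoretical race above concrete,
here is a minimal userspace sketch in C. It is only a model under stated
assumptions, not the code in arch/arm64/mm/contpte.c: the model_* names,
the 4-entry block size and the bit layout are all hypothetical, and the
racing (v)CPU is simulated by a callback invoked before each read rather
than by a real second thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: names and bit positions are NOT the arm64 ones. */
typedef uint64_t model_pte_t;

#define MODEL_CONT_PTES 4u
#define MODEL_VALID     (1ull << 0)
#define MODEL_CONT      (1ull << 1)
#define MODEL_YOUNG     (1ull << 2)
#define MODEL_DIRTY     (1ull << 3)
#define MODEL_PFN_SHIFT 12

/* Stand-in for a racing writer; called just before entry 'next' is read. */
typedef void (*model_race_hook)(model_pte_t *range, unsigned int next);

model_pte_t model_mk(uint64_t pfn, uint64_t flags)
{
	return (pfn << MODEL_PFN_SHIFT) | flags;
}

static uint64_t model_pfn(model_pte_t pte)
{
	return pte >> MODEL_PFN_SHIFT;
}

/* Attribute bits with access/dirty masked out, so they can be compared. */
static uint64_t model_prot(model_pte_t pte)
{
	return pte & ((1ull << MODEL_PFN_SHIFT) - 1) &
	       ~(MODEL_YOUNG | MODEL_DIRTY);
}

static bool model_valid_cont(model_pte_t pte)
{
	return (pte & (MODEL_VALID | MODEL_CONT)) == (MODEL_VALID | MODEL_CONT);
}

/*
 * One pass over the block containing table[idx]. Returns false if the
 * block looked inconsistent, in which case the caller must start again.
 */
bool model_try_read(model_pte_t *table, size_t idx, model_race_hook hook,
		    model_pte_t *out)
{
	model_pte_t orig = *(volatile model_pte_t *)&table[idx];
	size_t start = idx - (idx % MODEL_CONT_PTES);
	uint64_t pfn;
	unsigned int i;

	assert(out != NULL);
	if (!model_valid_cont(orig)) {
		*out = orig;	/* not in a contig block: consistent alone */
		return true;
	}

	pfn = model_pfn(orig) - (idx - start);
	for (i = 0; i < MODEL_CONT_PTES; i++, pfn++) {
		model_pte_t pte;

		if (hook)
			hook(&table[start], i);
		pte = *(volatile model_pte_t *)&table[start + i];

		if (!model_valid_cont(pte) || model_pfn(pte) != pfn ||
		    model_prot(pte) != model_prot(orig))
			return false;

		/* Gather access/dirty from every entry in the block. */
		orig |= pte & (MODEL_YOUNG | MODEL_DIRTY);
	}
	*out = orig;
	return true;
}

/* Retry until a single pass observes a consistent block. */
model_pte_t model_get_lockless(model_pte_t *table, size_t idx,
			       model_race_hook hook)
{
	model_pte_t pte;

	while (!model_try_read(table, idx, hook, &pte))
		;
	return pte;
}

/*
 * Simulated racer: after the first entry has been read, apply the net
 * effect of unfold / set dirty everywhere / mkclean all / refold, which
 * leaves every entry valid, contiguous and clean.
 */
void model_hook_clean_all(model_pte_t *range, unsigned int next)
{
	unsigned int j;

	if (next != 1)
		return;
	for (j = 0; j < MODEL_CONT_PTES; j++)
		range[j] &= ~MODEL_DIRTY;
}
```

With a block whose last entry is dirty, model_get_lockless(table, 0, NULL)
returns a dirty pte, whereas passing model_hook_clean_all makes the same
call return a clean one even though the block was dirty when the first
entry was read. Every check passes in that case, so no retry is
triggered. A real concurrent version would need proper atomic loads and a
second thread; the callback is used here only to make the interleaving
deterministic.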