On Thu, Jan 30, 2025 at 11:43:25AM +0100, Simona Vetter wrote: > On Thu, Jan 30, 2025 at 11:27:37AM +1100, Alistair Popple wrote: > > On Wed, Jan 29, 2025 at 12:58:02PM +0100, David Hildenbrand wrote: > > > Let's document how this function is to be used, and why the requirement > > > for the folio lock might maybe be dropped in the future. > > > > Sorry, only just catching up on your other thread. The folio lock was to ensure > > the GPU got a chance to make forward progress by mapping the page. Without it > > the CPU could immediately invalidate the entry before the GPU had a chance to > > retry the fault. > > > > Obviously performance wise having such thrashing is terrible, so should > > really be avoided by userspace, but the lock at least allowed such programs > > to complete. > > Imo this is not a legit use-case. If userspace concurrently (instead of > clearly alternating) uses the same 4k page for gpu atomics and on the cpu, > it just gets to keep the fallout. > > Plus there's no guarantee that we hold the folio_lock long enough for the > gpu to actually complete the atomic, so this isn't even really helping > with forward progress even if this somehow would be a legit usecase. Yes, agree it's not a legit real world use case. In practice though it was useful for testing this and other driver code by thrashing to generate a lot device/cpu faults and invalidations. Obviously "just for testing" is not a great justification though, so if it's causing problems we could get rid of it. > But this is also why thp is potentially an issue, because if thp > constantly creates pmd entries that potentially causes false sharing and > we do have thrashing that shouldn't happen. Yeah, I don't we should extend this to thps. - Alistair > -Sima > > > > Signed-off-by: David Hildenbrand <david@xxxxxxxxxx> > > > --- > > > mm/memory.c | 25 +++++++++++++++++++++++++ > > > 1 file changed, 25 insertions(+) > > > > > > diff --git a/mm/memory.c b/mm/memory.c > > > index 46956994aaff..caaae8df11a9 100644 > > > --- a/mm/memory.c > > > +++ b/mm/memory.c > > > @@ -718,6 +718,31 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma, > > > } > > > #endif > > > > > > +/** > > > + * restore_exclusive_pte - Restore a device-exclusive entry > > > + * @vma: VMA covering @address > > > + * @folio: the mapped folio > > > + * @page: the mapped folio page > > > + * @address: the virtual address > > > + * @ptep: PTE pointer into the locked page table mapping the folio page > > > + * @orig_pte: PTE value at @ptep > > > + * > > > + * Restore a device-exclusive non-swap entry to an ordinary present PTE. > > > + * > > > + * The folio and the page table must be locked, and MMU notifiers must have > > > + * been called to invalidate any (exclusive) device mappings. In case of > > > + * fork(), MMU_NOTIFY_PROTECTION_PAGE is triggered, and in case of a page > > > + * fault MMU_NOTIFY_EXCLUSIVE is triggered. > > > + * > > > + * Locking the folio makes sure that anybody who just converted the PTE to > > > + * a device-private entry can map it into the device, before unlocking it; so > > > + * the folio lock prevents concurrent conversion to device-exclusive. > > > > I don't quite follow this - a concurrent conversion would already fail > > because the GUP in make_device_exclusive_range() would most likely cause > > an unexpected reference during the migration. And if a migration entry > > has already been installed for the device private PTE conversion then > > make_device_exclusive_range() will skip it as a non-present entry anyway. > > > > However s/device-private/device-exclusive/ makes sense - the intent was to allow > > the device to map it before a call to restore_exclusive_pte() (ie. a CPU fault) > > could convert it back to a normal PTE. > > > > > + * TODO: the folio lock does not protect against all cases of concurrent > > > + * page table modifications (e.g., MADV_DONTNEED, mprotect), so device drivers > > > + * must already use MMU notifiers to sync against any concurrent changes > > > > Right. It's expected drivers are using MMU notifiers to keep page tables in > > sync, same as for hmm_range_fault(). > > > > > + * Maybe the requirement for the folio lock can be dropped in the future. > > > + */ > > > static void restore_exclusive_pte(struct vm_area_struct *vma, > > > struct folio *folio, struct page *page, unsigned long address, > > > pte_t *ptep, pte_t orig_pte) > > > -- > > > 2.48.1 > > > > > -- > Simona Vetter > Software Engineer, Intel Corporation > http://blog.ffwll.ch