On Mon, May 24, 2021 at 11:27:22PM +1000, Alistair Popple wrote: > Some devices require exclusive write access to shared virtual > memory (SVM) ranges to perform atomic operations on that memory. This > requires CPU page tables to be updated to deny access whilst atomic > operations are occurring. > > In order to do this introduce a new swap entry > type (SWP_DEVICE_EXCLUSIVE). When a SVM range needs to be marked for > exclusive access by a device all page table mappings for the particular > range are replaced with device exclusive swap entries. This causes any > CPU access to the page to result in a fault. > > Faults are resovled by replacing the faulting entry with the original > mapping. This results in MMU notifiers being called which a driver uses > to update access permissions such as revoking atomic access. After > notifiers have been called the device will no longer have exclusive > access to the region. > > Walking of the page tables to find the target pages is handled by > get_user_pages() rather than a direct page table walk. A direct page > table walk similar to what migrate_vma_collect()/unmap() does could also > have been utilised. However this resulted in more code similar in > functionality to what get_user_pages() provides as page faulting is > required to make the PTEs present and to break COW. > > Signed-off-by: Alistair Popple <apopple@xxxxxxxxxx> > Reviewed-by: Christoph Hellwig <hch@xxxxxx> > > --- > > v9: > * Split rename of migrate_pgmap_owner into a separate patch. > * Added comments explaining SWP_DEVICE_EXCLUSIVE_* entries. > * Renamed try_to_protect{_one} to page_make_device_exclusive{_one} based > somewhat on a suggestion from Peter Xu. I was never particularly happy > with try_to_protect() as a name so think this is better. > * Removed unneccesary code and reworded some comments based on feedback > from Peter Xu. > * Removed the VMA walk when restoring PTEs for device-exclusive entries. > * Simplified implementation of copy_pte_range() to fail if the page > cannot be locked. This might lead to occasional fork() failures but at > this stage we don't think that will be an issue. > > v8: > * Remove device exclusive entries on fork rather than copy them. > > v7: > * Added Christoph's Reviewed-by. > * Minor cosmetic cleanups suggested by Christoph. > * Replace mmu_notifier_range_init_migrate/exclusive with > mmu_notifier_range_init_owner as suggested by Christoph. > * Replaced lock_page() with lock_page_retry() when handling faults. > * Restrict to anonymous pages for now. > > v6: > * Fixed a bisectablity issue due to incorrectly applying the rename of > migrate_pgmap_owner to the wrong patches for Nouveau and hmm_test. > > v5: > * Renamed range->migrate_pgmap_owner to range->owner. > * Added MMU_NOTIFY_EXCLUSIVE to allow passing of a driver cookie which > allows notifiers called as a result of make_device_exclusive_range() to > be ignored. > * Added a check to try_to_protect_one() to detect if the pages originally > returned from get_user_pages() have been unmapped or not. > * Removed check_device_exclusive_range() as it is no longer required with > the other changes. > * Documentation update. > > v4: > * Add function to check that mappings are still valid and exclusive. > * s/long/unsigned long/ in make_device_exclusive_entry(). > --- > Documentation/vm/hmm.rst | 17 ++++ > include/linux/mmu_notifier.h | 6 ++ > include/linux/rmap.h | 4 + > include/linux/swap.h | 7 +- > include/linux/swapops.h | 44 ++++++++- > mm/hmm.c | 5 + > mm/memory.c | 128 +++++++++++++++++++++++- > mm/mprotect.c | 8 ++ > mm/page_vma_mapped.c | 9 +- > mm/rmap.c | 186 +++++++++++++++++++++++++++++++++++ > 10 files changed, 405 insertions(+), 9 deletions(-) > > diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst > index 3df79307a797..a14c2938e7af 100644 > --- a/Documentation/vm/hmm.rst > +++ b/Documentation/vm/hmm.rst > @@ -405,6 +405,23 @@ between device driver specific code and shared common code: > > The lock can now be released. > > +Exclusive access memory > +======================= > + > +Some devices have features such as atomic PTE bits that can be used to implement > +atomic access to system memory. To support atomic operations to a shared virtual > +memory page such a device needs access to that page which is exclusive of any > +userspace access from the CPU. The ``make_device_exclusive_range()`` function > +can be used to make a memory range inaccessible from userspace. > + > +This replaces all mappings for pages in the given range with special swap > +entries. Any attempt to access the swap entry results in a fault which is > +resovled by replacing the entry with the original mapping. A driver gets > +notified that the mapping has been changed by MMU notifiers, after which point > +it will no longer have exclusive access to the page. Exclusive access is > +guranteed to last until the driver drops the page lock and page reference, at > +which point any CPU faults on the page may proceed as described. > + > Memory cgroup (memcg) and rss accounting > ======================================== > > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h > index 8e428eb813b8..d049e0f6f756 100644 > --- a/include/linux/mmu_notifier.h > +++ b/include/linux/mmu_notifier.h > @@ -42,6 +42,11 @@ struct mmu_interval_notifier; > * @MMU_NOTIFY_MIGRATE: used during migrate_vma_collect() invalidate to signal > * a device driver to possibly ignore the invalidation if the > * owner field matches the driver's device private pgmap owner. > + * > + * @MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no > + * longer have exclusive access to the page. May ignore the invalidation that's > + * part of make_device_exclusive_range() if the owner field > + * matches the value passed to make_device_exclusive_range(). Perhaps s/matches/does not match/? > */ > enum mmu_notifier_event { > MMU_NOTIFY_UNMAP = 0, > @@ -51,6 +56,7 @@ enum mmu_notifier_event { > MMU_NOTIFY_SOFT_DIRTY, > MMU_NOTIFY_RELEASE, > MMU_NOTIFY_MIGRATE, > + MMU_NOTIFY_EXCLUSIVE, > }; > > #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0) > diff --git a/include/linux/rmap.h b/include/linux/rmap.h > index 0e25d829f742..3a1ce4ef9276 100644 > --- a/include/linux/rmap.h > +++ b/include/linux/rmap.h > @@ -193,6 +193,10 @@ int page_referenced(struct page *, int is_locked, > bool try_to_migrate(struct page *page, enum ttu_flags flags); > bool try_to_unmap(struct page *, enum ttu_flags flags); > > +int make_device_exclusive_range(struct mm_struct *mm, unsigned long start, > + unsigned long end, struct page **pages, > + void *arg); > + > /* Avoid racy checks */ > #define PVMW_SYNC (1 << 0) > /* Look for migarion entries rather than present PTEs */ > diff --git a/include/linux/swap.h b/include/linux/swap.h > index a6d4505ecf73..306df39d7c67 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -63,11 +63,16 @@ static inline int current_is_kswapd(void) > * > * When a page is migrated from CPU to device, we set the CPU page table entry > * to a special SWP_DEVICE_* entry. s/SWP_DEVICE_*/SWP_DEVICE_{READ|WRITE}/? Since SWP_DEVICE_* covers all four too. > + * > + * When a page is mapped by the device for exclusive access we set the CPU page > + * table entries to special SWP_DEVICE_EXCLUSIVE_* entries. > */ > #ifdef CONFIG_DEVICE_PRIVATE > -#define SWP_DEVICE_NUM 2 > +#define SWP_DEVICE_NUM 4 > #define SWP_DEVICE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM) > #define SWP_DEVICE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+1) > +#define SWP_DEVICE_EXCLUSIVE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+2) > +#define SWP_DEVICE_EXCLUSIVE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+3) > #else > #define SWP_DEVICE_NUM 0 > #endif > diff --git a/include/linux/swapops.h b/include/linux/swapops.h > index 4dfd807ae52a..4129bd2ff9d6 100644 > --- a/include/linux/swapops.h > +++ b/include/linux/swapops.h > @@ -120,6 +120,27 @@ static inline bool is_writable_device_private_entry(swp_entry_t entry) > { > return unlikely(swp_type(entry) == SWP_DEVICE_WRITE); > } > + > +static inline swp_entry_t make_readable_device_exclusive_entry(pgoff_t offset) > +{ > + return swp_entry(SWP_DEVICE_EXCLUSIVE_READ, offset); > +} > + > +static inline swp_entry_t make_writable_device_exclusive_entry(pgoff_t offset) > +{ > + return swp_entry(SWP_DEVICE_EXCLUSIVE_WRITE, offset); > +} > + > +static inline bool is_device_exclusive_entry(swp_entry_t entry) > +{ > + return swp_type(entry) == SWP_DEVICE_EXCLUSIVE_READ || > + swp_type(entry) == SWP_DEVICE_EXCLUSIVE_WRITE; > +} > + > +static inline bool is_writable_device_exclusive_entry(swp_entry_t entry) > +{ > + return unlikely(swp_type(entry) == SWP_DEVICE_EXCLUSIVE_WRITE); > +} > #else /* CONFIG_DEVICE_PRIVATE */ > static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset) > { > @@ -140,6 +161,26 @@ static inline bool is_writable_device_private_entry(swp_entry_t entry) > { > return false; > } > + > +static inline swp_entry_t make_readable_device_exclusive_entry(pgoff_t offset) > +{ > + return swp_entry(0, 0); > +} > + > +static inline swp_entry_t make_writable_device_exclusive_entry(pgoff_t offset) > +{ > + return swp_entry(0, 0); > +} > + > +static inline bool is_device_exclusive_entry(swp_entry_t entry) > +{ > + return false; > +} > + > +static inline bool is_writable_device_exclusive_entry(swp_entry_t entry) > +{ > + return false; > +} > #endif /* CONFIG_DEVICE_PRIVATE */ > > #ifdef CONFIG_MIGRATION > @@ -219,7 +260,8 @@ static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry) > */ > static inline bool is_pfn_swap_entry(swp_entry_t entry) > { > - return is_migration_entry(entry) || is_device_private_entry(entry); > + return is_migration_entry(entry) || is_device_private_entry(entry) || > + is_device_exclusive_entry(entry); > } > > struct page_vma_mapped_walk; > diff --git a/mm/hmm.c b/mm/hmm.c > index 11df3ca30b82..fad6be2bf072 100644 > --- a/mm/hmm.c > +++ b/mm/hmm.c > @@ -26,6 +26,8 @@ > #include <linux/mmu_notifier.h> > #include <linux/memory_hotplug.h> > > +#include "internal.h" > + > struct hmm_vma_walk { > struct hmm_range *range; > unsigned long last; > @@ -271,6 +273,9 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr, > if (!non_swap_entry(entry)) > goto fault; > > + if (is_device_exclusive_entry(entry)) > + goto fault; > + > if (is_migration_entry(entry)) { > pte_unmap(ptep); > hmm_vma_walk->last = addr; > diff --git a/mm/memory.c b/mm/memory.c > index e061cfa18c11..c1d2d732f189 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -700,6 +700,68 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr, > } > #endif > > +static void restore_exclusive_pte(struct vm_area_struct *vma, > + struct page *page, unsigned long address, > + pte_t *ptep) > +{ > + pte_t pte; > + swp_entry_t entry; > + > + pte = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot))); > + if (pte_swp_soft_dirty(*ptep)) > + pte = pte_mksoft_dirty(pte); > + > + entry = pte_to_swp_entry(*ptep); > + if (pte_swp_uffd_wp(*ptep)) > + pte = pte_mkuffd_wp(pte); > + else if (is_writable_device_exclusive_entry(entry)) > + pte = maybe_mkwrite(pte_mkdirty(pte), vma); > + > + set_pte_at(vma->vm_mm, address, ptep, pte); > + > + /* > + * No need to take a page reference as one was already > + * created when the swap entry was made. > + */ > + if (PageAnon(page)) > + page_add_anon_rmap(page, vma, address, false); > + else > + /* > + * Currently device exclusive access only supports anonymous > + * memory so the entry shouldn't point to a filebacked page. > + */ > + WARN_ON_ONCE(!PageAnon(page)); > + > + if (vma->vm_flags & VM_LOCKED) > + mlock_vma_page(page); > + > + /* > + * No need to invalidate - it was non-present before. However > + * secondary CPUs may have mappings that need invalidating. > + */ > + update_mmu_cache(vma, address, ptep); > +} > + > +/* > + * Tries to restore an exclusive pte if the page lock can be acquired without > + * sleeping. > + */ > +static unsigned long Better return a int? > +try_restore_exclusive_pte(struct mm_struct *src_mm, pte_t *src_pte, > + struct vm_area_struct *vma, unsigned long addr) Raised in the other thread too: src_mm can be dropped. > +{ > + swp_entry_t entry = pte_to_swp_entry(*src_pte); > + struct page *page = pfn_swap_entry_to_page(entry); > + > + if (trylock_page(page)) { > + restore_exclusive_pte(vma, page, addr, src_pte); > + unlock_page(page); > + return 0; > + } > + > + return -EBUSY; > +} > + > /* > * copy one vm_area from one task to the other. Assumes the page tables > * already present in the new task to be cleared in the whole range > @@ -781,6 +843,17 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, > pte = pte_swp_mkuffd_wp(pte); > set_pte_at(src_mm, addr, src_pte, pte); > } > + } else if (is_device_exclusive_entry(entry)) { > + /* > + * Make device exclusive entries present by restoring the > + * original entry then copying as for a present pte. Device > + * exclusive entries currently only support private writable > + * (ie. COW) mappings. > + */ > + VM_BUG_ON(!is_cow_mapping(vma->vm_flags)); > + if (try_restore_exclusive_pte(src_mm, src_pte, vma, addr)) > + return -EBUSY; > + return -ENOENT; > } > set_pte_at(dst_mm, addr, dst_pte, pte); > return 0; > @@ -980,9 +1053,18 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, > if (ret == -EAGAIN) { > entry = pte_to_swp_entry(*src_pte); > break; > + } else if (ret == -EBUSY) { > + break; > + } else if (!ret) { > + progress += 8; > + continue; > } > - progress += 8; > - continue; > + > + /* > + * Device exclusive entry restored, continue by copying > + * the now present pte. > + */ > + WARN_ON_ONCE(ret != -ENOENT); The change looks right, thanks. It's just that we should start to consider document all these err code now in copy_pte_range() some day (perhaps on top of this patch).. > } > /* copy_present_pte() will clear `*prealloc' if consumed */ > ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte, > @@ -1019,6 +1101,8 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, > goto out; > } > entry.val = 0; > + } else if (ret == -EBUSY) { > + return -EBUSY; > } else if (ret) { > WARN_ON_ONCE(ret != -EAGAIN); > prealloc = page_copy_prealloc(src_mm, src_vma, addr); > @@ -1283,7 +1367,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, > } > > entry = pte_to_swp_entry(ptent); > - if (is_device_private_entry(entry)) { > + if (is_device_private_entry(entry) || > + is_device_exclusive_entry(entry)) { > struct page *page = pfn_swap_entry_to_page(entry); > > if (unlikely(details && details->check_mapping)) { > @@ -1299,7 +1384,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, > > pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); > rss[mm_counter(page)]--; > - page_remove_rmap(page, false); > + > + if (is_device_private_entry(entry)) > + page_remove_rmap(page, false); > + > put_page(page); > continue; > } > @@ -3303,6 +3391,35 @@ void unmap_mapping_range(struct address_space *mapping, > } > EXPORT_SYMBOL(unmap_mapping_range); > > +/* > + * Restore a potential device exclusive pte to a working pte entry > + */ > +static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf) > +{ > + struct page *page = vmf->page; > + struct vm_area_struct *vma = vmf->vma; > + vm_fault_t ret = 0; > + struct mmu_notifier_range range; > + > + if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) > + return VM_FAULT_RETRY; > + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm, > + vmf->address & PAGE_MASK, > + (vmf->address & PAGE_MASK) + PAGE_SIZE); @MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no longer have exclusive access to the page. Shouldn't this be the place to use new MMU_NOTIFY_EXCLUSIVE? > + mmu_notifier_invalidate_range_start(&range); > + > + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, > + &vmf->ptl); > + if (likely(pte_same(*vmf->pte, vmf->orig_pte))) > + restore_exclusive_pte(vma, page, vmf->address, vmf->pte); > + > + pte_unmap_unlock(vmf->pte, vmf->ptl); > + unlock_page(page); > + > + mmu_notifier_invalidate_range_end(&range); > + return ret; We can drop "ret" and return 0 here directly. > +} > + > /* > * We enter with non-exclusive mmap_lock (to exclude vma changes, > * but allow concurrent faults), and pte mapped but not yet locked. > @@ -3330,6 +3447,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > if (is_migration_entry(entry)) { > migration_entry_wait(vma->vm_mm, vmf->pmd, > vmf->address); > + } else if (is_device_exclusive_entry(entry)) { > + vmf->page = pfn_swap_entry_to_page(entry); > + ret = remove_device_exclusive_entry(vmf); > } else if (is_device_private_entry(entry)) { > vmf->page = pfn_swap_entry_to_page(entry); > ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); > diff --git a/mm/mprotect.c b/mm/mprotect.c > index ee5961888e70..883e2cc85cad 100644 > --- a/mm/mprotect.c > +++ b/mm/mprotect.c > @@ -165,6 +165,14 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, > newpte = swp_entry_to_pte(entry); > if (pte_swp_uffd_wp(oldpte)) > newpte = pte_swp_mkuffd_wp(newpte); > + } else if (is_writable_device_exclusive_entry(entry)) { > + entry = make_readable_device_exclusive_entry( > + swp_offset(entry)); > + newpte = swp_entry_to_pte(entry); > + if (pte_swp_soft_dirty(oldpte)) > + newpte = pte_swp_mksoft_dirty(newpte); > + if (pte_swp_uffd_wp(oldpte)) > + newpte = pte_swp_mkuffd_wp(newpte); > } else { > newpte = oldpte; > } > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c > index a6a7febb4d93..f535bcb4950c 100644 > --- a/mm/page_vma_mapped.c > +++ b/mm/page_vma_mapped.c > @@ -41,7 +41,8 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw) > > /* Handle un-addressable ZONE_DEVICE memory */ > entry = pte_to_swp_entry(*pvmw->pte); > - if (!is_device_private_entry(entry)) > + if (!is_device_private_entry(entry) && > + !is_device_exclusive_entry(entry)) > return false; > } else if (!pte_present(*pvmw->pte)) > return false; > @@ -93,7 +94,8 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw) > return false; > entry = pte_to_swp_entry(*pvmw->pte); > > - if (!is_migration_entry(entry)) > + if (!is_migration_entry(entry) && > + !is_device_exclusive_entry(entry)) > return false; > > pfn = swp_offset(entry); > @@ -102,7 +104,8 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw) > > /* Handle un-addressable ZONE_DEVICE memory */ > entry = pte_to_swp_entry(*pvmw->pte); > - if (!is_device_private_entry(entry)) > + if (!is_device_private_entry(entry) && > + !is_device_exclusive_entry(entry)) > return false; > > pfn = swp_offset(entry); > diff --git a/mm/rmap.c b/mm/rmap.c > index 8ed1853060cf..fe062f63ef4d 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -2008,6 +2008,192 @@ void page_mlock(struct page *page) > rmap_walk(page, &rwc); > } > > +struct make_exclusive_args { > + struct mm_struct *mm; > + unsigned long address; > + void *owner; > + bool valid; > +}; > + > +static bool page_make_device_exclusive_one(struct page *page, > + struct vm_area_struct *vma, unsigned long address, void *priv) > +{ > + struct mm_struct *mm = vma->vm_mm; > + struct page_vma_mapped_walk pvmw = { > + .page = page, > + .vma = vma, > + .address = address, > + }; > + struct make_exclusive_args *args = priv; > + pte_t pteval; > + struct page *subpage; > + bool ret = true; > + struct mmu_notifier_range range; > + swp_entry_t entry; > + pte_t swp_pte; > + > + mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma, Similar question here, EXCLUSIVE comment says it gets notified when the device does not have exclusive access. If you prefer to keep using EXCLUSIVE for both mark/restore, then we need to change the comment above MMU_NOTIFY_EXCLUSIVE? > + vma->vm_mm, address, min(vma->vm_end, > + address + page_size(page)), args->owner); > + mmu_notifier_invalidate_range_start(&range); > + > + while (page_vma_mapped_walk(&pvmw)) { > + /* Unexpected PMD-mapped THP? */ > + VM_BUG_ON_PAGE(!pvmw.pte, page); > + > + if (!pte_present(*pvmw.pte)) { > + ret = false; > + page_vma_mapped_walk_done(&pvmw); > + break; > + } > + > + subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte); I see that all pages passed in should be done after FOLL_SPLIT_PMD, so is this needed? Or say, should subpage==page always be true? > + address = pvmw.address; > + > + /* Nuke the page table entry. */ > + flush_cache_page(vma, address, pte_pfn(*pvmw.pte)); > + pteval = ptep_clear_flush(vma, address, pvmw.pte); > + > + /* Move the dirty bit to the page. Now the pte is gone. */ > + if (pte_dirty(pteval)) > + set_page_dirty(page); > + > + if (arch_unmap_one(mm, vma, address, pteval) < 0) { > + set_pte_at(mm, address, pvmw.pte, pteval); > + ret = false; > + page_vma_mapped_walk_done(&pvmw); > + break; > + } Didn't notice this previously, but also suggest to drop this. Two reasons: 1. It's introduced in ca827d55ebaa ("mm, swap: Add infrastructure for saving page metadata on swap", 2018-03-18) for sparc-only use so far. If we really want this, we'll also want to call arch_do_swap_page() when restoring the pte just like what we do in do_swap_page(); NOTE: current code path of SWP_DEVICE_EXCLUSIVE will skip the arch_do_swap_page() in do_swap_page() so it's not even paired with the above arch_unmap_one(), so I believe this won't even work for sparc at all. 2. I highly doubt whether sparc is also on the list of platforms to support for device atomic ops even in the future. IMHO we'd better not copy-paste code clips if never used at all, because once merged, removing it would need more justifications. > + > + /* > + * Check that our target page is still mapped at the expected > + * address. > + */ > + if (args->mm == mm && args->address == address && > + pte_write(pteval)) > + args->valid = true; > + > + /* > + * Store the pfn of the page in a special migration > + * pte. do_swap_page() will wait until the migration > + * pte is removed and then restart fault handling. > + */ > + if (pte_write(pteval)) > + entry = make_writable_device_exclusive_entry( > + page_to_pfn(subpage)); > + else > + entry = make_readable_device_exclusive_entry( > + page_to_pfn(subpage)); > + swp_pte = swp_entry_to_pte(entry); > + if (pte_soft_dirty(pteval)) > + swp_pte = pte_swp_mksoft_dirty(swp_pte); > + if (pte_uffd_wp(pteval)) > + swp_pte = pte_swp_mkuffd_wp(swp_pte); > + > + /* Take a reference for the swap entry */ > + get_page(page); > + set_pte_at(mm, address, pvmw.pte, swp_pte); > + > + page_remove_rmap(subpage, PageHuge(page)); Why PageHuge()? Should it be a constant "false"? > + put_page(page); Should we drop this put_page() along with get_page() above? page_count() should be >0 anyway as we've got a mapcount before at least when dropping the pte. Then IMHO we can simply keep the old page reference. > + } > + > + mmu_notifier_invalidate_range_end(&range); > + > + return ret; > +} > + > +/** > + * page_make_device_exclusive - replace page table mappings with swap entries "with swap entries" looks a bit blurred to me (although below longer comment explains much better). How about below (or something similar): page_make_device_exclusive - Mark the page exclusively owned by the device ? It'll also match with comment above make_device_exclusive_range(). No strong opinion. The rest looks good. Thanks, > + * @page: the page to replace page table entries for > + * @mm: the mm_struct where the page is expected to be mapped > + * @address: address where the page is expected to be mapped > + * @owner: passed to MMU_NOTIFY_EXCLUSIVE range notifier callbacks > + * > + * Tries to remove all the page table entries which are mapping this page and > + * replace them with special device exclusive swap entries to grant a device > + * exclusive access to the page. Caller must hold the page lock. > + * > + * Returns false if the page is still mapped, or if it could not be unmapped > + * from the expected address. Otherwise returns true (success). > + */ > +static bool page_make_device_exclusive(struct page *page, struct mm_struct *mm, > + unsigned long address, void *owner) > +{ > + struct make_exclusive_args args = { > + .mm = mm, > + .address = address, > + .owner = owner, > + .valid = false, > + }; > + struct rmap_walk_control rwc = { > + .rmap_one = page_make_device_exclusive_one, > + .done = page_not_mapped, > + .anon_lock = page_lock_anon_vma_read, > + .arg = &args, > + }; > + > + /* > + * Restrict to anonymous pages for now to avoid potential writeback > + * issues. > + */ > + if (!PageAnon(page)) > + return false; > + > + rmap_walk(page, &rwc); > + > + return args.valid && !page_mapcount(page); > +} > + > +/** > + * make_device_exclusive_range() - Mark a range for exclusive use by a device > + * @mm: mm_struct of assoicated target process > + * @start: start of the region to mark for exclusive device access > + * @end: end address of region > + * @pages: returns the pages which were successfully marked for exclusive access > + * @owner: passed to MMU_NOTIFY_EXCLUSIVE range notifier to allow filtering > + * > + * Returns: number of pages found in the range by GUP. A page is marked for > + * exclusive access only if the page pointer is non-NULL. > + * > + * This function finds ptes mapping page(s) to the given address range, locks > + * them and replaces mappings with special swap entries preventing userspace CPU > + * access. On fault these entries are replaced with the original mapping after > + * calling MMU notifiers. > + * > + * A driver using this to program access from a device must use a mmu notifier > + * critical section to hold a device specific lock during programming. Once > + * programming is complete it should drop the page lock and reference after > + * which point CPU access to the page will revoke the exclusive access. > + */ > +int make_device_exclusive_range(struct mm_struct *mm, unsigned long start, > + unsigned long end, struct page **pages, > + void *owner) > +{ > + unsigned long npages = (end - start) >> PAGE_SHIFT; > + unsigned long i; > + > + npages = get_user_pages_remote(mm, start, npages, > + FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD, > + pages, NULL, NULL); > + for (i = 0; i < npages; i++, start += PAGE_SIZE) { > + if (!trylock_page(pages[i])) { > + put_page(pages[i]); > + pages[i] = NULL; > + continue; > + } > + > + if (!page_make_device_exclusive(pages[i], mm, start, owner)) { > + unlock_page(pages[i]); > + put_page(pages[i]); > + pages[i] = NULL; > + } > + } > + > + return npages; > +} > +EXPORT_SYMBOL_GPL(make_device_exclusive_range); > + > void __put_anon_vma(struct anon_vma *anon_vma) > { > struct anon_vma *root = anon_vma->root; > -- > 2.20.1 > -- Peter Xu