Alistair,

While I got one reply below to your previous email, I also looked at the rest
of the code (mainly the restore and fork sides) today and added a few more
comments.

On Tue, May 18, 2021 at 11:19:05PM +1000, Alistair Popple wrote:

[...]

> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 3a5705cfc891..556ff396f2e9 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -700,6 +700,84 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> > >  }
> > >  #endif
> > > 
> > > +static void restore_exclusive_pte(struct vm_area_struct *vma,
> > > +				  struct page *page, unsigned long address,
> > > +				  pte_t *ptep)
> > > +{
> > > +	pte_t pte;
> > > +	swp_entry_t entry;
> > > +
> > > +	pte = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
> > > +	if (pte_swp_soft_dirty(*ptep))
> > > +		pte = pte_mksoft_dirty(pte);
> > > +
> > > +	entry = pte_to_swp_entry(*ptep);
> > > +	if (pte_swp_uffd_wp(*ptep))
> > > +		pte = pte_mkuffd_wp(pte);
> > > +	else if (is_writable_device_exclusive_entry(entry))
> > > +		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> > > +
> > > +	set_pte_at(vma->vm_mm, address, ptep, pte);
> > > +
> > > +	/*
> > > +	 * No need to take a page reference as one was already
> > > +	 * created when the swap entry was made.
> > > +	 */
> > > +	if (PageAnon(page))
> > > +		page_add_anon_rmap(page, vma, address, false);
> > > +	else
> > > +		page_add_file_rmap(page, false);

This seems to be another leftover; maybe WARN_ON_ONCE(!PageAnon(page))?

> > > +
> > > +	if (vma->vm_flags & VM_LOCKED)
> > > +		mlock_vma_page(page);
> > > +
> > > +	/*
> > > +	 * No need to invalidate - it was non-present before. However
> > > +	 * secondary CPUs may have mappings that need invalidating.
> > > +	 */
> > > +	update_mmu_cache(vma, address, ptep);
> > > +}

[...]

> > >  /*
> > >   * copy one vm_area from one task to the other.
> > >   * Assumes the page tables already present in the new task to be
> > >   * cleared in the whole range
> > > 
> > > @@ -781,6 +859,12 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > >  			pte = pte_swp_mkuffd_wp(pte);
> > >  			set_pte_at(src_mm, addr, src_pte, pte);
> > >  		}
> > > +	} else if (is_device_exclusive_entry(entry)) {
> > > +		/* COW mappings should be dealt with by removing the entry */

Here the comment says "removing the entry", however I think it doesn't remove
the pte; instead it keeps it (there's no "return", so set_pte_at() will be
called below), so I got a bit confused.

> > > +		VM_BUG_ON(is_cow_mapping(vm_flags));

Also here: if PageAnon() is the only case supported so far, doesn't that
easily satisfy is_cow_mapping()?  Maybe I missed something..

I also have a pure and high-level question regarding a process fork() when
there are device exclusive ptes: would the two processes then own the device
together?  Is this a real use case?  It would indeed be odd for a COW page,
since after parent/child write to the page it'll clone into two, and then
it's a mystery which one will be the one "exclusively owned" by the device..
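To make the fork() question concrete, here is a tiny userspace model (purely
illustrative; all names are made up, this is not the patch's code) of the
policy I understand the series to be proposing: a device exclusive entry in a
COW mapping must be restored to a normal pte before copying, while a non-COW
mapping copies the entry as-is:

```c
#include <assert.h>
#include <stdbool.h>

enum fork_action {
	COPY_ENTRY_AS_IS,	/* child inherits the device exclusive entry */
	RESTORE_THEN_COPY,	/* restore a normal pte in the parent first */
};

/* COW pages can't stay "exclusively owned" after a later COW split,
 * so restore before copying; shared mappings keep the entry. */
static enum fork_action device_exclusive_fork_action(bool cow_mapping)
{
	return cow_mapping ? RESTORE_THEN_COPY : COPY_ENTRY_AS_IS;
}
```

This is only a sketch of the decision, not of the pte manipulation itself.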
> > > +		page = pfn_swap_entry_to_page(entry);
> > > +		get_page(page);
> > > +		rss[mm_counter(page)]++;
> > >  	}
> > >  	set_pte_at(dst_mm, addr, dst_pte, pte);
> > >  	return 0;
> > > 
> > > @@ -947,6 +1031,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> > >  	int rss[NR_MM_COUNTERS];
> > >  	swp_entry_t entry = (swp_entry_t){0};
> > >  	struct page *prealloc = NULL;
> > > +	struct page *locked_page = NULL;
> > > 
> > >  again:
> > >  	progress = 0;
> > > 
> > > @@ -980,13 +1065,36 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> > >  			continue;
> > >  		}
> > >  		if (unlikely(!pte_present(*src_pte))) {
> > > -			entry.val = copy_nonpresent_pte(dst_mm, src_mm,
> > > -							dst_pte, src_pte,
> > > -							src_vma, addr, rss);
> > > -			if (entry.val)
> > > -				break;
> > > -			progress += 8;
> > > -			continue;
> > > +			swp_entry_t swp_entry = pte_to_swp_entry(*src_pte);

(Just a side note to all of us: this will be one more place that I'll need to
look after in my uffd-wp series if this series lands first, as after that
series we can only call pte_to_swp_entry() after a pte_has_swap_entry()
check, since sometimes a non-present pte won't contain a swap entry at all.)

> > > +
> > > +			if (unlikely(is_cow_mapping(src_vma->vm_flags) &&
> > > +				     is_device_exclusive_entry(swp_entry))) {
> > > +				/*
> > > +				 * Normally this would require sending mmu
> > > +				 * notifiers, but copy_page_range() has already
> > > +				 * done that for COW mappings.
> > > +				 */
> > > +				ret = try_restore_exclusive_pte(src_mm, src_pte,
> > > +								src_vma, addr,
> > > +								&locked_page);
> > > +				if (ret == -EBUSY)
> > > +					break;

Would it be possible to put all the handling of device exclusive ptes into
copy_nonpresent_pte()?  IMHO all device exclusive ptes should still be one
kind of non-present pte, and splitting the cases really makes it (at least to
me...) even harder to read.
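As a rough illustration of what "putting it all into copy_nonpresent_pte()"
could mean, here is a userspace sketch (hypothetical names and a simplified
outcome enum, not a proposed implementation) where the device exclusive case
is just one more branch of the non-present dispatch rather than a special
case in the caller:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical kinds of entry a non-present pte can encode. */
enum swp_kind {
	SWP_SWAP,
	SWP_MIGRATION,
	SWP_DEVICE_PRIVATE,
	SWP_DEVICE_EXCLUSIVE,
};

enum copy_outcome {
	COPIED,		/* entry duplicated into the child as-is */
	RESTORE_FIRST,	/* caller must restore a normal pte first */
};

/*
 * Sketch of a copy_nonpresent_pte() that owns the device exclusive case
 * itself: the only new outcome is "restore first" for COW mappings; the
 * existing swap/migration/device-private paths are elided.
 */
static enum copy_outcome copy_nonpresent(enum swp_kind kind, bool cow_mapping)
{
	if (kind == SWP_DEVICE_EXCLUSIVE && cow_mapping)
		return RESTORE_FIRST;
	return COPIED;
}
```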
Maybe you wanted to avoid the rework of copy_nonpresent_pte() as it currently
returns an entry.val which is indeed not straightforward already..  I wanted
to clean that up but haven't yet.

An easier option is perhaps to fail the fork() directly when trylock_page()
fails when restoring the pte, so userspace could retry the whole fork().
However that will also depend on my previous question of whether this is a
valid scenario after all.  If "maintaining fork correctness" is the only
thing we pursue, maybe it's still worth considering?

> > > +			} else {
> > > +				entry.val = copy_nonpresent_pte(dst_mm, src_mm,
> > > +								dst_pte, src_pte,
> > > +								src_vma, addr,
> > > +								rss);
> > > +				if (entry.val)
> > > +					break;
> > > +				progress += 8;
> > > +				continue;
> > > +			}
> > > +		}
> > > +		/* a non-present pte became present after dropping the ptl */
> > > +		if (unlikely(locked_page)) {
> > > +			unlock_page(locked_page);
> > > +			put_page(locked_page);
> > > +			locked_page = NULL;
> > >  		}
> > >  		/* copy_present_pte() will clear `*prealloc' if consumed */
> > >  		ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,

[...]
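The "just fail the fork()" alternative can be modeled in a few lines of
userspace C (names invented; lock_ok[i] stands in for trylock_page() on the
i-th backing page): bail out with -EAGAIN on the first page whose lock can't
be taken immediately, and let userspace retry the whole fork().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Model of failing fork() eagerly: walk the device exclusive ptes in the
 * range and return -EAGAIN as soon as one backing page can't be trylocked,
 * instead of dropping the ptl and retrying inside the kernel.
 */
static int copy_exclusive_range(const bool *lock_ok, int nr_ptes)
{
	for (int i = 0; i < nr_ptes; i++) {
		if (!lock_ok[i])
			return -EAGAIN;	/* caller fails the whole fork() */
		/* ... restore the pte and unlock the page here ... */
	}
	return 0;
}
```

The tradeoff is pushing the retry loop out to userspace in exchange for a
much simpler copy path.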
> > > +static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> > > +{
> > > +	struct page *page = vmf->page;
> > > +	struct vm_area_struct *vma = vmf->vma;
> > > +	struct page_vma_mapped_walk pvmw = {
> > > +		.page = page,
> > > +		.vma = vma,
> > > +		.address = vmf->address,
> > > +		.flags = PVMW_SYNC,
> > > +	};
> > > +	vm_fault_t ret = 0;
> > > +	struct mmu_notifier_range range;
> > > +
> > > +	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags))
> > > +		return VM_FAULT_RETRY;
> > > +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
> > > +				vmf->address & PAGE_MASK,
> > > +				(vmf->address & PAGE_MASK) + PAGE_SIZE);
> > > +	mmu_notifier_invalidate_range_start(&range);
> > 
> > I looked at the MMU_NOTIFY_CLEAR document and it tells me:
> > 
> >  * @MMU_NOTIFY_CLEAR: clear page table entry (many reasons for this like
> >  * madvise() or replacing a page by another one, ...).
> > 
> > Does MMU_NOTIFY_CLEAR suit this case?  Normally I think for such a case
> > (existing pte is invalid) we don't need to notify at all.  However from
> > what I read from the whole series, this seems to be a critical point
> > where we'd like to kick the owner/driver to synchronously stop doing
> > atomic operations from the device.  Not sure whether we'd like a new
> > notifier type, or maybe at least comment on why to use CLEAR?
> 
> Right, notifying the owner/driver when it no longer has exclusive access to
> the page and allowing it to stop atomic operations is the critical point and
> why it notifies when we ordinarily wouldn't (ie. invalid -> valid).
> 
> I did consider adding a new type, but in the driver implementation it ends up
> being treated the same as a CLEAR notification anyway so didn't think it was
> necessary. But I suppose adding a different type would allow other listening
> notifiers to filter these which might be worthwhile.

Sounds good to me.

[...]
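For reference, the notifier range in the hunk above spans exactly the one
page holding vmf->address. The alignment arithmetic, modeled standalone with
an assumed 4K page size (the MODEL_* names are made up so as not to be
confused with the real kernel macros):

```c
#include <assert.h>

#define MODEL_PAGE_SIZE	4096UL
#define MODEL_PAGE_MASK	(~(MODEL_PAGE_SIZE - 1))

/* Round the faulting address down to its page start. */
static unsigned long notify_start(unsigned long addr)
{
	return addr & MODEL_PAGE_MASK;
}

/* End of the range is exactly one page past the start. */
static unsigned long notify_end(unsigned long addr)
{
	return (addr & MODEL_PAGE_MASK) + MODEL_PAGE_SIZE;
}
```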
> > > +	/*
> > > +	 * Check that our target page is still mapped at the expected
> > > +	 * address.
> > > +	 */
> > > +	if (ttp->mm == mm && ttp->address == address &&
> > > +	    pte_write(pteval))
> > > +		ttp->valid = true;
> > 
> > I think I get the point of doing this (as after GUP the pte could have
> > been changed to point to another page), however it smells a bit odd to me
> > (or it's also possible that I'm not familiar enough with the code
> > base..).  IIUC this is the _only_ reason that we passed "address" into
> > try_to_protect() too, and further into the ttp_args.
> 
> Yes, this is why "address" is passed up to ttp_args.
> 
> > The odd part is the remote GUP should have walked the page table already,
> > so since the target here is the vaddr to replace, the 1st page table walk
> > should be able to both trylock/lock the page, then modify the pte with
> > the pgtable lock held, return the locked page, then walk the rmap again
> > to remove all the rest of the ptes that are mapping to this page.  In
> > that case before we call rmap_walk() we know this must be the page we
> > want to take care of, and no one will be able to restore the original mm
> > pte either (as we're holding the page lock).  Then we don't need this
> > check, neither do we need ttp->address.
> 
> If I am understanding you correctly I think this would be similar to the
> approach that was taken in v2. However it pretty much ended up being just an
> open-coded version of gup which is useful anyway to fault the page in.

I see.  For easier reference this is v2 patch 1:

https://lore.kernel.org/lkml/20210219020750.16444-2-apopple@xxxxxxxxxx/

Indeed that looks like it; it's just that instead of grabbing the page only
in hmm_exclusive_pmd() we can do the pte modification along the way, to seal
the whole thing (address/pte & page).  I saw Christoph and Jason commented
on that patch, but not regarding this approach.  So is there a reason that
you switched?  Do you think it'll work?
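For clarity, the re-check quoted above boils down to a pure predicate, which
can be modeled in userspace like this (struct and names are invented
stand-ins for the relevant ttp_args fields, not the actual kernel types):

```c
#include <assert.h>
#include <stdbool.h>

/* Made-up stand-in for the relevant fields of ttp_args. */
struct ttp_check {
	const void *mm;		/* mm the caller GUPed in */
	unsigned long address;	/* vaddr the caller wants to replace */
};

/*
 * While rmap-walking, a mapping only marks the attempt valid if it is in
 * the same mm, at the same virtual address, and the pte was writable.
 */
static bool still_mapped_at_target(const struct ttp_check *ttp,
				   const void *walk_mm,
				   unsigned long walk_addr, bool pte_writable)
{
	return ttp->mm == walk_mm && ttp->address == walk_addr && pte_writable;
}
```

Any other mapping of the page fails the predicate and is simply unmapped.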
I have no strong opinion either way; it's just not crystal clear why we'd
need that ttp->address at all for an rmap walk, along with that "valid"
field.

Meanwhile, the current approach should also be slightly less efficient,
especially when the page array gets huge, I think: since there'll be a
longer time between the GUP and the rmap walk, there's a higher possibility
that the GUPed pages get replaced for whatever reason, and then the call to
make_device_exclusive_range() will fail as a whole just because of a single
page replacement within the range.

Thanks,

-- 
Peter Xu