On Tuesday, 18 May 2021 11:19:05 PM AEST Alistair Popple wrote: [...] > > > +/* > > > + * Restore a potential device exclusive pte to a working pte entry > > > + */ > > > +static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf) > > > +{ > > > + struct page *page = vmf->page; > > > + struct vm_area_struct *vma = vmf->vma; > > > + struct page_vma_mapped_walk pvmw = { > > > + .page = page, > > > + .vma = vma, > > > + .address = vmf->address, > > > + .flags = PVMW_SYNC, > > > + }; > > > + vm_fault_t ret = 0; > > > + struct mmu_notifier_range range; > > > + > > > + if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) > > > + return VM_FAULT_RETRY; > > > + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, > > > vma->vm_mm, > > > + vmf->address & PAGE_MASK, > > > + (vmf->address & PAGE_MASK) + PAGE_SIZE); > > > + mmu_notifier_invalidate_range_start(&range); > > > > I looked at MMU_NOTIFIER_CLEAR document and it tells me: > > > > * @MMU_NOTIFY_CLEAR: clear page table entry (many reasons for this like > > * madvise() or replacing a page by another one, ...). > > > > Does MMU_NOTIFIER_CLEAR suite for this case? Normally I think for such a > > case (existing pte is invalid) we don't need to notify at all. However > > from what I read from the whole series, this seems to be a critical point > > where we'd like to kick the owner/driver to synchronously stop doing > > atomic > > operations from the device. Not sure whether we'd like a new notifier > > type, or maybe at least comment on why to use CLEAR? > > Right, notifying the owner/driver when it no longer has exclusive access to > the page and allowing it to stop atomic operations is the critical point and > why it notifies when we ordinarily wouldn't (ie. invalid -> valid). > > I did consider adding a new type, but in the driver implementation it ends > up > being treated the same as a CLEAR notification anyway so didn't think it was > necessary. But I suppose adding a different type would allow other listening > notifiers to filter these which might be worthwhile. > > > > + > > > + while (page_vma_mapped_walk(&pvmw)) { > > > > IIUC a while loop of page_vma_mapped_walk() handles thps, however here > > it's > > already in do_swap_page() so it's small pte only? Meanwhile... > > > > > + if (unlikely(!pte_same(*pvmw.pte, vmf->orig_pte))) { > > > + page_vma_mapped_walk_done(&pvmw); > > > + break; > > > + } > > > + > > > + restore_exclusive_pte(vma, page, pvmw.address, pvmw.pte); > > > > ... I'm not sure whether passing in page would work for thp after all, as > > iiuc we may need to pass in the subpage rather than the head page if so. > > Hmm, I need to check this and follow up. Thanks Peter for pointing that out. I needed to follow this up because I had slightly misunderstood page_vma_mapped_walk(). As you say this is actually a small pte and restore_exclusive_pte() needs the actual page from the fault. So I should be able to drop the page_vma_mapped_walk() and use pte_offset_map_lock() to call restore_exclusive_pte directly. - Alistair