Hi Alistair,

> >>
> >> Yes, although obviously as I think you point out below you wouldn't be
> >> able to take any sleeping locks in mmu_notifier_update_mapping().
> > Yes, I understand that, but I am not sure how we can prevent any potential
> > notifier callback from taking sleeping locks other than adding clear
> > comments.
>
> Oh of course not, but is such a restriction on not taking sleeping locks
> acceptable for your implementation of the notifier callback? I notice in
> patch 2 update_udmabuf() takes a mutex so I assumed not being able to
> sleep in the callback would be an issue.

I plan to drop the mutex in v2. As I described in my previous reply, it is
not really needed because we ensure Guest and Host synchronization via
other means.

>
> >>
> >> > In which case I'd need to make a similar change in the shmem path as
> >> > well. And, also redo (or eliminate) the locking in udmabuf (patch)
> >> > which seems a bit excessive on a second look given our use-case
> >> > (where reads and writes do not happen simultaneously due to fence
> >> > synchronization in the guest driver).
> >>
> >> I'm not at all familiar with the udmabuf use case but that sounds
> >> brittle and effectively makes this notifier udmabuf specific right?
> > Oh, Qemu uses the udmabuf driver to provide Host Graphics components
> > (such as Spice, Gstreamer, UI, etc) zero-copy access to Guest created
> > buffers. In other words, from a core mm standpoint, udmabuf just
> > collects a bunch of pages (associated with buffers) scattered inside
> > the memfd (Guest ram backed by shmem or hugetlbfs) and wraps
> > them in a dmabuf fd. And, since we provide zero-copy access, we
> > use DMA fences to ensure that the components on the Host and
> > Guest do not access the buffer simultaneously.
>
> Thanks for the background!
>
> >> contemplated adding a notifier for PTE updates for drivers using
> >> hmm_range_fault() as it would save some expensive device faults and
> >> this could be useful for that.
> >>
> >> So if we're adding a notifier for PTE updates I think it would be good
> >> if it covered all cases and was robust enough to allow mirroring of the
> >> correct PTE value (ie. by being called under PTL or via some other
> >> synchronisation like hmm_range_fault()).
> > Ok; in order to make it clear that the notifier is associated with PTE
> > updates, I think it needs to have a more suitable name such as
> > mmu_notifier_update_pte() or mmu_notifier_new_pte(). But we already have
> > mmu_notifier_change_pte, which IIUC is used mainly for PTE updates
> > triggered by KSM. So, I am inclining towards dropping this new notifier
> > and instead adding a new flag to change_pte to distinguish between KSM
> > triggered notifications and others. Something along the lines of:
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index 218ddc3b4bc7..6afce2287143 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -129,7 +129,8 @@ struct mmu_notifier_ops {
> >         void (*change_pte)(struct mmu_notifier *subscription,
> >                            struct mm_struct *mm,
> >                            unsigned long address,
> > -                          pte_t pte);
> > +                          pte_t pte,
> > +                          bool ksm_update);
> > @@ -658,7 +659,7 @@ static inline void mmu_notifier_range_init_owner(
> >         unsigned long ___address = __address;                           \
> >         pte_t ___pte = __pte;                                           \
> >                                                                         \
> > -       mmu_notifier_change_pte(___mm, ___address, ___pte);             \
> > +       mmu_notifier_change_pte(___mm, ___address, ___pte, true);       \
> >
> > And replace mmu_notifier_update_mapping(vma->vm_mm, address,
> > pte_pfn(*ptep)) in the current patch with
> > mmu_notifier_change_pte(vma->vm_mm, address, ptep, false);
>
> I wonder if we actually need the flag? IIUC it is already used for more
> than just KSM.
> For example it can be called as part of fault handling by
> set_pte_at_notify() in wp_page_copy().

Yes, I noticed that, but what I really meant is that the flag would put all
these prior instances of change_pte in one category. Without the flag, KVM,
the only user that currently has a change_pte callback, would also get
notified of the new updates, which may not be appropriate. Note that, based
on the Git log, the change_pte callback for KVM was added for KSM updates,
and it is not clear to me whether that is still the case.

>
> > Would that work for your HMM use-case -- assuming we call change_pte
> > after taking PTL?
>
> I suspect being called under the PTL could be an issue. For HMM we use a
> combination of sequence numbers and a mutex to synchronise PTEs. To
> avoid calling the notifier while holding PTL we might be able to record
> the sequence number (subscriptions->invalidate_seq) while holding PTL,
> release the PTL and provide that sequence number to the notifier
> callback along with the PTE.
>
> Some form of mmu_interval_read_retry() could then be used to detect if
> the PTE has changed between dropping the PTL and calling the
> update_pte()/change_pte() notifier.
>
> Of course if your design can handle being called while holding the PTL
> then the above is probably unnecessarily complex for your use-case.

Yes, I believe we can handle it while holding the PTL.

>
> My primary issue with this patch is the notifier is called without the
> PTL while providing a PTE value. Without some form of synchronisation it
> isn't safe to use the result of eg. pte_page(pte) or pte_write(pte) in
> the notifier callback. Based on your comments it seems udmabuf might
> have some other synchronisation that makes it safe, but being external
> to the notifier calls makes it hard to reason about.
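The sequence-number idea discussed above (record a sequence number under the PTL, drop the PTL, then let the callback detect a stale PTE) can be illustrated with a small single-threaded userspace model. This is only a sketch under stated assumptions: every name here (fake_mm, pte_snapshot, fake_read_retry() and so on) is hypothetical and stands in for kernel concepts; none of it is the kernel API, and the "lock" is just a flag used to check the calling discipline.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the state protected by the PTL. */
struct fake_mm {
    bool ptl_held;                /* models "PTL is currently held" */
    unsigned long invalidate_seq; /* bumped on every PTE change */
    unsigned long pte;            /* models the PTE value */
};

/* What the notifier callback would be handed after the PTL is dropped. */
struct pte_snapshot {
    unsigned long pte;
    unsigned long seq;
};

static void fake_ptl_lock(struct fake_mm *mm)
{
    assert(!mm->ptl_held);
    mm->ptl_held = true;
}

static void fake_ptl_unlock(struct fake_mm *mm)
{
    assert(mm->ptl_held);
    mm->ptl_held = false;
}

/* Update the PTE and record (pte, seq) while the model PTL is held. */
static struct pte_snapshot fake_set_pte(struct fake_mm *mm, unsigned long new_pte)
{
    struct pte_snapshot snap;

    fake_ptl_lock(mm);
    mm->pte = new_pte;
    mm->invalidate_seq++;
    snap.pte = mm->pte;
    snap.seq = mm->invalidate_seq;
    fake_ptl_unlock(mm);
    return snap;
}

/* Returns true when the snapshot is stale and must not be used. */
static bool fake_read_retry(const struct fake_mm *mm, unsigned long seq)
{
    return mm->invalidate_seq != seq;
}

/*
 * Hypothetical callback, invoked without the model PTL. It mirrors the
 * PTE only if nothing changed since the snapshot was taken. A real user
 * would also hold its own lock across the check and the commit.
 */
static bool fake_update_pte_cb(const struct fake_mm *mm,
                               struct pte_snapshot snap,
                               unsigned long *mirror)
{
    assert(!mm->ptl_held);
    if (fake_read_retry(mm, snap.seq))
        return false;
    *mirror = snap.pte;
    return true;
}

/* Returns 0 when a fresh snapshot is accepted and a stale one rejected. */
static int demo_seq_retry(void)
{
    struct fake_mm mm = { .ptl_held = false, .invalidate_seq = 0, .pte = 0 };
    struct pte_snapshot first, second;
    unsigned long mirror = 0;

    first = fake_set_pte(&mm, 0x1000);
    if (!fake_update_pte_cb(&mm, first, &mirror))
        return 1;
    if (mirror != 0x1000)
        return 2;

    second = fake_set_pte(&mm, 0x2000);
    (void)fake_set_pte(&mm, 0x3000); /* PTE changes before the callback runs */
    if (fake_update_pte_cb(&mm, second, &mirror))
        return 3;
    if (mirror != 0x1000)
        return 4;
    return 0;
}
```

In this model the callback never dereferences a PTE it cannot validate, which is the property asked for above; the cost is that a rejected snapshot has to be handled by waiting for the next notification or by retrying.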

I intend to fix the PTL issue in v2, but I am still not sure what the best
approach is as far as the notifier is concerned, given the following
options:
- Keep this patch (and notifier name) but ensure that it is called under
  PTL
- Drop this patch and expand the use of change_pte but add the flag to
  distinguish between prior usage and new usage
- Keep this patch but don't include the PTE or the pfn of the new page as
  part of the notifier. In other words, just have this:

    mmu_notifier_update_mapping(struct mm_struct *mm, unsigned long address)

  This way, in the udmabuf driver, we could get the new page from the page
  cache as soon as we get notified:

    mapoff = linear_page_index(vma, address);
    new_page = find_get_page(vma->vm_file->f_mapping, mapoff);

This last option would probably limit the new notifier to the udmabuf
use-case, but I do not intend to pursue it as you suggested that you are
also interested in a new notifier associated with PTE updates.

Thanks,
Vivek

>
> - Alistair
>
> > Thanks,
> > Vivek
> >
> >>
> >> Thanks.
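For the third option in the reply above (a notifier that carries only the address, with the callback looking the page up itself), the index arithmetic can be sketched in userspace as follows. Everything here is a hypothetical model: fake_vma, fake_mapping and the fake_* helpers are illustrative stand-ins, the page cache is a plain array, and 4 KiB pages are assumed. It mirrors the offset computation that linear_page_index() performs for a regular mapping, and it ignores reference counting (the real find_get_page() takes a page reference that the caller must drop).

```c
#include <assert.h>
#include <stddef.h>

#define FAKE_PAGE_SHIFT  12 /* assumption: 4 KiB pages */
#define FAKE_CACHE_PAGES 8  /* size of the toy page cache */

struct fake_page {
    int id;
};

/* Toy page cache: one slot per file page index. */
struct fake_mapping {
    struct fake_page *pages[FAKE_CACHE_PAGES];
};

/* Only the VMA fields that the index computation needs. */
struct fake_vma {
    unsigned long vm_start;       /* first virtual address of the mapping */
    unsigned long vm_pgoff;       /* file offset of vm_start, in pages */
    struct fake_mapping *mapping; /* backing file's page cache */
};

/* Page index within the file for a virtual address inside the VMA. */
static unsigned long fake_linear_page_index(const struct fake_vma *vma,
                                            unsigned long address)
{
    return ((address - vma->vm_start) >> FAKE_PAGE_SHIFT) + vma->vm_pgoff;
}

/* Look up a page by index; NULL when it is absent or out of range. */
static struct fake_page *fake_find_page(const struct fake_mapping *mapping,
                                        unsigned long index)
{
    if (index >= FAKE_CACHE_PAGES)
        return NULL;
    return mapping->pages[index];
}

/* Hypothetical callback body: address in, current backing page out. */
static struct fake_page *fake_update_mapping_cb(const struct fake_vma *vma,
                                                unsigned long address)
{
    return fake_find_page(vma->mapping, fake_linear_page_index(vma, address));
}

/* Returns the id of the page found for an address three pages into the VMA. */
static int demo_lookup_page_id(void)
{
    static struct fake_page page = { .id = 42 };
    struct fake_mapping mapping = { .pages = { NULL } };
    struct fake_vma vma = {
        .vm_start = 0x10000000UL,
        .vm_pgoff = 2,
        .mapping = &mapping,
    };
    struct fake_page *found;

    mapping.pages[5] = &page; /* file page index 3 + vm_pgoff 2 */
    found = fake_update_mapping_cb(&vma, vma.vm_start + 3 * 4096UL + 0x10);
    return found ? found->id : -1;
}
```

Because the callback derives the page from the address alone, it never consumes a PTE value that might be stale; whether the page it finds is the one that ends up mapped still depends on the synchronisation discussed earlier in the thread.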
> >>
> >> > Thanks,
> >> > Vivek
> >> >
> >> >>
> >> >> > +               return ret;
> >> >> >
> >> >> >         ret = 0;
> >> >> >
> >> >> > @@ -6223,6 +6227,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >> >> >          */
> >> >> >         if (need_wait_lock)
> >> >> >                 folio_wait_locked(folio);
> >> >> > +       if (!ret)
> >> >> > +               mmu_notifier_update_mapping(vma->vm_mm, address,
> >> >> > +                                           pte_pfn(*ptep));
> >> >> >         return ret;
> >> >> >  }
> >> >> >
> >> >> > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> >> >> > index 50c0dde1354f..6421405334b9 100644
> >> >> > --- a/mm/mmu_notifier.c
> >> >> > +++ b/mm/mmu_notifier.c
> >> >> > @@ -441,6 +441,23 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
> >> >> >         srcu_read_unlock(&srcu, id);
> >> >> >  }
> >> >> >
> >> >> > +void __mmu_notifier_update_mapping(struct mm_struct *mm, unsigned long address,
> >> >> > +                                  unsigned long pfn)
> >> >> > +{
> >> >> > +       struct mmu_notifier *subscription;
> >> >> > +       int id;
> >> >> > +
> >> >> > +       id = srcu_read_lock(&srcu);
> >> >> > +       hlist_for_each_entry_rcu(subscription,
> >> >> > +                                &mm->notifier_subscriptions->list, hlist,
> >> >> > +                                srcu_read_lock_held(&srcu)) {
> >> >> > +               if (subscription->ops->update_mapping)
> >> >> > +                       subscription->ops->update_mapping(subscription, mm,
> >> >> > +                                                         address, pfn);
> >> >> > +       }
> >> >> > +       srcu_read_unlock(&srcu, id);
> >> >> > +}
> >> >> > +
> >> >> >  static int mn_itree_invalidate(struct mmu_notifier_subscriptions *subscriptions,
> >> >> >                                const struct mmu_notifier_range *range)
> >> >> >  {
> >> >> > diff --git a/mm/shmem.c b/mm/shmem.c
> >> >> > index 2f2e0e618072..e59eb5fafadb 100644
> >> >> > --- a/mm/shmem.c
> >> >> > +++ b/mm/shmem.c
> >> >> > @@ -77,6 +77,7 @@ static struct vfsmount *shm_mnt;
> >> >> >  #include <linux/fcntl.h>
> >> >> >  #include <uapi/linux/memfd.h>
> >> >> >  #include <linux/rmap.h>
> >> >> > +#include <linux/mmu_notifier.h>
> >> >> >  #include <linux/uuid.h>
> >> >> >
> >> >> >  #include <linux/uaccess.h>
> >> >> > @@ -2164,8 +2165,12 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
> >> >> >                                   gfp, vma, vmf, &ret);
> >> >> >         if (err)
> >> >> >                 return vmf_error(err);
> >> >> > -       if (folio)
> >> >> > +       if (folio) {
> >> >> >                 vmf->page = folio_file_page(folio, vmf->pgoff);
> >> >> > +               if (ret == VM_FAULT_LOCKED)
> >> >> > +                       mmu_notifier_update_mapping(vma->vm_mm, vmf->address,
> >> >> > +                                                   page_to_pfn(vmf->page));
> >> >> > +       }
> >> >> >         return ret;
> >> >> >  }
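
The dispatch loop in the quoted patch, combined with the ksm_update flag proposed earlier in the thread, can be modelled in userspace roughly as below. This is a sketch under stated assumptions, not kernel code: fake_notifier and fake_notifier_ops are hypothetical types, a plain linked list replaces the SRCU-protected hlist, and the two subscribers (one that cares only about KSM updates, one that cares only about the others) are invented to show how the flag would let each side ignore notifications meant for the other.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct fake_notifier;

/* Hypothetical ops table with the proposed extra flag on change_pte. */
struct fake_notifier_ops {
    void (*change_pte)(struct fake_notifier *sub, unsigned long address,
                       unsigned long pte, bool ksm_update);
};

struct fake_notifier {
    const struct fake_notifier_ops *ops;
    struct fake_notifier *next; /* plain list instead of an hlist */
    unsigned int handled;       /* notifications this subscriber acted on */
};

/* Call every subscriber that implements change_pte, as the patch's loop does. */
static void fake_notify_change_pte(struct fake_notifier *head,
                                   unsigned long address, unsigned long pte,
                                   bool ksm_update)
{
    struct fake_notifier *sub;

    for (sub = head; sub != NULL; sub = sub->next) {
        if (sub->ops->change_pte)
            sub->ops->change_pte(sub, address, pte, ksm_update);
    }
}

/* Hypothetical subscriber that only acts on KSM-triggered updates. */
static void ksm_only_change_pte(struct fake_notifier *sub, unsigned long address,
                                unsigned long pte, bool ksm_update)
{
    (void)address;
    (void)pte;
    if (ksm_update)
        sub->handled++;
}

/* Hypothetical subscriber that only acts on the new, non-KSM updates. */
static void fault_only_change_pte(struct fake_notifier *sub, unsigned long address,
                                  unsigned long pte, bool ksm_update)
{
    (void)address;
    (void)pte;
    if (!ksm_update)
        sub->handled++;
}

static const struct fake_notifier_ops ksm_only_ops = { .change_pte = ksm_only_change_pte };
static const struct fake_notifier_ops fault_only_ops = { .change_pte = fault_only_change_pte };
static const struct fake_notifier_ops no_ops = { .change_pte = NULL };

/* Returns ksm_user.handled * 10 + fault_user.handled after three events. */
static int demo_flag_dispatch(void)
{
    struct fake_notifier silent = { .ops = &no_ops, .next = NULL, .handled = 0 };
    struct fake_notifier fault_user = { .ops = &fault_only_ops, .next = &silent, .handled = 0 };
    struct fake_notifier ksm_user = { .ops = &ksm_only_ops, .next = &fault_user, .handled = 0 };

    fake_notify_change_pte(&ksm_user, 0x1000, 0xabc, true);  /* KSM-style update */
    fake_notify_change_pte(&ksm_user, 0x2000, 0xdef, false); /* fault-style update */
    fake_notify_change_pte(&ksm_user, 0x3000, 0x123, false); /* fault-style update */

    assert(silent.handled == 0); /* no callback, never invoked */
    return (int)(ksm_user.handled * 10 + fault_user.handled);
}
```

The trade-off this makes visible is that every subscriber still pays for the indirect call on each notification and filters afterwards, whereas a separate notifier op would let uninterested subscribers be skipped by the NULL check alone.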