On Mon, 4 May 2020 17:01:23 -0300
Jason Gunthorpe <jgg@xxxxxxxx> wrote:

> On Mon, May 04, 2020 at 01:35:52PM -0600, Alex Williamson wrote:
>
> > Ok, this all makes a lot more sense with memory_lock still in the
> > picture.  And it looks like you're not insisting on the wait_event,
> > we can block on memory_lock so long as we don't have an ordering
> > issue.  I'll see what I can do.  Thanks,
>
> Right, you can block on the rwsem if it is ordered properly vs
> mmap_sem.

This is what I've come up with, please see if you agree with the
logic:

void vfio_pci_zap_and_down_write_memory_lock(struct vfio_pci_device *vdev)
{
        struct vfio_pci_mmap_vma *mmap_vma, *tmp;

        /*
         * Lock ordering:
         * vma_lock is nested under mmap_sem for vm_ops callback paths.
         * The memory_lock semaphore is used by both code paths calling
         * into this function to zap vmas and the vm_ops.fault callback
         * to protect the memory enable state of the device.
         *
         * When zapping vmas we need to maintain the mmap_sem => vma_lock
         * ordering, which requires using vma_lock to walk vma_list to
         * acquire an mm, then dropping vma_lock to get the mmap_sem and
         * reacquiring vma_lock.  This logic is derived from similar
         * requirements in uverbs_user_mmap_disassociate().
         *
         * mmap_sem must always be the top-level lock when it is taken.
         * Therefore we can only hold the memory_lock write lock when
         * vma_list is empty, as we'd need to take mmap_sem to clear
         * entries.  vma_list can only be guaranteed empty when holding
         * vma_lock, thus memory_lock is nested under vma_lock.
         *
         * This enables the vm_ops.fault callback to acquire vma_lock,
         * followed by memory_lock read lock, while already holding
         * mmap_sem without risk of deadlock.
         */
        while (1) {
                struct mm_struct *mm = NULL;

                /* Find an mm to zap; prune entries whose mm is going away */
                mutex_lock(&vdev->vma_lock);
                while (!list_empty(&vdev->vma_list)) {
                        mmap_vma = list_first_entry(&vdev->vma_list,
                                                    struct vfio_pci_mmap_vma,
                                                    vma_next);
                        mm = mmap_vma->vma->vm_mm;
                        if (mmget_not_zero(mm))
                                break;

                        list_del(&mmap_vma->vma_next);
                        kfree(mmap_vma);
                        mm = NULL;
                }
                if (!mm)
                        break;
                mutex_unlock(&vdev->vma_lock);

                /* Re-take locks in mmap_sem => vma_lock order to zap */
                down_read(&mm->mmap_sem);
                if (mmget_still_valid(mm)) {
                        mutex_lock(&vdev->vma_lock);
                        list_for_each_entry_safe(mmap_vma, tmp,
                                                 &vdev->vma_list, vma_next) {
                                struct vm_area_struct *vma = mmap_vma->vma;

                                if (vma->vm_mm != mm)
                                        continue;

                                list_del(&mmap_vma->vma_next);
                                kfree(mmap_vma);

                                zap_vma_ptes(vma, vma->vm_start,
                                             vma->vm_end - vma->vm_start);
                        }
                        mutex_unlock(&vdev->vma_lock);
                }
                up_read(&mm->mmap_sem);
                mmput(mm);
        }

        /* vma_list now empty with vma_lock held; safe to take memory_lock */
        down_write(&vdev->memory_lock);
        mutex_unlock(&vdev->vma_lock);
}

As noted in the comment, the fault handler can simply do:

        mutex_lock(&vdev->vma_lock);
        down_read(&vdev->memory_lock);

This should be deadlock free now, so we can drop the retry handling.

Paths needing to acquire memory_lock with vmas zapped (device reset,
memory enable bit *->0 transition) call this function, perform their
operation, then simply release with up_write(&vdev->memory_lock).
Both the read and write versions of acquiring memory_lock can still
occur outside this function for operations that don't require flushing
all vmas or otherwise touch vma_lock or mmap_sem (ex. read/write,
MSI-X vector table access, writing *->1 to the memory enable bit).

I still need to work on the bus reset path, as acquiring memory_lock
write locks across multiple devices seems to require try-lock
behavior, which is clearly complicated, or at least messy, in the
above function.

Does this seem like it's going in a reasonable direction?  Thanks,

Alex
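
P.S. To make the fault-handler side concrete, here's an untested
sketch of what the vm_ops.fault callback could look like under this
scheme.  __vfio_pci_memory_enabled() is only an assumed helper for
testing the memory enable state, and the insertion of the vma into
vma_list for tracking is elided:

static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
{
        struct vm_area_struct *vma = vmf->vma;
        struct vfio_pci_device *vdev = vma->vm_private_data;
        vm_fault_t ret = VM_FAULT_NOPAGE;

        /* mmap_sem already held; vma_lock and memory_lock nest under it */
        mutex_lock(&vdev->vma_lock);
        down_read(&vdev->memory_lock);

        if (!__vfio_pci_memory_enabled(vdev)) {
                ret = VM_FAULT_SIGBUS;
                goto up_out;
        }

        /* vma_list tracking insertion elided for brevity */

        if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
                            vma->vm_end - vma->vm_start, vma->vm_page_prot))
                ret = VM_FAULT_SIGBUS;

up_out:
        up_read(&vdev->memory_lock);
        mutex_unlock(&vdev->vma_lock);
        return ret;
}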
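
A caller on the memory enable bit *->0 path would then be shaped
roughly like this (the function name and surrounding config write
plumbing are hypothetical, not part of the above):

static void vfio_pci_cmd_mem_write(struct vfio_pci_device *vdev,
                                   u16 old_cmd, u16 new_cmd)
{
        if ((old_cmd & PCI_COMMAND_MEMORY) && !(new_cmd & PCI_COMMAND_MEMORY))
                /* Disabling memory: zap vmas and take the write lock */
                vfio_pci_zap_and_down_write_memory_lock(vdev);
        else
                down_write(&vdev->memory_lock);

        /* ... perform the emulated/physical PCI_COMMAND write ... */

        up_write(&vdev->memory_lock);
}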
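
And for the bus reset problem, the try-lock shape I have in mind is
roughly the following -- only an illustration of unwinding on trylock
failure, with the per-device vma zapping that would need to happen
first still unsolved and elided here:

static bool vfio_pci_trylock_mem_all(struct vfio_pci_device **vdevs, int n)
{
        int i;

        for (i = 0; i < n; i++) {
                if (!down_write_trylock(&vdevs[i]->memory_lock)) {
                        /* Unwind held locks; caller can retry or fail */
                        while (i--)
                                up_write(&vdevs[i]->memory_lock);
                        return false;
                }
        }
        return true;
}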