On Mon, 4 May 2020 15:44:36 -0300 Jason Gunthorpe <jgg@xxxxxxxx> wrote: > On Mon, May 04, 2020 at 12:26:43PM -0600, Alex Williamson wrote: > > On Fri, 1 May 2020 20:48:49 -0300 > > Jason Gunthorpe <jgg@xxxxxxxx> wrote: > > > > > On Fri, May 01, 2020 at 03:39:30PM -0600, Alex Williamson wrote: > > > > > > > static int vfio_pci_add_vma(struct vfio_pci_device *vdev, > > > > struct vm_area_struct *vma) > > > > { > > > > @@ -1346,15 +1450,49 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf) > > > > { > > > > struct vm_area_struct *vma = vmf->vma; > > > > struct vfio_pci_device *vdev = vma->vm_private_data; > > > > + vm_fault_t ret = VM_FAULT_NOPAGE; > > > > > > > > - if (vfio_pci_add_vma(vdev, vma)) > > > > - return VM_FAULT_OOM; > > > > + /* > > > > + * Zap callers hold memory_lock and acquire mmap_sem, we hold > > > > + * mmap_sem and need to acquire memory_lock to avoid races with > > > > + * memory bit settings. Release mmap_sem, wait, and retry, or fail. > > > > + */ > > > > + if (unlikely(!down_read_trylock(&vdev->memory_lock))) { > > > > + if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) { > > > > + if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT) > > > > + return VM_FAULT_RETRY; > > > > + > > > > + up_read(&vma->vm_mm->mmap_sem); > > > > + > > > > + if (vmf->flags & FAULT_FLAG_KILLABLE) { > > > > + if (!down_read_killable(&vdev->memory_lock)) > > > > + up_read(&vdev->memory_lock); > > > > + } else { > > > > + down_read(&vdev->memory_lock); > > > > + up_read(&vdev->memory_lock); > > > > + } > > > > + return VM_FAULT_RETRY; > > > > + } > > > > + return VM_FAULT_SIGBUS; > > > > + } > > > > > > So, why have the wait? It isn't reliable - if this gets faulted from a > > > call site that can't handle retry then it will SIGBUS anyhow? > > > > Do such call sites exist? My assumption was that half of the branch > > was unlikely to ever occur. > > hmm_range_fault() for instance doesn't set ALLOW_RETRY, I assume there > are enough other case to care about, but am not so sure > > > > The weird use of a rwsem as a completion suggest that perhaps using > > > wait_event might improve things: > > > > > > disable: > > > // Clean out the vma list with zap, then: > > > > > > down_read(mm->mmap_sem) > > > > I assume this is simplifying the dance we do in zapping to first take > > vma_lock in order to walk vma_list, to find a vma from which we can > > acquire the mm, drop vma_lock, get mmap_sem, then re-get vma_lock > > below. > > No, that has to stay.. Sorry, I stated that unclearly, I'm assuming we keep that and it's been omitted from this pseudo code for simplicity. > > Also accounting that vma_list might be empty and we might need > > to drop and re-acquire vma_lock to get to another mm, so we really > > probably want to set pause_faults at the start rather than at the end. > > New vmas should not created/faulted while vma_lock is held, so the > order shouldn't matter.. Technically that's true, but if vfio_pci_zap_mmap_vmas() drops vma_lock to go back and get another mm, then vm_ops.fault() could get another vma into the list while we're trying to zap and clear them all. The result is the same, but we might be doing unnecessary work versus holding off the fault from the start. > > > mutex_lock(vma_lock); > > > list_for_each_entry_safe() > > > // zap and remove all vmas > > > > > > pause_faults = true; > > > mutex_write(vma_lock); > > > > > > fault: > > > // Already have down_read(mmap_sem) > > > mutex_lock(vma_lock); > > > while (pause_faults) { > > > mutex_unlock(vma_lock) > > > wait_event(..., !pause_faults) > > > mutex_lock(vma_lock) > > > } > > > > Nit, we need to test the memory enable bit setting somewhere under this > > lock since it seems to be the only thing protecting it now. > > I was thinking you'd keep the same locking for the memory enable bit, > the pause_faults is a shadow of that bit with locking connected to > vma_lock.. Oh! I totally did not get that! > > > list_add() > > > remap_pfn() > > > mutex_unlock(vma_lock) > > > > The read and write file ops would need similar mechanisms. > > Keep using the rwsem? > > > > enable: > > > pause_faults = false > > > wake_event() > > > > Hmm, vma_lock was dropped above and not re-acquired here. > > I was thinking this would be under a continous rwlock > > > I'm not sure if it was an oversight that pause_faults was not tested > > in the disable path, but this combination appears to lead to > > concurrent writers and serialized readers?? > > ? pause_faults only exists to prevent the vm_ops fault callback from > progressing to a fault. I don't think any concurrancy is lost > > > > The only requirement here is that while inside the write side of > > > memory_lock you cannot touch user pages (ie no copy_from_user/etc) > > > > I'm lost at this statement, I can only figure the above works if we > > remove memory_lock. Are you referring to a different lock? Thanks, > > No > > This is just an approach to avoid the ABBA deadlock problem when using > a rwsem by using a looser form of lock combined witih the already > correctly nested vma_lock. > > Stated another way, you can keep the existing memory_lock as is, if it > is structured like this: > > disable: > down_read(mmap_sem) > mutex_lock(vma_lock) > list_for_each_entry_safe() > // zap and remove all vmas > down_write(memory_lock) // Now inside vma_lock! > mutex_unlock(vma_lock) > up_read(mmap_sem > > [ do the existing stuff under memory_lock ] > > > fault: > mutex_lock(vma_lock) > down_write(memory_lock) > remap_pfn > up_write(memory_lock) > mutex_unlock(vma_lock) > > enable: > up_write(memory_lock) > > Ie the key is to organize things to move the down_write(memory_lock) > to be under the mmap_sem/vma_lock Ok, this all makes a lot more sense with memory_lock still in the picture. And it looks like you're not insisting on the wait_event, we can block on memory_lock so long as we don't have an ordering issue. I'll see what I can do. Thanks, Alex