On Mon, 28 Jun 2021 15:52:42 -0300
Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:

> On Mon, Jun 28, 2021 at 12:36:21PM -0600, Alex Williamson wrote:
> > On Mon, 28 Jun 2021 14:30:28 -0300
> > Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> > 
> > > On Mon, Jun 28, 2021 at 10:46:53AM -0600, Alex Williamson wrote:
> > > > On Wed, 10 Mar 2021 11:58:07 -0700
> > > > Alex Williamson <alex.williamson@xxxxxxxxxx> wrote:
> > > > 
> > > > > vfio_pci_mmap_fault() incorrectly makes use of io_remap_pfn_range()
> > > > > from within a vm_ops fault handler. This function will trigger a
> > > > > BUG_ON if it encounters a populated pte within the remapped range,
> > > > > where any fault is meant to populate the entire vma. Concurrent
> > > > > inflight faults to the same vma will therefore hit this issue,
> > > > > triggering traces such as:
> > > 
> > > If it is just about concurrancy can the vma_lock enclose
> > > io_remap_pfn_range() ?
> > 
> > We could extend vma_lock around io_remap_pfn_range(), but that alone
> > would just block the concurrent faults to the same vma and once we
> > released them they'd still hit the BUG_ON in io_remap_pfn_range()
> > because the page is no longer pte_none(). We'd need to combine that
> > with something like __vfio_pci_add_vma() returning -EEXIST to skip the
> > io_remap_pfn_range(), but I've been advised that we shouldn't be
> > calling io_remap_pfn_range() from within the fault handler anyway, we
> > should be using something like vmf_insert_pfn() instead, which I
> > understand can be called safely in the same situation. That's rather
> > the testing I was hoping someone who reproduced the issue previously
> > could validate.
> 
> Yes, using the vmf_ stuff is 'righter' for sure, but there isn't
> really a vmf for IO mappings..
> 
> > > I assume there is a reason why vm_lock can't be used here, so I
> > > wouldn't object, though I don't especially like the loss of tracking
> > > either.
> > 
> > There's no loss of tracking here, we were only expecting a single fault
> > per vma to add the vma to our list. This just skips adding duplicates
> > in these cases where we can have multiple faults in-flight. Thanks,
> 
> I mean the arch tracking of IO maps that is hidden inside ioremap_pfn

Ok, so I take it you'd feel more comfortable with something like this,
right? Thanks,

Alex

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 759dfb118712..74fc66cf9cf4 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1584,6 +1584,7 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct vfio_pci_device *vdev = vma->vm_private_data;
+	struct vfio_pci_mmap_vma *mmap_vma;
 	vm_fault_t ret = VM_FAULT_NOPAGE;
 
 	mutex_lock(&vdev->vma_lock);
@@ -1591,24 +1592,33 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 
 	if (!__vfio_pci_memory_enabled(vdev)) {
 		ret = VM_FAULT_SIGBUS;
-		mutex_unlock(&vdev->vma_lock);
 		goto up_out;
 	}
 
-	if (__vfio_pci_add_vma(vdev, vma)) {
-		ret = VM_FAULT_OOM;
-		mutex_unlock(&vdev->vma_lock);
-		goto up_out;
+	/*
+	 * Skip existing vmas, assume concurrent in-flight faults to avoid
+	 * BUG_ON from io_remap_pfn_range() hitting !pte_none() pages.
+	 */
+	list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
+		if (mmap_vma->vma == vma)
+			goto up_out;
 	}
 
-	mutex_unlock(&vdev->vma_lock);
-
 	if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
-			       vma->vm_end - vma->vm_start, vma->vm_page_prot))
+			       vma->vm_end - vma->vm_start,
+			       vma->vm_page_prot)) {
 		ret = VM_FAULT_SIGBUS;
+		goto up_out;
+	}
+
+	if (__vfio_pci_add_vma(vdev, vma)) {
+		ret = VM_FAULT_OOM;
+		zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
+	}
 
 up_out:
 	up_read(&vdev->memory_lock);
+	mutex_unlock(&vdev->vma_lock);
 	return ret;
 }
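
For what it's worth, a vmf_insert_pfn() variant might look roughly like the
sketch below. It's untested and only meant to illustrate the shape of it: it
fills just the faulting page rather than populating the whole vma, it drops
the vma_list/vma_lock tracking entirely, and it assumes vma->vm_pgoff already
holds the base BAR pfn, exactly as the io_remap_pfn_range() call above
already relies on:

static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct vfio_pci_device *vdev = vma->vm_private_data;
	/* vm_pgoff was rewritten to the BAR pfn in the mmap path */
	unsigned long pfn = vma->vm_pgoff +
			    ((vmf->address - vma->vm_start) >> PAGE_SHIFT);
	vm_fault_t ret;

	down_read(&vdev->memory_lock);

	if (!__vfio_pci_memory_enabled(vdev)) {
		ret = VM_FAULT_SIGBUS;
		goto out;
	}

	/*
	 * vmf_insert_pfn() simply returns VM_FAULT_NOPAGE if the pte is
	 * already populated, so concurrent faults on the same vma fall
	 * out naturally, unlike the io_remap_pfn_range() BUG_ON.
	 */
	ret = vmf_insert_pfn(vma, vmf->address, pfn);
out:
	up_read(&vdev->memory_lock);
	return ret;
}

The trade-off is that each page then takes its own fault instead of the
whole BAR being mapped on the first access, which is why I'd still want
someone who could reproduce the original BUG_ON to exercise it.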