On Tue, 4 Jun 2024 12:26:29 +0800
Yan Zhao <yan.y.zhao@xxxxxxxxx> wrote:

> On Fri, May 31, 2024 at 11:18:15PM -0600, Alex Williamson wrote:
> > On Sat, 1 Jun 2024 07:41:27 +0800
> > Yan Zhao <yan.y.zhao@xxxxxxxxx> wrote:
> > 
> > > On Wed, May 29, 2024 at 10:52:31PM -0600, Alex Williamson wrote:
> > > > With the vfio device fd tied to the address space of the pseudo fs
> > > > inode, we can use the mm to track all vmas that might be mmap'ing
> > > > device BARs, which removes our vma_list and all the complicated lock
> > > > ordering necessary to manually zap each related vma.
> > > > 
> > > > Note that we can no longer store the pfn in vm_pgoff if we want to use
> > > > unmap_mapping_range() to zap a selective portion of the device fd
> > > > corresponding to BAR mappings.
> > > > 
> > > > This also converts our mmap fault handler to use vmf_insert_pfn()
> > > > because we no longer have a vma_list to avoid the concurrency problem
> > > > with io_remap_pfn_range().  The goal is to eventually use the vm_ops
> > > > huge_fault handler to avoid the additional faulting overhead, but
> > > > vmf_insert_pfn_{pmd,pud}() need to learn about pfnmaps first.
> > > > 
> > > Do we also consider looped vmf_insert_pfn() in mmap fault handler? e.g.
> > > 
> > > 	for (i = vma->vm_start; i < vma->vm_end; i += PAGE_SIZE) {
> > > 		offset = (i - vma->vm_start) >> PAGE_SHIFT;
> > > 		ret = vmf_insert_pfn(vma, i, base_pfn + offset);
> > > 		if (ret != VM_FAULT_NOPAGE) {
> > > 			zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
> > > 			goto up_out;
> > > 		}
> > > 	}
> What about the prefault version? e.g.
> 
> 	ret = vmf_insert_pfn(vma, vmf->address, pfn + pgoff);
> +	if (ret & VM_FAULT_ERROR)
> +		goto out_disabled;
> +
> +	/* prefault */
> +	for (i = vma->vm_start; i < vma->vm_end; i += PAGE_SIZE, pfn++) {
> +		if (i == vmf->address)
> +			continue;
> +
> +		/* Don't return error on prefault */
> +		if (vmf_insert_pfn(vma, i, pfn) & VM_FAULT_ERROR)
> +			break;
> +	}
> +
>  out_disabled:
> 	up_read(&vdev->memory_lock);
> 
> > We can have concurrent faults, so doing this means that we need to
> > continue to maintain a locked list of vmas that have faulted to avoid
> But it looks like vfio_pci_zap_bars() always zaps full PCI BAR ranges for
> vmas in core_vdev->inode->i_mapping.
> So, it doesn't matter whether a vma is fully mapped or partly mapped?

Yes, but without locking, concurrent faults would all be re-inserting
the pfns concurrently from the fault handler.

> > duplicate insertions and all the nasty lock gymnastics that go along
> > with it.  I'd rather we just support huge_fault.  Thanks,
> huge_fault is better. But before that, is this prefault one good?

It seems like this would still be subject to the race that Jason noted
here[1], i.e., call_mmap() occurs before vma_link_file(), therefore we
need to exclusively rely on fault to populate the vma or we risk a race
with unmap_mapping_range().  Thanks,

Alex

[1]https://lore.kernel.org/all/20240522183046.GG20229@xxxxxxxxxx/
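
For reference, the fault-only scheme the thread converges on reduces to
something like the sketch below: insert just the faulting page under
memory_lock and leave all zapping to unmap_mapping_range() on the device
fd's address_space. This is an illustrative reconstruction, not the
actual patch; vma_to_pfn() is an assumed helper that derives the BAR pfn
from the vma now that vm_pgoff no longer stores it, and the placement of
the __vfio_pci_memory_enabled() check is likewise an assumption.

/*
 * Illustrative sketch only.  With no prefault loop there is nothing to
 * serialize between concurrent faults, since vmf_insert_pfn() tolerates
 * an already-populated pte, and zapping happens centrally through the
 * device fd's address_space.  vma_to_pfn() is an assumed helper that
 * computes the BAR pfn from the vma, because vm_pgoff no longer holds
 * the pfn directly.
 */
static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct vfio_pci_core_device *vdev = vma->vm_private_data;
	unsigned long pgoff = vmf->pgoff - vma->vm_pgoff;
	vm_fault_t ret = VM_FAULT_SIGBUS;

	down_read(&vdev->memory_lock);

	/* Only insert the pfn while BAR access is legal for the device */
	if (__vfio_pci_memory_enabled(vdev))
		ret = vmf_insert_pfn(vma, vmf->address,
				     vma_to_pfn(vma) + pgoff);

	up_read(&vdev->memory_lock);

	return ret;
}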
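The zap side that Yan describes, always covering the full BAR range for
every vma on the inode mapping, can be sketched as below. The offsets
follow the fixed VFIO region layout (index << VFIO_PCI_OFFSET_SHIFT via
the existing VFIO_PCI_INDEX_TO_OFFSET() macro), but the exact body here
is an assumption rather than a quote of the real vfio_pci_zap_bars():

/*
 * Sketch: zap all BAR mmaps for this device in one shot by punching
 * the whole BAR region-offset range out of the pseudo-fs inode's
 * mapping.  No distinction is needed between fully and partly faulted
 * vmas; unmap_mapping_range() walks i_mapping and zaps whatever ptes
 * are present in that file range.
 */
static void vfio_pci_zap_bars(struct vfio_pci_core_device *vdev)
{
	struct vfio_device *core_vdev = &vdev->vdev;
	loff_t start = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX);
	loff_t end = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_ROM_REGION_INDEX);

	unmap_mapping_range(core_vdev->inode->i_mapping, start,
			    end - start, true);
}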