On Tue, Jun 04, 2024 at 08:07:28AM -0600, Alex Williamson wrote:
> On Tue, 4 Jun 2024 12:26:29 +0800
> Yan Zhao <yan.y.zhao@xxxxxxxxx> wrote:
> 
> > On Fri, May 31, 2024 at 11:18:15PM -0600, Alex Williamson wrote:
> > > On Sat, 1 Jun 2024 07:41:27 +0800
> > > Yan Zhao <yan.y.zhao@xxxxxxxxx> wrote:
> > > 
> > > > On Wed, May 29, 2024 at 10:52:31PM -0600, Alex Williamson wrote:
> > > > > With the vfio device fd tied to the address space of the pseudo fs
> > > > > inode, we can use the mm to track all vmas that might be mmap'ing
> > > > > device BARs, which removes our vma_list and all the complicated lock
> > > > > ordering necessary to manually zap each related vma.
> > > > > 
> > > > > Note that we can no longer store the pfn in vm_pgoff if we want to use
> > > > > unmap_mapping_range() to zap a selective portion of the device fd
> > > > > corresponding to BAR mappings.
> > > > > 
> > > > > This also converts our mmap fault handler to use vmf_insert_pfn()
> > > > > because we no longer have a vma_list to avoid the concurrency problem
> > > > > with io_remap_pfn_range().  The goal is to eventually use the vm_ops
> > > > > huge_fault handler to avoid the additional faulting overhead, but
> > > > > vmf_insert_pfn_{pmd,pud}() need to learn about pfnmaps first.
> > > > > 
> > > > Do we also consider looped vmf_insert_pfn() in mmap fault handler? e.g.
> > > > 
> > > > for (i = vma->vm_start; i < vma->vm_end; i += PAGE_SIZE) {
> > > > 	offset = (i - vma->vm_start) >> PAGE_SHIFT;
> > > > 	ret = vmf_insert_pfn(vma, i, base_pfn + offset);
> > > > 	if (ret != VM_FAULT_NOPAGE) {
> > > > 		zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
> > > > 		goto up_out;
> > > > 	}
> > > > }
> > 
> > What about the prefault version? e.g.
> > 
> > 	ret = vmf_insert_pfn(vma, vmf->address, pfn + pgoff);
> > +	if (ret & VM_FAULT_ERROR)
> > +		goto out_disabled;
> > +
> > +	/* prefault */
> > +	for (i = vma->vm_start; i < vma->vm_end; i += PAGE_SIZE, pfn++) {
> > +		if (i == vmf->address)
> > +			continue;
> > +
> > +		/* Don't return error on prefault */
> > +		if (vmf_insert_pfn(vma, i, pfn) & VM_FAULT_ERROR)
> > +			break;
> > +	}
> > +
> > out_disabled:
> > 	up_read(&vdev->memory_lock);
> > 
> > > We can have concurrent faults, so doing this means that we need to
> > > continue to maintain a locked list of vmas that have faulted to avoid
> > 
> > But looks like vfio_pci_zap_bars() always zaps full PCI BAR ranges for
> > vmas in core_vdev->inode->i_mapping.
> > So, it doesn't matter whether a vma is fully mapped or partly mapped?
> 
> Yes, but without locking concurrent faults would all be re-inserting
> the pfns concurrently from the fault handler.

insert_pfn() takes the page table lock via get_locked_pte() and returns
VM_FAULT_NOPAGE directly, without re-inserting, if it finds !pte_none(),
right?

> > > duplicate insertions and all the nasty lock gymnastics that go along
> > > with it.  I'd rather we just support huge_fault.  Thanks,
> > 
> > huge_fault is better. But before that, is this prefault one good?
> 
> It seems like this would still be subject to the race that Jason noted
> here[1], ie. call_mmap() occurs before vma_link_file(), therefore we
> need to exclusively rely on fault to populate the vma or we risk a race
> with unmap_mapping_range().  Thanks,

Thanks. But the proposed prefault loop also runs in the fault handler,
i.e. after vma_link_file().
Similar implementations can be found in
drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c and
drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c.
A consolidated sketch of such a handler is appended at the end of this
mail.
> 
> [1]https://lore.kernel.org/all/20240522183046.GG20229@xxxxxxxxxx/
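
For reference, here is a minimal sketch of how the prefault loop could
sit in a complete fault handler, based on the snippet quoted above.
vdev, memory_lock, pm_runtime_engaged and __vfio_pci_memory_enabled()
follow vfio-pci-core; vma_to_pfn() is a hypothetical helper returning
the BAR pfn backing vma->vm_start, not a claim about the final patch:

static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct vfio_pci_core_device *vdev = vma->vm_private_data;
	unsigned long addr, pgoff;
	unsigned long pfn = vma_to_pfn(vma);	/* hypothetical helper */
	vm_fault_t ret = VM_FAULT_SIGBUS;

	/* Page offset of the faulting address within the vma */
	pgoff = (vmf->address - vma->vm_start) >> PAGE_SHIFT;

	down_read(&vdev->memory_lock);

	if (vdev->pm_runtime_engaged || !__vfio_pci_memory_enabled(vdev))
		goto out_unlock;

	/* Populate the faulting page first so its status is reported. */
	ret = vmf_insert_pfn(vma, vmf->address, pfn + pgoff);
	if (ret & VM_FAULT_ERROR)
		goto out_unlock;

	/*
	 * Prefault the remainder of the vma.  Concurrent faults are
	 * harmless: insert_pfn() takes the PTE lock via get_locked_pte()
	 * and bails out on a !pte_none() entry rather than re-inserting.
	 * Errors here are ignored; an unpopulated page simply faults
	 * again on access.
	 */
	for (addr = vma->vm_start; addr < vma->vm_end;
	     addr += PAGE_SIZE, pfn++) {
		if (addr == vmf->address)
			continue;
		if (vmf_insert_pfn(vma, addr, pfn) & VM_FAULT_ERROR)
			break;
	}

out_unlock:
	up_read(&vdev->memory_lock);
	return ret;
}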