On Wed, Jul 17, 2024 at 10:15:24PM +0800, Peter Xu wrote:
> On Wed, Jul 17, 2024 at 09:38:34AM +0800, Yan Zhao wrote:
> > On Tue, Jul 16, 2024 at 03:01:50PM -0400, Peter Xu wrote:
> > > On Tue, Jul 16, 2024 at 05:13:29PM +0800, Yan Zhao wrote:
> > > > On Mon, Jul 15, 2024 at 10:29:59PM +0800, Peter Xu wrote:
> > > > > On Mon, Jul 15, 2024 at 03:08:58PM +0800, Yan Zhao wrote:
> > > > > > On Fri, Jul 12, 2024 at 10:42:44AM -0400, Peter Xu wrote:
...
> > > > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > > > index 4bcd79619574..f57cc304b318 100644
> > > > > > > --- a/mm/memory.c
> > > > > > > +++ b/mm/memory.c
> > > > > > > @@ -1827,9 +1827,6 @@ static void unmap_single_vma(struct mmu_gather *tlb,
> > > > > > >          if (vma->vm_file)
> > > > > > >                  uprobe_munmap(vma, start, end);
> > > > > > >
> > > > > > > -        if (unlikely(vma->vm_flags & VM_PFNMAP))
> > > > > > > -                untrack_pfn(vma, 0, 0, mm_wr_locked);
> > > > > > > -
> > > > > > Specifically for VFIO's case, it looks like it doesn't matter whether
> > > > > > untrack_pfn() is called here, since remap_pfn_range() is not called in
> > > > > > mmap() or the fault handler, and therefore (vma->vm_flags & VM_PAT) is
> > > > > > always 0.
> > > > >
> > > > > Right with the current repo, but I'm thinking maybe we should have VM_PAT
> > > > > there..
> > > > Yes, I agree.
> > > >
> > > > But currently VFIO cannot call io_remap_pfn_range() in the fault handler,
> > > > since vm_flags_set() requires the mmap lock held for write while the
> > > > fault handler can only hold the mmap lock for read.
> > > > So it relies on ioremap()/iounmap() to reserve/de-reserve memtypes,
> > > > without VM_PAT being set in the vma.
> > >
> > > Right, neither vm_flags_set() nor io_remap_pfn_range() should be called in
> > > a fault handler.  They should only be called in the mmap() phase.
> > >
> > > When you mentioned ioremap(), it's only for the VGA device, right?  I don't
> > > see it used in vfio-pci's fault handler.
> > Oh, it's pci_iomap() in vfio-pci's fault handler, another version of
> > ioremap() :)
>
> Right.  To be explicit, I think it's in mmap(), and it looks like Alex has a
Yes, in mmap(). (The "fault handler" in the previous reply was a typo :))

> comment for that:
>
>         /*
>          * Even though we don't make use of the barmap for the mmap,
>          * we need to request the region and the barmap tracks that.
>          */
>
> Maybe in 2012 that was a must?
I think ioremap()/pci_iomap() in mmap() is not a must.
At least it's not there in nvgrace_gpu_mmap().

But nvgrace_gpu_mmap() has remap_pfn_range(), and Alex explained that
nvgrace-gpu "only exists on ARM platforms. memtype_reserve() only exists on
x86." [1].

[1] https://lore.kernel.org/all/20240529105012.39756143.alex.williamson@xxxxxxxxxx/

> PS: when looking, that looks like a proper place to reuse
> vfio_pci_core_setup_barmap() also in the mmap() function.
Yes, looks so.

>
> >
> > >
> > > > >
> > > > > See what reserve_pfn_range() does right now: it'll make sure only one owner
> > > > > reserves this area, e.g. memtype_reserve() will fail if it has already been
> > > > > reserved.  Then, when it succeeds as the first one to reserve the region,
> > > > > it'll make sure this memtype is also synchronized to the kernel map
> > > > > (memtype_kernel_map_sync).
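For my own understanding, the reservation flow described above is roughly the
following (a simplified sketch based on my reading of arch/x86/mm/pat/memtype.c,
written from memory, so the real code and exact signatures differ; error
handling and the PAT-disabled paths are omitted):

        /*
         * Sketch only: how one caller ends up owning the memtype for a
         * physical range, and how the kernel identity map is kept in sync.
         */
        static int reserve_pfn_range_sketch(u64 paddr, unsigned long size,
                                            pgprot_t *vma_prot)
        {
                enum page_cache_mode want = pgprot2cachemode(*vma_prot);
                enum page_cache_mode got;

                /*
                 * Fails if [paddr, paddr + size) was already reserved with an
                 * incompatible memtype; otherwise records the reservation and
                 * returns the type actually granted via 'got'.
                 */
                if (memtype_reserve(paddr, paddr + size, want, &got))
                        return -EINVAL;

                /* The caller may have to downgrade its pgprot to the existing type. */
                if (got != want)
                        *vma_prot = __pgprot((pgprot_val(*vma_prot) &
                                              ~_PAGE_CACHE_MASK) |
                                             cachemode2protval(got));

                /* Keep the kernel identity mapping consistent with 'got'. */
                return memtype_kernel_map_sync(paddr, size, got);
        }

So whichever path reserves first (ioremap()/pci_iomap() or a future
track_pfn_remap() in vfio-pci's mmap()) becomes the owner, and later callers
must ask for a compatible memtype.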
> > > > >
> > > > > So I feel like when MMIO is disabled for a VFIO card, we may need to call
> > > > > reserve_pfn_range(), telling the kernel the memtype VFIO uses, even if the
> > > > > pgtable will be empty, and even if it's always UC- for now:
> > > > >
> > > > > vfio_pci_core_mmap():
> > > > >         vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> > > > >
> > > > > Then the memtype will be reserved even if it cannot be used.  Otherwise I
> > > > > don't know whether there are side effects in the kernel identity mapping
> > > > > where it mismatches later with what's mapped in userspace via the vma when
> > > > > MMIO gets enabled again: the pgtable starts to be injected with a different
> > > > > memtype that is specified only in the vma's pgprot.
> > > > Current VFIO relies on pci_iomap() to reserve MMIO pfns as UC-, thus
> > > > no VM_PAT in vmas.
> > > >
> > > > Calling remap_pfn_range() in mmap() will cause problems for supporting
> > > > dynamic faulting.  It looks like there's still a window even with an
> > > > immediate unmap after mmap().
> > >
> > > It can be conditionally injected.  See Alex's commit (not yet posted
> > > anywhere, only used in our internal testing so far):
> > >
> > > https://github.com/xzpeter/linux/commit/f432e0e46c34e493943034be4cb9d6eb638f57d1
> > >
> > I'm a bit confused by the code.
> > It looks like it will do the remap_pfn_range() in mmap() if the hardware is
> > ready, and will just set vma flags if the hardware is not ready.
> >
> > But if the hardware is not ready in mmap(), which code in the fault handler
> > will reserve pfn memtypes?
>
> I thought I answered that below.. :) I think we should use track_pfn_remap().
Ok. Then if we have two sets of pfns, we can
1. Call remap_pfn_range() in mmap() for pfn set 1.
2. Export track_pfn_remap() and call track_pfn_remap() in mmap() for pfn set 2.
3. Unmap and call vmf_insert_pfn() in the fault handler to map pfn set 2.

However, I'm not sure if we can properly untrack both pfn sets 1 and 2.
By exporting untrack_pfn() too?

Or will you leave VFIO to use ioremap()/iounmap() (or their variants) to
reserve the memtype by itself?

>
> >
> > >
> > > > Also, calling remap_pfn_range() in mmap() may not meet the requirement of
> > > > drivers to dynamically switch backend resources, e.g. as in
> > > > cxl_mmap_fault(), since one remap_pfn_range() cannot reserve memtypes for
> > > > all backend resources at once.
> > > >
> > > > So, is there any way for the driver to reserve memtypes and set VM_PAT in
> > > > the fault handler?
> > >
> > > I must confess I'm not familiar with memtype stuff, and just started
> > > looking recently.
> > >
> > > Per my reading so far, we have multiple ways of doing memtype reservations,
> > > and no matter which API we use (ioremap / track_pfn_remap / pci_iomap), the
> > > same memtype needs to be used, otherwise the reservation will fail.  Here
> > > ioremap / pci_iomap involve one more vmap so that the virtual mapping is
> > > already present after the API returns.
> > Right, I understand it the same way :)
> >
> > >
> > > Then here IMHO track_pfn_remap() is the one that we should still use for
> > > page-level mmap() controls, because so far we don't want to map any MMIO
> > > range yet; however, we want to say "hey, this VMA maps something special"
> > > by reserving the memtype and setting VM_PAT.  We want the virtual mapping
> > > only later, during a fault().
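If I read that right, the fault-handler side would then be a plain
vmf_insert_pfn(), since the memtype was already reserved (and VM_PAT set) at
mmap() time and vmf_insert_pfn() only looks it up.  A rough sketch of what I
have in mind, where vfio_vma_to_pfn() is a hypothetical helper that translates
the faulting address into the backing BAR pfn (not an existing function):

        static vm_fault_t vfio_pci_mmap_fault_sketch(struct vm_fault *vmf)
        {
                struct vm_area_struct *vma = vmf->vma;
                struct vfio_pci_core_device *vdev = vma->vm_private_data;
                unsigned long pfn;

                if (!__vfio_pci_memory_enabled(vdev))
                        return VM_FAULT_SIGBUS;

                /* Hypothetical helper: map the faulting address to a BAR pfn. */
                pfn = vfio_vma_to_pfn(vma, vmf->address);

                /*
                 * No memtype reservation here: track_pfn_remap() would have
                 * done that in mmap(), and vmf_insert_pfn() only looks the
                 * reserved type up.
                 */
                return vmf_insert_pfn(vma, vmf->address, pfn);
        }

Is that roughly what you have in mind?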
> > >
> > > In short, it looks to me like the right thing to do is to manually invoke
> > > track_pfn_remap() in vfio-pci's mmap() hook.
> > >
> > >         if (!vdev->pm_runtime_engaged && __vfio_pci_memory_enabled(vdev))
> > >                 ret = remap_pfn_range(vma, vma->vm_start, pfn,
> > >                                       req_len, vma->vm_page_prot);
> > >         else
> > >                 // TODO: manually invoke track_pfn_remap() here
> > >                 vm_flags_set(vma, VM_IO | VM_PFNMAP |
> > >                              VM_DONTEXPAND | VM_DONTDUMP);
> > What if we have to remap another set of pfns in the "else" case?
>
> Use vmf_insert_pfn*(): these helpers do not reserve the memtype but only
> look it up.
>
> >
> > >
> > > Then the vma is registered, and untrack_pfn() should be automatically done
> > > when the vma unmaps (and that relies on this patch to not do that too early).
> > Well, I'm concerned about one use case.
> >
> > 1. The driver calls remap_pfn_range() to reserve the memtype for pfn1 and
> >    add the VM_PAT flag.
> > 2. Then unmap_single_vma() is called. With this patch, the pfn1 memtype is
> >    still reserved.
> > 3. The fault handler calls vmf_insert_pfn() for another pfn2.
> > 4. unmap_vmas() is called. However, untrack_pfn() will only find pfn2
> >    instead of the previous pfn1.
>
> It depends on what your exact question is.  If it's about the pgtable: I don't
> think munmap() requires PFNMAP pgtables to always exist, and it'll simply
> skip none entries.
>
> And if it's about PAT tracking: that is exactly what I'm talking about
> below.. where I think untracking shouldn't rely on pgtables.
> I'll copy you too if I prepare some patches for the latter.  With that
> patchset (plus this patch) it should fix David Wang's issue and similar
> ones, AFAICT.
Thank you!

>
> >
> > >
> > > From that POV, I still think this patch does the right thing and should be
> > > merged, even if we probably have one more issue to fix as David Wang
> > > reported..
> > >
> > > I'll need to think about how to do that right, as I think that'll be needed
> > > as long as pfnmaps support fault()s: it means at munmap() time the pgtable
> > > may not be present, and PAT cannot rely on walking the pgtable to know
> > > the base PFN anymore.
> > >
> > > Thanks,
> > >
> > > --
> > > Peter Xu
> > >
> >
>
> --
> Peter Xu
>
>