Re: [PATCH] mm/x86/pat: Only untrack the pfn range if unmap region

David Hildenbrand <david@xxxxxxxxxx> · Wed, 17 Jul 2024 16:17:12 +0200

On 16.07.24 21:01, Peter Xu wrote:
On Tue, Jul 16, 2024 at 05:13:29PM +0800, Yan Zhao wrote:
On Mon, Jul 15, 2024 at 10:29:59PM +0800, Peter Xu wrote:
On Mon, Jul 15, 2024 at 03:08:58PM +0800, Yan Zhao wrote:
On Fri, Jul 12, 2024 at 10:42:44AM -0400, Peter Xu wrote:
This patch is one patch of an old series [1] that got reposted standalone
here, with the hope to fix some reported untrack_pfn() issues reported
recently [2,3], where there used to be other fix [4] but unfortunately
which looks like to cause other issues.  The hope is this patch can fix it
the right way.

X86 uses pfn tracking to do pfnmaps.  AFAICT, the tracking should normally
start at mmap() of device drivers, then untracked when munmap().  However
in the current code the untrack is done in unmap_single_vma().  This might
be problematic.

For example, unmap_single_vma() can be used nowadays even for zapping a
single page rather than the whole vmas.  It's very confusing to do whole
vma untracking in this function even if a caller would like to zap one
page.  It could simply be wrong.

Such issue won't be exposed by things like MADV_DONTNEED won't ever work
for pfnmaps and it'll fail the madvise() already before reaching here.
However looks like it can be triggered like what was reported where invoked
from an unmap request from a file vma.

There's also work [5] on VFIO (merged now) to allow tearing down MMIO
pgtables before an munmap(), in which case we may not want to untrack the
pfns if we're only tearing down the pgtables.  IOW, we may want to keep the
pfn tracking information as those pfn mappings can be restored later with
the same vma object.  Currently it's not an immediate problem for VFIO, as
VFIO uses UC- by default, but it looks like there's plan to extend that in
the near future.

IIUC, this was overlooked when zap_page_range_single() was introduced,
while in the past it was only used in the munmap() path which wants to
always unmap the region completely.  E.g., commit f5cc4eef9987 ("VM: make
zap_page_range() callers that act on a single VMA use separate helper") is
the initial commit that introduced unmap_single_vma(), in which the chunk
of untrack_pfn() was moved over from unmap_vmas().

Recover that behavior to untrack pfnmap only when unmap regions.

[1] https://lore.kernel.org/r/20240523223745.395337-1-peterx@xxxxxxxxxx
[2] https://groups.google.com/g/syzkaller-bugs/c/FeQZvSbqWbQ/m/tHFmoZthAAAJ
[3] https://lore.kernel.org/r/20240712131931.20207-1-00107082@xxxxxxx
[4] https://lore.kernel.org/all/20240710-bug12-v1-1-0e5440f9b8d3@xxxxxxxxx/
[5] https://lore.kernel.org/r/20240523195629.218043-1-alex.williamson@xxxxxxxxxx

Cc: Alex Williamson <alex.williamson@xxxxxxxxxx>
Cc: Jason Gunthorpe <jgg@xxxxxxxxxx>
Cc: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
Cc: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
Cc: Andy Lutomirski <luto@xxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Borislav Petkov <bp@xxxxxxxxx>
Cc: Kirill A. Shutemov <kirill@xxxxxxxxxxxxx>
Cc: x86@xxxxxxxxxx
Cc: Yan Zhao <yan.y.zhao@xxxxxxxxx>
Cc: Kevin Tian <kevin.tian@xxxxxxxxx>
Cc: Pei Li <peili.dev@xxxxxxxxx>
Cc: David Hildenbrand <david@xxxxxxxxxx>
Cc: David Wang <00107082@xxxxxxx>
Cc: Bert Karwatzki <spasswolf@xxxxxx>
Cc: Sergey Senozhatsky <senozhatsky@xxxxxxxxxxxx>
Signed-off-by: Peter Xu <peterx@xxxxxxxxxx>
---

NOTE: I massaged the commit message comparing to the rfc post [1], the
patch itself is untouched.  Also removed rfc tag, and added more people
into the loop. Please kindly help test this patch if you have a reproducer,
as I can't reproduce it myself even with the syzbot reproducer on top of
mm-unstable.  Instead of further check on the reproducer, I decided to send
this out first as we have a bunch of reproducers on the list now..
---
  mm/memory.c | 5 ++---
  1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 4bcd79619574..f57cc304b318 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1827,9 +1827,6 @@ static void unmap_single_vma(struct mmu_gather *tlb,
  	if (vma->vm_file)
  		uprobe_munmap(vma, start, end);
  
-	if (unlikely(vma->vm_flags & VM_PFNMAP))
-		untrack_pfn(vma, 0, 0, mm_wr_locked);
-
Specifically to VFIO's case, looks it doesn't matter if untrack_pfn() is
called here, since remap_pfn_range() is not called in mmap() and fault
handler, and therefore (vma->vm_flags & VM_PAT) is always 0.

Right when with current repo, but I'm thinking maybe we should have VM_PAT
there..
Yes, I agree.

But, currently for VFIO, it cannot call io_remap_pfn_range() in the fault
handler since vm_flags_set() requires mmap lock held for write while
the fault handler can only hold mmap lock for read.
So, it relies on ioremap()/iounmap() to reserve/de-reserve memtypes,
without VM_PAT being set in vma.

Right, neither vm_flags_set() nor io_remap_pfn_range() should be called in
a fault handler.  They should only be called in mmap() phase.

When you mentioned ioremap(), it's only for the VGA device, right?  I don't
see it's used in the vfio-pci's fault handler.



See what reserve_pfn_range() does right now: it'll make sure only one owner
reserve this area, e.g. memtype_reserve() will fail if it has already been
reserved.  Then when succeeded as the first one to reserve the region,
it'll make sure this mem type to also be synchronized to the kernel map
(memtype_kernel_map_sync).

So I feel like when MMIO disabled for a VFIO card, we may need to call
reserve_pfn_range(), telling the kernel the mem type VFIO uses, even if the
pgtable will be empty, and even if currently it's always UC- for now:

vfio_pci_core_mmap():
	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

Then the memtype will be reserved even if it cannot be used.  Otherwise I
don't know whether there's side effects of kernel identity mapping where it
mismatch later with what's mapped in the userspace via the vma, when MMIO
got enabled again: pgtable started to be injected with a different memtype
that specified only in the vma's pgprot.
Current VFIO relies on pci_iomap() to reserve MMIO pfns as UC-, thus
no VM_PAT in vmas.
  
Calling remap_pfn_range() in the mmap() will cause problem to support
dynamic faulting. Looks there's still a window even with an immediate
unmap after mmap().

It can be conditionally injected.  See Alex's commit (not yet posted
anywhere, only used in our internal testing so far):

https://github.com/xzpeter/linux/commit/f432e0e46c34e493943034be4cb9d6eb638f57d1


Also, calling remap_pfn_range() in mmap() may not meet the requirement of
drivers to dynamic switch backend resources, e.g. as what's in
cxl_mmap_fault(), since one remap_pfn_range() cannot reserve memtypes for
all backend resources at once.

So, is there any way for the driver to reserve memtypes and set VM_PAT in
fault handler?

I must confess I'm not familiar with memtype stuff, and just started
looking recently.

Per my reading so far, we have these multiple ways of doing memtype
reservations, and no matter which API we use (ioremap / track_pfn_remap /
pci_iomap), the same memtype needs to be used otherwise the reservation
will fail.  Here ioremap / pci_iomap will involve one more vmap so that the
virtual mapping will present already after the API returns.

Then here IMHO track_pfn_remap() is the one that we should still use in
page-level mmap() controls, because so far we don't want to map any MMIO
range yet, however we want to say "hey this VMA maps something special", by
reserving these memtype and set VM_PAT.  We want the virtual mapping only
later during a fault().

In short, it looks to me the right thing we should do is to manually invoke
track_pfn_remap() in vfio-pci's mmap() hook.

	if (!vdev->pm_runtime_engaged && __vfio_pci_memory_enabled(vdev))
		ret = remap_pfn_range(vma, vma->vm_start, pfn,
				      req_len, vma->vm_page_prot);
	else
                 // TODO: manually invoke track_pfn_remap() here
		vm_flags_set(vma, VM_IO | VM_PFNMAP |
				  VM_DONTEXPAND | VM_DONTDUMP);

Then the vma is registered, and untrack_pfn() should be automatically done
when vma unmaps (and that relies on this patch to not do that too early).
 From that POV, I still think this patch does the right thing and should be
merged, even if we probably have one more issue to fix as David Wang
reported..

Especially if it only applies to OOT drivers. Do we have report from 
in-tree drivers?

I'd be curious how a previous truncation on a file can tear down a 
PFNMAP mapping -- and if there are simple ways to forbid that (if possible).

--
Cheers,

David / dhildenb