Re: [Question] CoW on VM_PFNMAP vma during write fault

David Hildenbrand <david@xxxxxxxxxx> · Tue, 27 Feb 2024 14:15:51 +0100

On 27.02.24 14:00, David Hildenbrand wrote:
On 27.02.24 13:28, Wupeng Ma wrote:
We find that a warn will be produced during our test, the detail log is
shown in the end.

The core problem of this warn is that the first pfn of this pfnmap vma is
cleared during memory-failure. Digging into the source we find that this
problem can be triggered as following:

// mmap with MAP_PRIVATE and specific fd which hook mmap
mmap(MAP_PRIVATE, fd)
    __mmap_region
      remap_pfn_range
      // set vma with pfnmap and the prot of pte is read only
	

Okay, so we get a MAP_PRIVATE VM_PFNMAP I assume.

What fd is that exactly? Often, we disallow private mappings in the
mmap() callback (for a good reason).

// memset this memory with trigger fault
handle_mm_fault
    __handle_mm_fault
      handle_pte_fault
        // write fault and !pte_write(entry)
        do_wp_page
          wp_page_copy // this will alloc a new page with valid page struct
                       // for this pfnmap vma

Here we replace the mapped PFNMAP thingy by a proper anon folio.


// inject a hwpoison to the first page of this vma

I assume this is an anon folio?

madvise_inject_error
    memory_failure
      hwpoison_user_mappings
        try_to_unmap_one
          // mark this pte as invalid (hwpoison)
          mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
                  address, range.end);

// during unmap vma, the first pfn of this pfnmap vma is invalid
vm_mmap_pgoff
    do_mmap
      __do_mmap_mm
        __mmap_region
          __do_munmap
            unmap_region
              unmap_vmas
                unmap_single_vma
                  untrack_pfn
                    follow_phys // pte is already invalidate, WARN_ON here

unmap_single_vma()->...->zap_pte_range() should do the right thing when
calling vm_normal_page().

untrack_pfn() is the problematic part.


CoW with a valid page for pfnmap vma is weird to us. Can we use
remap_pfn_range for private vma(read only)? Once CoW happens on a pfnmap
vma during write fault, this page is normal(page flag is valid) for most mm
subsystems, such as memory failure in thais case and extra should be done to
handle this special page.

During unmap, if this vma is pfnmap, unmap shouldn't be done since page
should not be touched for pfnmap vma.

But the root problem is that can we insert a valid page for pfnmap vma?

Any thoughts to solve this warn?

vm_normal_page() documentation explains how that magic is supposed to
work. vm_normal_page() should be able to correctly identify whether we
want to look at the struct page for an anon folio that was COWed.


untrack_pfn() indeed does not seem to be well prepared for handling
MAP_PRIVATE mappings where we end up having anon folios.

I think it will already *completely mess up* simply when unmapping the
range without the memory failure involved.

See, follow_phys() would get the PFN of the anon folio and then
untrack_pfn() would do some nonesense with that. Completely broken.

The WARN is just a side-effect of the brokenness.

In follow_phys(), we'd likely have to call vm_normal_page(). If we get a
page back, we'd likely have to fail follow_phys() instead of returning a
PFN of an anon folio.

Now, how do we fix untrack_pfn() ? I really don't know. In theory, we
might no longer have *any* PFNMAP PFN in there after COW'ing everything.

Sounds like MAP_PRIVATE VM_PFNMAP + __HAVE_PFNMAP_TRACKING is some
broken garbage (sorry). Can we disallow it?

Staring at track_pfn_copy(), it's maybe similarly broken?

I think we want to do:

diff --git a/mm/memory.c b/mm/memory.c
index 098356b8805ae..da5d1e37c5534 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6050,6 +6050,10 @@ int follow_phys(struct vm_area_struct *vma,
                goto out;
        pte = ptep_get(ptep);
 
+       /* Never return addresses of COW'ed anon folios. */
+       if (vm_normal_page(vma, address, pte))
+               goto unlock;
+
        if ((flags & FOLL_WRITE) && !pte_write(pte))
                goto unlock;
 

And then, just disallow it with PAT involved:

diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index 0904d7e8e1260..e4d2b2e8c0281 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -997,6 +997,15 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
                                && size == (vma->vm_end - vma->vm_start))) {
                int ret;
 
+               /*
+                * untrack_pfn() and friends cannot handl regions that suddenly
+                * contain anon folios after COW. In particular, follow_phys()
+                * will fail when we have an anon folio at the beginning og the
+                * VMA.
+                */
+               if (vma && is_cow_mapping(vma->vm_flags))
+                       return -EINVAL;
+
                ret = reserve_pfn_range(paddr, size, prot, 0);
                if (ret == 0 && vma)
                        vm_flags_set(vma, VM_PAT);


I'm afraid that will break something. But well, it's already semi-broken.

As long as VM_PAT is not involved, it should work as expected.

In an ideal world, we'd get rid of follow_phys() completely and just
derive that information from the VMA?

--
Cheers,

David / dhildenb