[Question] CoW on VM_PFNMAP vma during write fault

Wupeng Ma <mawupeng1@xxxxxxxxxx> · Tue, 27 Feb 2024 20:28:14 +0800

A warning is triggered during our testing; the detailed log is shown at
the end of this mail.

The core problem behind this warning is that the pte for the first pfn of
this pfnmap vma is cleared during memory-failure handling. Digging into the
source, we find that the problem can be triggered as follows:

// mmap with MAP_PRIVATE on an fd whose driver hooks mmap
mmap(MAP_PRIVATE, fd)
  __mmap_region
    remap_pfn_range
    // marks the vma as pfnmap; the pte protection is read-only

// memset() on this memory triggers a write fault
handle_mm_fault
  __handle_mm_fault
    handle_pte_fault
      // write fault and !pte_write(entry)
      do_wp_page
        wp_page_copy // this allocates a new page with a valid struct page
                     // for this pfnmap vma

// inject hwpoison into the first page of this vma
madvise_inject_error
  memory_failure
    hwpoison_user_mappings
      try_to_unmap_one
        // mark this pte as invalid (hwpoison)
        mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
                address, range.end);

// when the vma is unmapped, the pte for its first pfn is already invalid
vm_mmap_pgoff
  do_mmap
    __do_mmap_mm
      __mmap_region
        __do_munmap
          unmap_region
            unmap_vmas
              unmap_single_vma
                untrack_pfn
                  follow_phys // pte is already invalidated, WARN_ON here
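
For reference, the userspace side of the sequence above can be sketched
roughly as below. This is only a sketch and not our exact test program: the
device path, the mapping length and the helper name run_reproducer() are
hypothetical, and it assumes a test driver whose mmap hook calls
remap_pfn_range. MADV_HWPOISON needs CAP_SYS_ADMIN and CONFIG_MEMORY_FAILURE
and poisons a real page, so this should only be run in a throwaway VM.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_HWPOISON
#define MADV_HWPOISON 100	/* value from the Linux uapi headers */
#endif

/*
 * Hypothetical helper: dev_path must name a device whose mmap hook calls
 * remap_pfn_range(); len is the size of the region that driver maps.
 * Returns 0 if every step succeeded, -1 otherwise.
 */
int run_reproducer(const char *dev_path, size_t len)
{
	long page_size = sysconf(_SC_PAGESIZE);
	char *p;
	int fd, rc;

	fd = open(dev_path, O_RDWR);
	if (fd < 0)
		return -1;

	/* Step 1: private mapping, the driver sets up a pfnmap vma. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED) {
		close(fd);
		return -1;
	}

	/* Step 2: write faults on read-only ptes, so CoW installs normal pages. */
	memset(p, 0, len);

	/* Step 3: inject hwpoison into the first page of the vma. */
	rc = madvise(p, (size_t)page_size, MADV_HWPOISON);

	/* Step 4: unmapping (or simply exiting) reaches unmap_single_vma(). */
	munmap(p, len);
	close(fd);
	return rc == 0 ? 0 : -1;
}
```

Calling something like run_reproducer("/dev/remap_pfn", 8 * 4096) from
main() would walk the four steps in order; both the path and the length in
that call are placeholders.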

CoW that installs a page with a valid struct page into a pfnmap vma looks
odd to us. Can remap_pfn_range be used for a private (read-only) vma? Once
CoW happens on a pfnmap vma during a write fault, the new page is a normal
page (its page flags are valid) as far as most mm subsystems are concerned,
such as memory-failure in this case, and extra work is needed to handle
this special page.

During unmap, if the vma is a pfnmap vma, the unmap shouldn't be done,
since the pages of a pfnmap vma should not be touched.

But the root question is: can we insert a page with a valid struct page
into a pfnmap vma?

Any thoughts on how to resolve this warning?

------------[ cut here ]------------
WARNING: CPU: 0 PID: 503 at arch/x86/mm/pat/memtype.c:1060 untrack_pfn+0xed/0x100
Modules linked in: remap_pfn(OE)
CPU: 0 PID: 503 Comm: remap_pfn Tainted: G           OE      6.8.0-rc6-dirty #436
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
RIP: 0010:untrack_pfn+0xed/0x100
Code: cc cc cc cc 48 8b 43 10 8b a8 e8 00 00 00 3b 6b 28 74 ca 48 8b 7b 30 e8 81 de cf 00 89 6b 28 48 8b 7b 30 e8 05 cc b7 e8 ba c3 ce 00 66 2e 0f 1f 84 00 00 00 00 00 90 90 90
RSP: 0018:ffffb5f683eafc78 EFLAGS: 00010282
RAX: 00000000ffffffea RBX: ffff960b18537658 RCX: 0000000000000043
RDX: bfffffffdcb13e00 RSI: 0000000000000043 RDI: ffff960e45b7a140
RBP: 0000000000000000 R08: 00007f7df7a00000 R09: ffff960a00000fb8
R10: ffff960a00000000 R11: 000fffffffffffff R12: 00007f7df7a08000
R13: 0000000000000000 R14: ffffb5f683eafdc8 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff960e2fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7df7aeb110 CR3: 0000000118b66003 CR4: 0000000000770ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 <TASK>
 ? __warn+0x84/0x140
 ? untrack_pfn+0xed/0x100
 ? report_bug+0x1bd/0x1d0
 ? handle_bug+0x3c/0x70
 ? exc_invalid_op+0x18/0x70
 ? asm_exc_invalid_op+0x1a/0x20
 ? untrack_pfn+0xed/0x100
 ? untrack_pfn+0x5c/0x100
 unmap_single_vma+0xa6/0xe0
 unmap_vmas+0xb2/0x190
 exit_mmap+0xee/0x3c0
 mmput+0x68/0x120
 do_exit+0x2ec/0xb80
 do_group_exit+0x31/0x80
 __x64_sys_exit_group+0x18/0x20
 do_syscall_64+0x66/0x180
 entry_SYSCALL_64_after_hwframe+0x6e/0x76
RIP: 0033:0x7f7df7aeb146
Code: Unable to access opcode bytes at 0x7f7df7aeb11c.
RSP: 002b:00007ffe571100a8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 00007f7df7bf08a0 RCX: 00007f7df7aeb146
RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
RBP: 0000000000000000 R08: 00000000000000e7 R09: ffffffffffffff80
R10: 0000000000000002 R11: 0000000000000246 R12: 00007f7df7bf08a0
R13: 0000000000000001 R14: 00007f7df7bf92e8 R15: 0000000000000000
 </TASK>
---[ end trace 0000000000000000 ]---