[BUG] page table UAF, Re: [PATCH v8 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region()

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Aug 30, 2024 at 6:00 AM Liam R. Howlett <Liam.Howlett@xxxxxxxxxx> wrote:
> Instead of zeroing the vma tree and then overwriting the area, let the
> area be overwritten and then clean up the gathered vmas using
> vms_complete_munmap_vmas().
>
> To ensure locking is downgraded correctly, the mm is set regardless of
> MAP_FIXED or not (NULL vma).
>
> If a driver is mapping over an existing vma, then clear the ptes before
> the call_mmap() invocation.  This is done using the vms_clean_up_area()
> helper.  If there is a close vm_ops, that must also be called to ensure
> any cleanup is done before mapping over the area.  This also means that
> calling open has been added to the abort of an unmap operation, for now.

As currently implemented, this is not a valid optimization because it
violates the (unwritten?) rule that you must not call free_pgd_range()
on a region in the page tables which can concurrently be walked. A
region in the page tables can be concurrently walked if it overlaps a
VMA which is linked into rmaps which are not write-locked.

On Linux 6.12-rc2, when you mmap(MAP_FIXED) over an existing VMA, and
the new mapping is created by expanding an adjacent VMA, the following
race with an ftruncate() is possible (because page tables for the old
mapping are removed while the new VMA in the same location is already
fully set up and linked into the rmap):


task 1 (mmap, MAP_FIXED)     task 2 (ftruncate)
========================     ==================
mmap_region
  vma_merge_new_range
    vma_expand
      commit_merge
        vma_prepare
          [take rmap locks]
        vma_set_range
          [expand adjacent mapping]
        vma_complete
          [drop rmap locks]
  vms_complete_munmap_vmas
    vms_clear_ptes
      unmap_vmas
        [removes ptes]
      free_pgtables
        [unlinks old vma from rmap]
                             unmap_mapping_range
                               unmap_mapping_pages
                                 i_mmap_lock_read
                                 unmap_mapping_range_tree
                                   [loop]
                                     unmap_mapping_range_vma
                                       zap_page_range_single
                                         unmap_single_vma
                                           unmap_page_range
                                             zap_p4d_range
                                               zap_pud_range
                                                 zap_pmd_range
                                                   [looks up pmd entry]
        free_pgd_range
          [frees pmd]
                                                   [UAF pmd entry access]

To reproduce this, apply the attached mmap-vs-truncate-racewiden.diff
to widen the race windows, then build and run the attached reproducer
mmap-fixed-race.c.

Under a kernel with KASAN, you should ideally get a KASAN splat like this:

[   90.012655][ T1113]
==================================================================
[   90.013937][ T1113] BUG: KASAN: use-after-free in
unmap_page_range+0x2d4e/0x3a80
[   90.015136][ T1113] Read of size 8 at addr ffff888006a3b000 by task
SLOWME2/1113
[   90.016322][ T1113]
[   90.016695][ T1113] CPU: 12 UID: 1000 PID: 1113 Comm: SLOWME2
Tainted: G        W          6.12.0-rc2-dirty #500
[   90.018355][ T1113] Tainted: [W]=WARN
[   90.018959][ T1113] Hardware name: QEMU Standard PC (i440FX + PIIX,
1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   90.020598][ T1113] Call Trace:
[   90.021126][ T1113]  <TASK>
[   90.021590][ T1113]  dump_stack_lvl+0x53/0x70
[   90.022307][ T1113]  print_report+0xce/0x670
[   90.023008][ T1113]  ? __pfx__raw_spin_lock_irqsave+0x10/0x10
[   90.023942][ T1113]  ? unmap_page_range+0x2d4e/0x3a80
[   90.024763][ T1113]  kasan_report+0xe2/0x120
[   90.025468][ T1113]  ? unmap_page_range+0x2d4e/0x3a80
[   90.026293][ T1113]  unmap_page_range+0x2d4e/0x3a80
[   90.027087][ T1113]  ? srso_return_thunk+0x5/0x5f
[   90.027855][ T1113]  ? next_uptodate_folio+0x148/0x890
[   90.029299][ T1113]  ? set_pte_range+0x265/0x6c0
[   90.030058][ T1113]  ? srso_return_thunk+0x5/0x5f
[   90.030826][ T1113]  ? page_table_check_set.part.0+0x2ab/0x3e0
[   90.031773][ T1113]  ? __pfx_unmap_page_range+0x10/0x10
[   90.032622][ T1113]  ? srso_return_thunk+0x5/0x5f
[   90.033394][ T1113]  ? unmap_single_vma+0xc6/0x2c0
[   90.034211][ T1113]  zap_page_range_single+0x28a/0x4b0
[   90.035052][ T1113]  ? srso_return_thunk+0x5/0x5f
[   90.035821][ T1113]  ? __pfx_zap_page_range_single+0x10/0x10
[   90.036739][ T1113]  ? __pte_offset_map+0x1d/0x250
[   90.037528][ T1113]  ? srso_return_thunk+0x5/0x5f
[   90.038295][ T1113]  ? do_fault+0x6c4/0x1270
[   90.038999][ T1113]  ? __pfx___handle_mm_fault+0x10/0x10
[   90.039862][ T1113]  unmap_mapping_range+0x1b2/0x240
[   90.040671][ T1113]  ? __pfx_unmap_mapping_range+0x10/0x10
[   90.041563][ T1113]  ? setattr_prepare+0xed/0x7e0
[   90.042330][ T1113]  ? srso_return_thunk+0x5/0x5f
[   90.043097][ T1113]  ? current_time+0x88/0x200
[   90.043826][ T1113]  shmem_setattr+0x880/0xea0
[   90.044556][ T1113]  notify_change+0x42b/0xea0
[   90.045913][ T1113]  ? do_truncate+0x10b/0x1b0
[   90.046641][ T1113]  ? srso_return_thunk+0x5/0x5f
[   90.047407][ T1113]  do_truncate+0x10b/0x1b0
[   90.048107][ T1113]  ? __pfx_do_truncate+0x10/0x10
[   90.048890][ T1113]  ? __set_task_comm+0x53/0x140
[   90.049668][ T1113]  ? srso_return_thunk+0x5/0x5f
[   90.050472][ T1113]  ? __do_sys_prctl+0x71e/0x11f0
[   90.051262][ T1113]  do_ftruncate+0x334/0x470
[   90.051976][ T1113]  ? srso_return_thunk+0x5/0x5f
[   90.052745][ T1113]  do_sys_ftruncate+0x3d/0x80
[   90.053493][ T1113]  do_syscall_64+0x4b/0x110
[   90.054209][ T1113]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   90.055142][ T1113] RIP: 0033:0x7f89cbf2bf47
[   90.055844][ T1113] Code: 77 01 c3 48 8b 15 b9 1e 0d 00 f7 d8 64 89
02 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 b8 4d
00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 89 1e 0d 00 f7 d8
64 89 02 b8
[   90.058924][ T1113] RSP: 002b:00007fff96968aa8 EFLAGS: 00000213
ORIG_RAX: 000000000000004d
[   90.060248][ T1113] RAX: ffffffffffffffda RBX: 00007fff96968c28
RCX: 00007f89cbf2bf47
[   90.061507][ T1113] RDX: 0000000000000000 RSI: 0000000000001000
RDI: 0000000000000003
[   90.063471][ T1113] RBP: 00007fff96968b10 R08: 0000000000000000
R09: 00007f89cbe29740
[   90.064727][ T1113] R10: 00007f89cbefb443 R11: 0000000000000213
R12: 0000000000000000
[   90.065990][ T1113] R13: 00007fff96968c38 R14: 000055ce5a58add8
R15: 00007f89cc05f020
[   90.067291][ T1113]  </TASK>
[   90.067772][ T1113]
[   90.068147][ T1113] The buggy address belongs to the physical page:
[   90.069168][ T1113] page: refcount:0 mapcount:0
mapping:0000000000000000 index:0xffff888006a3bf50 pfn:0x6a3b
[   90.070741][ T1113] flags: 0x100000000000000(node=0|zone=1)
[   90.071649][ T1113] raw: 0100000000000000 ffffea0000111748
ffffea000020c488 0000000000000000
[   90.073009][ T1113] raw: ffff888006a3bf50 0000000000000000
00000000ffffffff 0000000000000000
[   90.074368][ T1113] page dumped because: kasan: bad access detected
[   90.075382][ T1113]
[   90.075751][ T1113] Memory state around the buggy address:
[   90.076636][ T1113]  ffff888006a3af00: 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00
[   90.077905][ T1113]  ffff888006a3af80: 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00
[   90.079874][ T1113] >ffff888006a3b000: ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff
[   90.081173][ T1113]                    ^
[   90.081821][ T1113]  ffff888006a3b080: ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff
[   90.083134][ T1113]  ffff888006a3b100: ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff
[   90.084416][ T1113]
==================================================================
diff --git a/mm/memory.c b/mm/memory.c
index 2366578015ad..b95a43221058 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -78,6 +78,7 @@
 #include <linux/ptrace.h>
 #include <linux/vmalloc.h>
 #include <linux/sched/sysctl.h>
+#include <linux/delay.h>
 
 #include <trace/events/kmem.h>
 
@@ -410,6 +411,13 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
 				unlink_file_vma_batch_add(&vb, vma);
 			}
 			unlink_file_vma_batch_final(&vb);
+
+			if (strcmp(current->comm, "SLOWME1") == 0) {
+				pr_warn("%s: starting delay\n", __func__);
+				mdelay(2000);
+				pr_warn("%s: ending delay\n", __func__);
+			}
+
 			free_pgd_range(tlb, addr, vma->vm_end,
 				floor, next ? next->vm_start : ceiling);
 		}
@@ -1711,6 +1719,13 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 	unsigned long next;
 
 	pmd = pmd_offset(pud, addr);
+
+	if (strcmp(current->comm, "SLOWME2") == 0) {
+		pr_warn("%s: starting delay\n", __func__);
+		mdelay(2000);
+		pr_warn("%s: ending delay\n", __func__);
+	}
+
 	do {
 		next = pmd_addr_end(addr, end);
 		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
#define _GNU_SOURCE
#include <err.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/wait.h>
#include <sys/mman.h>

#define SYSCHK(x) ({          \
  typeof(x) __res = (x);      \
  if (__res == (typeof(x))-1) \
    err(1, "SYSCHK(" #x ")"); \
  __res;                      \
})

int main(void) {
  int fd = SYSCHK(memfd_create("foo", 0));
  SYSCHK(ftruncate(fd, 0x2000));
  char *map1 = SYSCHK(mmap((void*)(0x40000000-0x1000), 0x2000, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED_NOREPLACE, fd, 0));
  ((volatile char *)map1)[0] = 'A';
  ((volatile char *)map1)[0x1000] = 'B';

  char *map2 = map1 + 0x1000;
  SYSCHK(mmap(map2, 0x1000, PROT_READ, MAP_SHARED|MAP_FIXED, fd, 0));
  ((volatile char *)map2)[0];
  //SYSCHK(mprotect(map2, 0x1000, PROT_READ));

  /*
  int pipefds[2];
  SYSCHK(pipe(pipefds));
  */

  if (SYSCHK(fork()) == 0) {
    SYSCHK(munmap(map1, 0x2000));
    sleep(1);
    SYSCHK(prctl(PR_SET_NAME, "SLOWME2"));
    SYSCHK(ftruncate(fd, 0x1000)); // rmap walk
    SYSCHK(prctl(PR_SET_NAME, "task2"));
    exit(0);
  }

  //system("cat /proc/$PPID/maps");
  SYSCHK(prctl(PR_SET_NAME, "SLOWME1"));
  SYSCHK(mmap(map2, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, fd, 0x1000));
  SYSCHK(prctl(PR_SET_NAME, "task1"));
  //system("cat /proc/$PPID/maps");

  /*
  for (int i=0; i<8; i++) {
    char buf[0x1000];
    memset(buf, 'A', 0x1000);
    SYSCHK(write(pipefds[1], buf, 0x1000));
  }
  */

  int wstatus;
  SYSCHK(wait(&wstatus));
  return 0;
}

[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux