Re: WARNING in try_grab_page

Matthew Wilcox <willy@xxxxxxxxxxxxx> · Thu, 3 Aug 2023 14:19:20 +0100

On Thu, Aug 03, 2023 at 04:56:03PM +0800, Yikebaer Aizezi wrote:
> console output:
> https://drive.google.com/file/d/1Lq71bFwtEDix82PEf_193CLG6uh1Pjj9/view?usp=drive_link

I dug through this, and what I found troubles me.

 ------------[ cut here ]------------
 WARNING: CPU: 0 PID: 13067 at mm/gup.c:229 try_grab_page+0x2dd/0x3a0
 Modules linked in:
 CPU: 0 PID: 13067 Comm: syz-executor Tainted: G    B              6.5.0-rc2 #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
 RIP: 0010:try_grab_page+0x2dd/0x3a0
 Code: ff be 04 00 00 00 4c 89 e7 e8 cf fa 13 00 f0 41 ff 04 24 e8 65 96 cb ff 45 31 e4 5b 44 89 e0 5d 41 5c 41 5d c3 e8 53 96 cb ff <0f> 0b e8 4c 96 cb ff 41 bc f4 ff ff ff 5b 44 89 e0 5d 41 5c 41 5d
 RSP: 0018:ffffc9000c2777e0 EFLAGS: 00010212
 RAX: 0000000000000247 RBX: ffffea00003ae340 RCX: ffffc90002bb1000
 RDX: 0000000000040000 RSI: ffffffff81ad81ed RDI: ffffea00003ae374
 RBP: ffffea00003ae340 R08: 0000000000000000 R09: fffff94000075c6e
 R10: ffffea00003ae377 R11: 0000000000084001 R12: ffffea00003ae374
 R13: 0000000000210002 R14: ffffea00003ae340 R15: 000000000eb8d225
 FS:  00007f5841a13640(0000) GS:ffff888063e00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000500310 CR3: 0000000018d0c000 CR4: 0000000000750ef0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 PKRU: 55555554
 Call Trace:
  <TASK>
  ? __warn+0xe2/0x340
  ? try_grab_page+0x2dd/0x3a0
  ? report_bug+0x25d/0x460
  ? handle_bug+0x3c/0x70
  ? exc_invalid_op+0x14/0x40
  ? asm_exc_invalid_op+0x16/0x20
  ? try_grab_page+0x2dd/0x3a0
  ? try_grab_page+0x2dd/0x3a0
  follow_page_pte+0x18c/0x1610
  ? try_grab_page+0x3a0/0x3a0
  ? rcu_is_watching+0xe/0xb0
  follow_page_mask+0x2e4/0xbd0
  __get_user_pages+0x3fa/0xcf0
  ? follow_page_mask+0xbd0/0xbd0
  ? down_read_killable+0x146/0x4f0
  ? down_read_interruptible+0x4f0/0x4f0
  ? rcu_is_watching+0xe/0xb0
  __gup_longterm_locked+0x5fa/0x1ec0
  ? io_schedule_timeout+0x150/0x150
  ? rcu_is_watching+0xe/0xb0
  ? get_user_pages_unlocked+0x580/0x580
  ? lock_release+0x4f7/0x670
  ? internal_get_user_pages_fast+0xe27/0x2690
  ? lock_downgrade+0x690/0x690
  ? preempt_schedule_common+0x45/0xb0
  ? pud_huge+0x9c/0xe0
  ? pmd_huge+0xe0/0xe0
  internal_get_user_pages_fast+0x119b/0x2690
  ? mtree_load+0x1df/0x980
  ? __gup_device_huge+0x530/0x530
  ? rcu_is_watching+0xe/0xb0
  ? lock_release+0x4f7/0x670
  get_user_pages_fast+0x95/0xe0
  ? get_user_pages_fast_only+0xe0/0xe0
  do_get_mempolicy+0x50c/0xd20
  ? sp_delete+0xf0/0xf0
  ? seccomp_notify_ioctl+0xd80/0xd80
  __x64_sys_get_mempolicy+0x187/0x2a0
  ? __ia32_sys_migrate_pages+0xf0/0xf0
  ? __secure_computing+0x1ff/0x360
  do_syscall_64+0x35/0xb0
  entry_SYSCALL_64_after_hwframe+0x63/0xcd
 RIP: 0033:0x47959d
 Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b4 ff ff ff f7 d8 64 89 01 48
 RSP: 002b:00007f5841a13068 EFLAGS: 00000246 ORIG_RAX: 00000000000000ef
 RAX: ffffffffffffffda RBX: 000000000059c0a0 RCX: 000000000047959d
 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
 RBP: 000000000059c0a0 R08: 0000000000000003 R09: 0000000000000000
 R10: 0000000020ff9000 R11: 0000000000000246 R12: 000000000059c0ac
 R13: 000000000000000b R14: 0000000000437250 R15: 00007f58419f3000
  </TASK>
 Kernel panic - not syncing: kernel: panic_on_warn set ...

> WARNING: CPU: 0 PID: 13067 at mm/gup.c:229 try_grab_page+0x2dd/0x3a0

That's this line:
        if (WARN_ON_ONCE(folio_ref_count(folio) <= 0))
Called from:
  follow_page_pte+0x18c/0x1610

That did:
        ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
        pte = ptep_get(ptep);
        page = vm_normal_page(vma, address, pte);
        ret = try_grab_page(page, flags);

So we grabbed the PTE lock, looked up the PTE, translated that into
a page ... and found a page with a zero (or negative) refcount.
That's Really Bad.  I think it was a zero refcount because R08 is 0
and I don't see any other register holding a plausible negative
32-bit number.
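
For anyone who wants to repeat that check by hand, here is a rough
userspace sketch.  The register values are copied from the dump above;
the names regs, low32() and count_candidates() are made up for
illustration and are not kernel code.  It reinterprets the low 32 bits
of each register as a signed int (folio_ref_count() returns int) and
counts the ones that are <= 0.

```c
#include <stddef.h>
#include <stdint.h>

/* General-purpose register values copied from the oops above. */
struct reg {
	const char *name;
	uint64_t val;
};

static const struct reg regs[] = {
	{ "RAX", 0x0000000000000247ULL }, { "RBX", 0xffffea00003ae340ULL },
	{ "RCX", 0xffffc90002bb1000ULL }, { "RDX", 0x0000000000040000ULL },
	{ "RSI", 0xffffffff81ad81edULL }, { "RDI", 0xffffea00003ae374ULL },
	{ "RBP", 0xffffea00003ae340ULL }, { "R08", 0x0000000000000000ULL },
	{ "R09", 0xfffff94000075c6eULL }, { "R10", 0xffffea00003ae377ULL },
	{ "R11", 0x0000000000084001ULL }, { "R12", 0xffffea00003ae374ULL },
	{ "R13", 0x0000000000210002ULL }, { "R14", 0xffffea00003ae340ULL },
	{ "R15", 0x000000000eb8d225ULL },
};

#define NR_REGS (sizeof(regs) / sizeof(regs[0]))

/*
 * Low 32 bits reinterpreted as a signed int.  The unsigned-to-signed
 * conversion is implementation-defined before C23; gcc and clang wrap.
 */
static int32_t low32(uint64_t v)
{
	return (int32_t)(uint32_t)v;
}

/* Count registers whose low 32 bits are <= 0 when read as an int. */
static size_t count_candidates(void)
{
	size_t i, n = 0;

	for (i = 0; i < NR_REGS; i++)
		if (low32(regs[i].val) <= 0)
			n++;
	return n;
}
```

This flags two registers: R08, which is exactly 0, and RSI, whose low
half is negative.  RSI as a whole (ffffffff81ad81ed) looks like a kernel
text address rather than a counter, which is my reading of why it does
not count as "plausible" here.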

Yikebaer, could I trouble you to add this:

--- a/mm/gup.c
+++ b/mm/gup.c
@@ -226,7 +226,7 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
 {
        struct folio *folio = page_folio(page);

-       if (WARN_ON_ONCE(folio_ref_count(folio) <= 0))
+       if (VM_WARN_ON_ONCE_FOLIO(folio_ref_count(folio) <= 0, folio))
                return -ENOMEM;

        if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))

and rerun syzkaller?  That'll give us some more information about
what has happened, although it won't tell us why it happened.

We might need to trap whoever decrements the refcount below the
mapcount to track this down ... which will be tricky, given the other
things we reuse the mapcount for.
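
To make the invariant concrete, here is a toy userspace model.  It is
only a sketch: struct toy_folio and toy_put_violates() are hypothetical
names, not kernel code, and it assumes the mapcount field holds nothing
but a mapping count, which is exactly the complication mentioned above.
The idea is that every page-table mapping should hold a reference, so a
put that leaves refcount < mapcount is the event we would want to trap.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a folio; not the kernel's struct folio. */
struct toy_folio {
	int refcount;	/* references held on the folio */
	int mapcount;	/* page-table mappings; each should hold a reference */
};

/*
 * Hypothetical helper: drop one reference and report whether the
 * refcount has fallen below the mapcount.
 */
static bool toy_put_violates(struct toy_folio *f)
{
	f->refcount--;
	return f->refcount < f->mapcount;
}
```

With refcount 2 and mapcount 1, the first put is fine and the second
one trips the check, leaving a still-mapped folio at refcount 0, the
state the warning in try_grab_page() reported.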