On 2024/3/2 5:47, Matthew Wilcox (Oracle) wrote:
> Oscar and I have been exchanging a bit of email recently about the
> bug reported here:
> https://lore.kernel.org/all/ZXNhGsX32y19a2Xv@xxxxxxxxxxxxxxxxxxxx

Thanks for your patch.

>
> I've come to the conclusion that folio_test_hugetlb() is just too fragile
> as it can give both false positives and false negatives, as well as
> resulting in the above bug. With this patch series, it becomes a lot
> more robust. In the memory-failure case, we always hold the hugetlb_lock
> so it's perfectly reliable. In the compaction case, it's unreliable, but
> the failures are acceptable and we recheck after taking the hugetlb_lock.

I encountered a similar issue with the PageSwapCache check when doing memory-failure testing:

[66258.945079] page:00000000135e1205 refcount:1 mapcount:0 mapping:0000000000000000 index:0x9b pfn:0xa04e9a
[66258.949096] head:0000000038449724 order:9 entire_mapcount:1 nr_pages_mapped:0 pincount:0
[66258.949485] memcg:ffff95fb43379000
[66258.950334] anon flags: 0x6fffc00000a0068(uptodate|lru|head|mappedtodisk|swapbacked|node=1|zone=2|lastcpupid=0x3fff)
[66258.951212] page_type: 0xffffffff()
[66258.951882] raw: 06fffc0000000000 ffffc89628138001 dead000000000122 dead000000000400
[66258.952273] raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
[66258.952884] head: 06fffc00000a0068 ffffc896218a8008 ffffc89621680008 ffff95fb4349c439
[66258.953239] head: 0000000700000600 0000000000000000 00000001ffffffff ffff95fb43379000
[66258.953725] page dumped because: VM_BUG_ON_PAGE(PageTail(page))
[66258.954497] ------------[ cut here ]------------
[66258.954937] kernel BUG at include/linux/page-flags.h:313!
[66258.956502] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[66258.957001] CPU: 14 PID: 174237 Comm: page-types Kdump: loaded Not tainted 6.8.0-rc1-00162-gd162e170f118 #11
[66258.957001] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[66258.958415] RIP: 0010:folio_flags.constprop.0+0x1c/0x50
[66258.958415] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 90 48 8b 57 08 48 89 f8 83 e2 01 74 12 48 c7 c6 a0 59 34 a7 48 89 c7 e8 b5 60 e8 ff 90 <0f> 0b 66 90 c3 cc cc cc cc f7 c7 ff 0f 00 00 75 1a 48 8b 17 83 e2
[66258.958415] RSP: 0018:ffffa0f38ae53e00 EFLAGS: 00000282
[66258.958415] RAX: 0000000000000033 RBX: 0000000000000000 RCX: ffff96031fd9c9c8
[66258.958415] RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff96031fd9c9c0
[66258.958415] RBP: ffffc8962813a680 R08: ffffffffa7756f88 R09: 0000000000009ffb
[66258.962155] R10: 000000000000054a R11: ffffffffa7726fa0 R12: 06fffc0000000000
[66258.962155] R13: 0000000000000000 R14: 00007fff93bf1348 R15: 0000000000a04e9a
[66258.962155] FS: 00007f47cc5c4740(0000) GS:ffff96031fd80000(0000) knlGS:0000000000000000
[66258.962155] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[66258.962155] CR2: 00007fff93c7b000 CR3: 0000000850c28000 CR4: 00000000000006f0
[66258.962155] Call Trace:
[66258.962155] <TASK>
[66258.965730] ? die+0x32/0x90
[66258.965730] ? do_trap+0xdf/0x110
[66258.965730] ? folio_flags.constprop.0+0x1c/0x50
[66258.965730] ? do_error_trap+0x8b/0x110
[66258.965730] ? folio_flags.constprop.0+0x1c/0x50
[66258.965730] ? folio_flags.constprop.0+0x1c/0x50
[66258.965730] ? exc_invalid_op+0x53/0x70
[66258.965730] ? folio_flags.constprop.0+0x1c/0x50
[66258.965730] ? asm_exc_invalid_op+0x1a/0x20
[66258.965730] ? folio_flags.constprop.0+0x1c/0x50
[66258.965730] stable_page_flags+0x210/0x940
[66258.965730] kpageflags_read+0x97/0xf0
[66258.965730] vfs_read+0xa0/0x370
[66258.965730] __x64_sys_pread64+0x90/0xc0
[66258.965730] do_syscall_64+0xcd/0x1e0
[66258.965730] entry_SYSCALL_64_after_hwframe+0x6f/0x77
[66258.965730] RIP: 0033:0x7f47cc31274a
[66258.969711] Code: 44 24 78 00 00 00 00 e9 2b f1 ff ff 0f 1f 40 00 f3 0f 1e fa 49 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 5e c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
[66258.969711] RSP: 002b:00007fff93af1298 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
[66258.969711] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f47cc31274a
[66258.969711] RDX: 0000000000000008 RSI: 00007fff93bf1340 RDI: 0000000000000004
[66258.969711] RBP: 00007fff93af12e0 R08: 0000000000000001 R09: 8100000000a04e99
[66258.969711] R10: 00000000050274d0 R11: 0000000000000246 R12: 00007fff93cf1588
[66258.972680] R13: 0000000000404af1 R14: 000000000040ad78 R15: 00007f47cc609040
[66258.972680] </TASK>
[66258.972680] Modules linked in: mce_inject hwpoison_inject

After debugging, I think the race below leads to the above panic:

CPU1                                                    CPU2
kpageflags_read
 stable_page_flags
  PageSwapCache() check 4k page without page refcnt held
   folio_test_swapcache(page_folio(page));
    folio_test_swapbacked(folio) && /* page is swapbacked. */
                                                        page is freed into buddy and merged into larger order.
                                                        page is allocated as THP tail page.
    test_bit(PG_swapcache, folio_flags(folio, 0)); /* BUG_ON PageTail check in folio_flags. It's tail page now! */

So the PageSwapCache test is fragile too. Any thoughts on how to fix this 'similar' issue?

Thanks.

>
> The cost of this reliability is that we now consume the word I recently
> freed in folio->page[1]. I think this is acceptable; we've still gained
> a completely reliable folio_test_hugetlb() (which we didn't have before
> I started messing around with the folio dtors). Non-hugetlb users
> can use large_id as a pointer to something else entirely, or even as a
> non-pointer, as long as they can guarantee it can't conflict (ie don't
> use it as a bitfield).
>
> So far, this is working for me. Some stress testing would be appreciated.
>
> Matthew Wilcox (Oracle) (5):
>   hugetlb: Make folio_test_hugetlb safer to call
>   hugetlb: Add hugetlb_pfn_folio
>   memory-failure: Use hugetlb_pfn_folio
>   memory-failure: Reorganise get_huge_page_for_hwpoison()
>   compaction: Use hugetlb_pfn_folio in isolate_migratepages_block
>
>  include/linux/hugetlb.h    | 13 ++-----
>  include/linux/mm.h         |  8 -----
>  include/linux/mm_types.h   |  4 ++-
>  include/linux/page-flags.h | 25 +++----------
>  kernel/vmcore_info.c       |  3 +-
>  mm/compaction.c            | 16 ++++-----
>  mm/huge_memory.c           | 10 ++----
>  mm/hugetlb.c               | 72 +++++++++++++++++++++++++++++---------
>  mm/memory-failure.c        | 14 +++++---
>  9 files changed, 87 insertions(+), 78 deletions(-)
>
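
For anyone who wants to reason about the window in the CPU1/CPU2 race above without a kernel build, here is a minimal userspace sketch in C. It is only a model, not kernel code: struct model_page, the MODEL_* flags, the check_result values and every model_* function are hypothetical names made up for illustration. The interleaving is replayed deterministically in a single thread, and where the real folio_flags() would hit VM_BUG_ON the model just returns a status.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of this matches the kernel's layout. */
enum { MODEL_SWAPBACKED = 1u << 0, MODEL_SWAPCACHE = 1u << 1 };

struct model_page {
	unsigned int flags;
	struct model_page *head;	/* non-NULL once the page is a tail page */
};

enum check_result { CHECK_FALSE, CHECK_TRUE, CHECK_HIT_TAIL };

/* First half of the check; stands in for folio_test_swapbacked(). */
static bool model_test_swapbacked(const struct model_page *p)
{
	return (p->flags & MODEL_SWAPBACKED) != 0;
}

/*
 * Second half; stands in for test_bit(PG_swapcache, folio_flags(folio, 0)).
 * The model reports CHECK_HIT_TAIL at the point where the kernel would BUG.
 */
static enum check_result model_test_swapcache(const struct model_page *p)
{
	if (p->head != NULL)
		return CHECK_HIT_TAIL;
	return (p->flags & MODEL_SWAPCACHE) ? CHECK_TRUE : CHECK_FALSE;
}

/* CPU2's side: the page is freed, then reused as a tail page of new_head. */
static void model_realloc_as_tail(struct model_page *p,
				  struct model_page *new_head)
{
	p->flags = 0;
	p->head = new_head;
}

/*
 * CPU1's side: the two-step check done without holding a reference.
 * When inject_race is true, CPU2's action is replayed between the steps.
 */
enum check_result model_swapcache_check(struct model_page *p,
					struct model_page *new_head,
					bool inject_race)
{
	assert(p != NULL);
	assert(!inject_race || new_head != NULL);

	if (!model_test_swapbacked(p))
		return CHECK_FALSE;
	if (inject_race)
		model_realloc_as_tail(p, new_head);
	return model_test_swapcache(p);
}
```

Calling model_swapcache_check() on a page whose flags are MODEL_SWAPBACKED gives CHECK_FALSE with inject_race == false and CHECK_HIT_TAIL with inject_race == true. The point of the sketch is that the two flag tests are not atomic with respect to the page's lifetime, so whatever fix is chosen has to either pin the page across both tests or tolerate the page changing type in between.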