David Hildenbrand <david@xxxxxxxxxx> writes: > On 24.11.22 04:33, Matthew Wilcox wrote: >> On Thu, Nov 24, 2022 at 12:06:56PM +1100, Alistair Popple wrote: >>> >>> David Hildenbrand <david@xxxxxxxxxx> writes: >>> >>>> On 23.11.22 06:14, Hugh Dickins wrote: >>>>> On Wed, 23 Nov 2022, Gavin Shan wrote: >>>>> >>>>>> The issue is reported when removing memory through virtio_mem device. >>>>>> The transparent huge page, experienced copy-on-write fault, is wrongly >>>>>> regarded as pinned. The transparent huge page is escaped from being >>>>>> isolated in isolate_migratepages_block(). The transparent huge page >>>>>> can't be migrated and the corresponding memory block can't be put >>>>>> into offline state. >>>>>> >>>>>> Fix it by replacing page_mapcount() with total_mapcount(). With this, >>>>>> the transparent huge page can be isolated and migrated, and the memory >>>>>> block can be put into offline state. >>>>>> >>>>>> Fixes: 3917c80280c9 ("thp: change CoW semantics for anon-THP") >>>>>> Cc: stable@xxxxxxxxxxxxxxx # v5.8+ >>>>>> Reported-by: Zhenyu Zhang <zhenyzha@xxxxxxxxxx> >>>>>> Suggested-by: David Hildenbrand <david@xxxxxxxxxx> >>>>>> Signed-off-by: Gavin Shan <gshan@xxxxxxxxxx> >>>>> Interesting, good catch, looked right to me: except for the Fixes >>>>> line >>>>> and mention of v5.8. That CoW change may have added a case which easily >>>>> demonstrates the problem, but it would have been the wrong test on a THP >>>>> for long before then - but only in v5.7 were compound pages allowed >>>>> through at all to reach that test, so I think it should be >>>>> Fixes: 1da2f328fa64 ("mm,thp,compaction,cma: allow THP migration for >>>>> CMA allocations") >>>>> Cc: stable@xxxxxxxxxxxxxxx # v5.7+ >>>>> Oh, no, stop: this is not so easy, even in the latest tree. >>>>> Because at the time of that "admittedly racy check", we have no hold >>>>> at all on the page in question: and if it's PageLRU or PageCompound >>>>> at one instant, it may be different the next instant. Which leaves it >>>>> vulnerable to whatever BUG_ON()s there may be in the total_mapcount() >>>>> path - needs research. *Perhaps* there are no more BUG_ON()s in the >>>>> total_mapcount() path than in the existing page_mapcount() path. >>>>> I suspect that for this to be safe (before your patch and more so >>>>> after), >>>>> it will be necessary to shift the "admittedly racy check" down after the >>>>> get_page_unless_zero() (and check the sequence of operations when a >>>>> compound page is initialized). >>>> >>>> Grabbing a reference first sounds like the right approach to me. >>> >>> I think you're right. Without a page reference I don't think it is even >>> safe to look at struct page, at least not without synchronisation >>> against memory hot unplug which could remove the struct page. From a >>> quick glance I didn't see anything here that obviously did that though. >> Memory hotplug is the offending party here. It has to make sure >> that >> everything else is definitely quiescent before removing the struct pages. >> Otherwise you can't even try_get a refcount. Sure, I might be missing something but how can memory hotplug possibly synchronise against some process (eg. try_to_compact_pages) doing try_get(pfn_to_page(random_pfn)) without that process either informing memory hotplug that it needs pages to remain valid long enough to get a reference or detecting that memory hotplug is in the process of offlining pages? > At least alloc_contig_range() and memory offlining are mutually > exclusive due to MIGRATE_ISOLTAE. I recall that ordinary memory > compaction similarly deals with isolated pageblocks (or some other > mechanism I forgot) to not race with memory offlining. Wouldn't worry > about that for now. Sorry, agree - to be clear this discussion isn't relevant for this patch but reviewing it got me thinking about how this works. I still don't see how alloc_contig_range() is totally safe against memory offlining though. From what I can see we have the following call path to set MIGRATE_ISOLATE: alloc_contig_range(pfn) -> start_isolate_page_range(pfn) -> isolate_single_pageblock(pfn) -> page_zone(pfn_to_page(pfn)) There's nothing in that call stack that prevent offlining of the page, hence the struct page may be freed by this point. Am I missing something else here? try_to_compact_pages() has a similar issue which is what I noticed reviewing this patch: try_to_compact_pages() -> compact_zone_order() -> compact_zone() -> isolate_migratepages() -> isolate_migratepages_block() -> PageHuge(pfn_to_page(pfn)) Again I couldn't see anything in that path that would hold the page stable long enough to safely perform the pfn_to_page() and get a reference. And after a bit of fiddling I was able to trigger the below oops running the compaction_test selftest whilst offlining memory so I don't think it is safe: Entering kdb (current=0xffff8882452fb6c0, pid 5646) on processor 1 Oops: (null) due to oops @ 0xffffffff81af6486 CPU: 1 PID: 5646 Comm: compaction_test Not tainted 6.0.0-01424-gf3ec7e734795 #110 Hardware name: Gigabyte Technology Co., Ltd. B150M-D3H/B150M-D3H-CF, BIOS F24 04/11/2018 RIP: 0010:PageHuge+0xa6/0x180 Code: 00 0f 85 d0 00 00 00 48 8b 43 08 a8 01 0f 85 97 00 00 00 66 90 48 b8 00 00 00 00 00 fc ff df 48 8d 7b 50 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 02 7e 7f 31 c0 80 7b 50 02 0f 94 c0 5b 41 5c RSP: 0000:ffffc9001252efa8 EFLAGS: 00010207 RAX: dffffc0000000000 RBX: fffffffffffffffe RCX: ffffffff81af63f9 RDX: 0000000000000009 RSI: 0000000000000008 RDI: 000000000000004e RBP: ffffc9001252efb8 R08: 0000000000000000 R09: ffffea000f690007 R10: fffff94001ed2000 R11: 0000000000000001 R12: ffffea000f690008 R13: 0000000000000ab3 R14: ffffea000f690000 R15: 00000000003da400 FS: 00007f83e08b7740(0000) GS:ffff88823bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fb6e1cbb3e0 CR3: 0000000344044003 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> isolate_migratepages_block+0x43c/0x3dc0 ? reclaim_throttle+0x7a0/0x7a0 ? __reset_isolation_suitable+0x350/0x350 compact_zone+0xb73/0x31f0 ? compaction_suitable+0x1e0/0x1e0 compact_zone_order+0x127/0x240 ? compact_zone+0x31f0/0x31f0 ? __this_cpu_preempt_check+0x13/0x20 ? cpuusage_write+0x380/0x380 ? __kasan_check_read+0x11/0x20 try_to_compact_pages+0x23b/0x770 ? psi_task_change+0x201/0x300 __alloc_pages_direct_compact+0x15d/0x650 ? get_page_from_freelist+0x3fa0/0x3fa0 ? psi_task_change+0x201/0x300 ? _raw_spin_unlock+0x19/0x40 __alloc_pages_slowpath.constprop.0+0x9e1/0x2260 ? warn_alloc+0x1a0/0x1a0 ? __zone_watermark_ok+0x430/0x430 ? prepare_alloc_pages+0x40b/0x620 __alloc_pages+0x42f/0x540 ? __alloc_pages_slowpath.constprop.0+0x2260/0x2260 alloc_buddy_huge_page.isra.0+0x7c/0x130 alloc_fresh_huge_page+0x1f1/0x4e0 alloc_pool_huge_page+0xab/0x2d0 __nr_hugepages_store_common+0x37a/0xed0 ? return_unused_surplus_pages+0x330/0x330 ? __kasan_check_write+0x14/0x20 ? _raw_spin_lock_irqsave+0x99/0x100 ? proc_doulongvec_minmax+0x54/0x80 hugetlb_sysctl_handler_common+0x247/0x320 ? nr_hugepages_store+0xf0/0xf0 ? alloc_huge_page+0xbf0/0xbf0 hugetlb_sysctl_handler+0x20/0x30 proc_sys_call_handler+0x451/0x650 ? unregister_sysctl_table+0x1c0/0x1c0 ? apparmor_file_permission+0x124/0x280 ? security_file_permission+0x72/0x90 proc_sys_write+0x13/0x20 vfs_write+0x7ca/0xd80 ? kernel_write+0x4d0/0x4d0 ? do_sys_openat2+0x114/0x450 ? __kasan_check_write+0x14/0x20 ? down_write+0xb4/0x130 ksys_write+0x116/0x220 ? __kasan_check_write+0x14/0x20 ? __ia32_sys_read+0xb0/0xb0 ? debug_smp_processor_id+0x17/0x20 ? fpregs_assert_state_consistent+0x52/0xc0 __x64_sys_write+0x73/0xb0 ? syscall_exit_to_user_mode+0x26/0x50 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f83e09b2190 Code: 40 00 48 8b 15 71 9c 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 80 3d 51 24 0e 00 00 74 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 48 89 RSP: 002b:00007ffe4c08e478 EFLAGS: 00000202 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 00007ffe4c08e648 RCX: 00007f83e09b2190 RDX: 0000000000000006 RSI: 000055d7575611f8 RDI: 0000000000000003 RBP: 00007ffe4c08e4c0 R08: 00007f83e0a8cc60 R09: 0000000000000000 R10: 00007f83e08d40b8 R11: 0000000000000202 R12: 0000000000000000 R13: 00007ffe4c08e658 R14: 000055d757562df0 R15: 00007f83e0adc020 </TASK>