On Wed, 2020-12-09 at 23:43 +0200, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@xxxxxxxxxxxxx>
>
> There could be struct pages that are not backed by actual physical memory.
> This can happen when the actual memory bank is not a multiple of
> SECTION_SIZE or when an architecture does not register memory holes
> reserved by the firmware as memblock.memory.
>
> Such pages are currently initialized using init_unavailable_mem() function
> that iterated through PFNs in holes in memblock.memory and if there is a
> struct page corresponding to a PFN, the fields of this page are set to
> default values and it is marked as Reserved.
>
> init_unavailable_mem() does not take into account zone and node the page
> belongs to and sets both zone and node links in struct page to zero.
>
> On a system that has firmware reserved holes in a zone above ZONE_DMA, for
> instance in a configuration below:
>
>	# grep -A1 E820 /proc/iomem
>	7a17b000-7a216fff : Unknown E820 type
>	7a217000-7bffffff : System RAM
>
> unset zone link in struct page will trigger
>
>	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
>
> because there are pages in both ZONE_DMA32 and ZONE_DMA (unset zone link in
> struct page) in the same pageblock.
>
> Interleave initialization of pages that correspond to holes with the
> initialization of memory map, so that zone and node information will be
> properly set on such pages.
>
> Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
> Reported-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> Signed-off-by: Mike Rapoport <rppt@xxxxxxxxxxxxx>

Reverting this commit on top of today's linux-next fixed a crash while
reading /proc/kpagecount on a NUMA server.

[ 8858.006726][T99897] BUG: unable to handle page fault for address: fffffffffffffffe
[ 8858.014814][T99897] #PF: supervisor read access in kernel mode
[ 8858.020686][T99897] #PF: error_code(0x0000) - not-present page
[ 8858.026557][T99897] PGD 1371417067 P4D 1371417067 PUD 1371419067 PMD 0
[ 8858.033224][T99897] Oops: 0000 [#1] SMP KASAN NOPTI
[ 8858.038710][T99897] CPU: 28 PID: 99897 Comm: proc01 Tainted: G O 5.11.0-rc1-next-20210104 #1
[ 8858.048515][T99897] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 03/09/2018
[ 8858.057794][T99897] RIP: 0010:kpagecount_read+0x1be/0x5e0
PageSlab at include/linux/page-flags.h:342
(inlined by) kpagecount_read at fs/proc/page.c:69
[ 8858.063717][T99897] Code: 3c 30 00 0f 85 29 03 00 00 48 8b 53 08 48 8d 42 ff 83 e2 01 48 0f 44 c3 48 89 c2 48 c1 ea 03 42 80 3c 32 00 0f 85 e7 02 00 00 <48> 83 38 ff 0f 84 f3 01 00 00 48 89 c8 48 c1 e8 03 42 80 3c 30 00
[ 8858.083303][T99897] RSP: 0018:ffffc9002159fdd0 EFLAGS: 00010246
[ 8858.089637][T99897] RAX: fffffffffffffffe RBX: ffffea0011fce000 RCX: ffffea0011fce008
[ 8858.097518][T99897] RDX: 1fffffffffffffff RSI: 000000000064d7c0 RDI: ffffffff951f91c8
[ 8858.105396][T99897] RBP: 000000000064d7c0 R08: ffffed129063f402 R09: ffffed129063f402
[ 8858.113760][T99897] R10: ffff8894831fa00b R11: ffffed129063f401 R12: 000000000047f380
[ 8858.121639][T99897] R13: 0000000000000400 R14: dffffc0000000000 R15: 000000000064d7c0
[ 8858.129517][T99897] FS: 00007fd18849d040(0000) GS:ffff88a02fc00000(0000) knlGS:0000000000000000
[ 8858.138886][T99897] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8858.145369][T99897] CR2: fffffffffffffffe CR3: 0000001c8b5d0000 CR4: 00000000003506e0
[ 8858.153247][T99897] Call Trace:
[ 8858.156415][T99897] proc_reg_read+0x1a6/0x240
[ 8858.161345][T99897] vfs_read+0x175/0x440
[ 8858.165383][T99897] ksys_read+0xf1/0x1c0
[ 8858.169420][T99897] ? vfs_write+0x870/0x870
[ 8858.173719][T99897] ? task_work_run+0xeb/0x170
[ 8858.178284][T99897] ? syscall_enter_from_user_mode+0x1c/0x40
[ 8858.184073][T99897] do_syscall_64+0x33/0x40
[ 8858.188863][T99897] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 8858.194652][T99897] RIP: 0033:0x7fd187da1d5d
[ 8858.198952][T99897] Code: 31 11 2b 00 31 c9 64 83 3e 0b 75 ca eb b8 e8 ca fb ff ff 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 39 ca 77 2b 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 0b c3 66 2e 0f 1f 84 00 00 00 00 00 48 8b 15
[ 8858.218978][T99897] RSP: 002b:00007ffe733de1f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 8858.227297][T99897] RAX: ffffffffffffffda RBX: 00007ffe733df370 RCX: 00007fd187da1d5d
[ 8858.235824][T99897] RDX: 0000000000000400 RSI: 000000000064d7c0 RDI: 0000000000000004
[ 8858.243739][T99897] RBP: 0000000000000400 R08: 00000000018fbe73 R09: 00007fd187e13d40
[ 8858.251617][T99897] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000023f9c00
[ 8858.259496][T99897] R13: 0000000000000004 R14: 000000000044663c R15: 0000000000000000
[ 8858.267856][T99897] Modules linked in: vfat fat fuse vfio_pci vfio_virqfd vfio_iommu_type1 vfio loop iavf kvm_amd ses kvm enclosure irqbypass acpi_cpufreq ip_tables x_tables sd_mod smartpqi bnxt_en scsi_transport_sas tg3 i40e firmware_class libphy dm_mirror dm_region_hash dm_log dm_mod [last unloaded: init_module]
[ 8858.296328][T99897] CR2: fffffffffffffffe
[ 8858.300365][T99897] ---[ end trace a307ff8b6e284ee0 ]---
[ 8858.305712][T99897] RIP: 0010:kpagecount_read+0x1be/0x5e0
[ 8858.311613][T99897] Code: 3c 30 00 0f 85 29 03 00 00 48 8b 53 08 48 8d 42 ff 83 e2 01 48 0f 44 c3 48 89 c2 48 c1 ea 03 42 80 3c 32 00 0f 85 e7 02 00 00 <48> 83 38 ff 0f 84 f3 01 00 00 48 89 c8 48 c1 e8 03 42 80 3c 30 00
[ 8858.331200][T99897] RSP: 0018:ffffc9002159fdd0 EFLAGS: 00010246
[ 8858.337573][T99897] RAX: fffffffffffffffe RBX: ffffea0011fce000 RCX: ffffea0011fce008
[ 8858.345454][T99897] RDX: 1fffffffffffffff RSI: 000000000064d7c0 RDI: ffffffff951f91c8
[ 8858.353333][T99897] RBP: 000000000064d7c0 R08: ffffed129063f402 R09: ffffed129063f402
[ 8858.361618][T99897] R10: ffff8894831fa00b R11: ffffed129063f401 R12: 000000000047f380
[ 8858.369497][T99897] R13: 0000000000000400 R14: dffffc0000000000 R15: 000000000064d7c0
[ 8858.377377][T99897] FS: 00007fd18849d040(0000) GS:ffff88a02fc00000(0000) knlGS:0000000000000000
[ 8858.386696][T99897] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8858.393177][T99897] CR2: fffffffffffffffe CR3: 0000001c8b5d0000 CR4: 00000000003506e0
[ 8858.401056][T99897] Kernel panic - not syncing: Fatal exception
[ 8858.407348][T99897] Kernel Offset: 0x12600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 8858.419260][T99897] ---[ end Kernel panic - not syncing: Fatal exception ]---

> ---
>  mm/page_alloc.c | 152 +++++++++++++++++++++--------------------------
>  1 file changed, 65 insertions(+), 87 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index dbc57dbbacd8..ea5aefef0004 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6185,24 +6185,85 @@ static void __meminit zone_init_free_lists(struct zone *zone)
>  	}
>  }
>
> -void __meminit __weak memmap_init(unsigned long size, int nid,
> -				  unsigned long zone,
> -				  unsigned long range_start_pfn)
> +#if !defined(CONFIG_FLAT_NODE_MEM_MAP)
> +/*
> + * Only struct pages that are backed by physical memory available to the
> + * kernel are zeroed and initialized by memmap_init_zone().
> + * But, there are some struct pages that are either reserved by firmware or
> + * do not correspond to physical page frames becuase the actual memory bank
> + * is not a multiple of SECTION_SIZE.
> + * Fields of those struct pages may be accessed (for example page_to_pfn()
> + * on some configuration accesses page flags) so we must explicitly
> + * initialize those struct pages.
> + */
> +static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn,
> +					 int zone, int node)
>  {
> -	unsigned long start_pfn, end_pfn;
> +	unsigned long pfn;
> +	u64 pgcnt = 0;
> +
> +	for (pfn = spfn; pfn < epfn; pfn++) {
> +		if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
> +			pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
> +				+ pageblock_nr_pages - 1;
> +			continue;
> +		}
> +		__init_single_page(pfn_to_page(pfn), pfn, zone, node);
> +		__SetPageReserved(pfn_to_page(pfn));
> +		pgcnt++;
> +	}
> +
> +	return pgcnt;
> +}
> +#else
> +static inline u64 init_unavailable_range(unsigned long spfn, unsigned long epfn,
> +					 int zone, int node)
> +{
> +	return 0;
> +}
> +#endif
> +
> +void __init __weak memmap_init(unsigned long size, int nid,
> +			       unsigned long zone,
> +			       unsigned long range_start_pfn)
> +{
> +	unsigned long start_pfn, end_pfn, hole_start_pfn = 0;
>  	unsigned long range_end_pfn = range_start_pfn + size;
> +	u64 pgcnt = 0;
>  	int i;
>
>  	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
>  		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
>  		end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
> +		hole_start_pfn = clamp(hole_start_pfn, range_start_pfn,
> +				       range_end_pfn);
>
>  		if (end_pfn > start_pfn) {
>  			size = end_pfn - start_pfn;
>  			memmap_init_zone(size, nid, zone, start_pfn,
>  					 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
>  		}
> +
> +		if (hole_start_pfn < start_pfn)
> +			pgcnt += init_unavailable_range(hole_start_pfn,
> +							start_pfn, zone, nid);
> +		hole_start_pfn = end_pfn;
>  	}
> +
> +	/*
> +	 * Early sections always have a fully populated memmap for the whole
> +	 * section - see pfn_valid(). If the last section has holes at the
> +	 * end and that section is marked "online", the memmap will be
> +	 * considered initialized. Make sure that memmap has a well defined
> +	 * state.
> +	 */
> +	if (hole_start_pfn < range_end_pfn)
> +		pgcnt += init_unavailable_range(hole_start_pfn, range_end_pfn,
> +						zone, nid);
> +
> +	if (pgcnt)
> +		pr_info("%s: Zeroed struct page in unavailable ranges: %lld\n",
> +			zone_names[zone], pgcnt);
>  }
>
>  static int zone_batchsize(struct zone *zone)
> @@ -6995,88 +7056,6 @@ void __init free_area_init_memoryless_node(int nid)
>  	free_area_init_node(nid);
>  }
>
> -#if !defined(CONFIG_FLAT_NODE_MEM_MAP)
> -/*
> - * Initialize all valid struct pages in the range [spfn, epfn) and mark them
> - * PageReserved(). Return the number of struct pages that were initialized.
> - */
> -static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn)
> -{
> -	unsigned long pfn;
> -	u64 pgcnt = 0;
> -
> -	for (pfn = spfn; pfn < epfn; pfn++) {
> -		if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
> -			pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
> -				+ pageblock_nr_pages - 1;
> -			continue;
> -		}
> -		/*
> -		 * Use a fake node/zone (0) for now. Some of these pages
> -		 * (in memblock.reserved but not in memblock.memory) will
> -		 * get re-initialized via reserve_bootmem_region() later.
> -		 */
> -		__init_single_page(pfn_to_page(pfn), pfn, 0, 0);
> -		__SetPageReserved(pfn_to_page(pfn));
> -		pgcnt++;
> -	}
> -
> -	return pgcnt;
> -}
> -
> -/*
> - * Only struct pages that are backed by physical memory are zeroed and
> - * initialized by going through __init_single_page(). But, there are some
> - * struct pages which are reserved in memblock allocator and their fields
> - * may be accessed (for example page_to_pfn() on some configuration accesses
> - * flags). We must explicitly initialize those struct pages.
> - *
> - * This function also addresses a similar issue where struct pages are left
> - * uninitialized because the physical address range is not covered by
> - * memblock.memory or memblock.reserved. That could happen when memblock
> - * layout is manually configured via memmap=, or when the highest physical
> - * address (max_pfn) does not end on a section boundary.
> - */
> -static void __init init_unavailable_mem(void)
> -{
> -	phys_addr_t start, end;
> -	u64 i, pgcnt;
> -	phys_addr_t next = 0;
> -
> -	/*
> -	 * Loop through unavailable ranges not covered by memblock.memory.
> -	 */
> -	pgcnt = 0;
> -	for_each_mem_range(i, &start, &end) {
> -		if (next < start)
> -			pgcnt += init_unavailable_range(PFN_DOWN(next),
> -							PFN_UP(start));
> -		next = end;
> -	}
> -
> -	/*
> -	 * Early sections always have a fully populated memmap for the whole
> -	 * section - see pfn_valid(). If the last section has holes at the
> -	 * end and that section is marked "online", the memmap will be
> -	 * considered initialized. Make sure that memmap has a well defined
> -	 * state.
> -	 */
> -	pgcnt += init_unavailable_range(PFN_DOWN(next),
> -					round_up(max_pfn, PAGES_PER_SECTION));
> -
> -	/*
> -	 * Struct pages that do not have backing memory. This could be because
> -	 * firmware is using some of this memory, or for some other reasons.
> -	 */
> -	if (pgcnt)
> -		pr_info("Zeroed struct page in unavailable ranges: %lld pages", pgcnt);
> -}
> -#else
> -static inline void __init init_unavailable_mem(void)
> -{
> -}
> -#endif /* !CONFIG_FLAT_NODE_MEM_MAP */
> -
>  #if MAX_NUMNODES > 1
>  /*
>   * Figure out the number of possible node ids.
> @@ -7507,7 +7486,6 @@ void __init free_area_init(unsigned long *max_zone_pfn)
>  	/* Initialise every node */
>  	mminit_verify_pageflags_layout();
>  	setup_nr_node_ids();
> -	init_unavailable_mem();
>  	for_each_online_node(nid) {
>  		pg_data_t *pgdat = NODE_DATA(nid);
>  		free_area_init_node(nid);
> --
> 2.28.0
>
>
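
For anyone who wants to reason about the hole walk outside the kernel, here
is a rough userspace sketch of the interleaved loop in the patched
memmap_init(). It is not the kernel code: struct pfn_range, clamp_pfn() and
count_hole_pfns() are hypothetical names made up for illustration, the ranges
are assumed to be sorted and non-overlapping, and the walk starts at the zone
start rather than at PFN 0 clamped into the zone as the patch does. It only
counts the PFNs that would be handed to init_unavailable_range().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one memblock.memory region, as PFNs: [start, end). */
struct pfn_range {
	unsigned long start;
	unsigned long end;
};

/* Clamp v into [lo, hi]. */
static unsigned long clamp_pfn(unsigned long v, unsigned long lo,
			       unsigned long hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * Count the PFNs in [zone_start, zone_end) that no range covers.
 * Assumption of this sketch: ranges are sorted by start and do not overlap.
 */
static unsigned long count_hole_pfns(const struct pfn_range *r, size_t n,
				     unsigned long zone_start,
				     unsigned long zone_end)
{
	unsigned long hole_start = zone_start;
	unsigned long holes = 0;
	size_t i;

	assert(zone_start <= zone_end);

	for (i = 0; i < n; i++) {
		unsigned long start = clamp_pfn(r[i].start, zone_start, zone_end);
		unsigned long end = clamp_pfn(r[i].end, zone_start, zone_end);

		/* Gap between the previous range and this one. */
		if (hole_start < start)
			holes += start - hole_start;
		hole_start = end;
	}

	/* Trailing gap after the last range, up to the end of the zone. */
	if (hole_start < zone_end)
		holes += zone_end - hole_start;

	return holes;
}
```

With the E820 layout quoted in the commit message and 4K pages, the "Unknown
E820 type" region 7a17b000-7a216fff is PFNs 0x7a17b through 0x7a216, i.e. 0x9c
pages. Feeding the sketch a made-up lower range that ends at PFN 0x7a17b plus
the System RAM range 0x7a217-0x7c000 reports exactly that hole.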