On Fri, Aug 9, 2024 at 3:31 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 08.08.24 23:34, Pasha Tatashin wrote:
> > Fix invalid access to pgdat during hot-remove operation:
> > ndctl users reported a GPF when trying to destroy a namespace:
> > $ ndctl destroy-namespace all -r all -f
> > Segmentation fault
> > dmesg:
> > Oops: general protection fault, probably for
> > non-canonical address 0xdffffc0000005650: 0000 [#1] PREEMPT SMP KASAN
> > PTI
> > KASAN: probably user-memory-access in range
> > [0x000000000002b280-0x000000000002b287]
> > CPU: 26 UID: 0 PID: 1868 Comm: ndctl Not tainted 6.11.0-rc1 #1
> > Hardware name: Dell Inc. PowerEdge R640/08HT8T, BIOS
> > 2.20.1 09/13/2023
> > RIP: 0010:mod_node_page_state+0x2a/0x110
> >
> > cxl-test users report a GPF when trying to unload the test module:
> > $ modprobe -r cxl-test
> > dmesg
> > BUG: unable to handle page fault for address: 0000000000004200
> > #PF: supervisor read access in kernel mode
> > #PF: error_code(0x0000) - not-present page
> > PGD 0 P4D 0
> > Oops: Oops: 0000 [#1] PREEMPT SMP PTI
> > CPU: 0 UID: 0 PID: 1076 Comm: modprobe Tainted: G O N 6.11.0-rc1 #197
> > Tainted: [O]=OOT_MODULE, [N]=TEST
> > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/15
> > RIP: 0010:mod_node_page_state+0x6/0x90
> >
> > Currently, when memory is hot-plugged or hot-removed the accounting is
> > done based on the assumption that memmap is allocated from the same node
> > as the hot-plugged/hot-removed memory, which is not always the case.
> >
> > In addition, there are challenges with keeping the node id of the memory
> > that is being removed until the time when memmap accounting is actually
> > performed: this is done after remove_pfn_range_from_zone(), and
> > also after remove_memory_block_devices(). This means that we can use
> > neither pgdat nor a walk through memblocks to get the nid.
> >
> > Given all of that, account the memmap overhead system wide instead.
> >
> > For this we are going to be using global atomic counters, but given that
> > memmap size is rarely modified, and normally is only modified either
> > during early boot when there is only one CPU, or under a hotplug global
> > mutex lock, there is no need for per-cpu optimizations.
> >
> > Also, while we are here rename nr_memmap to nr_memmap_pages, and
> > nr_memmap_boot to nr_memmap_boot_pages to make it self-explanatory that the
> > units are in page count.
> >
> > Reported-by: Yi Zhang <yi.zhang@xxxxxxxxxx>
> > Closes: https://lore.kernel.org/linux-cxl/CAHj4cs9Ax1=CoJkgBGP_+sNu6-6=6v=_L-ZBZY0bVLD3wUWZQg@xxxxxxxxxxxxxx
> > Reported-by: Alison Schofield <alison.schofield@xxxxxxxxx>
> > Closes: https://lore.kernel.org/linux-mm/Zq0tPd2h6alFz8XF@aschofie-mobl2/#t
> >
> > Fixes: 15995a352474 ("mm: report per-page metadata information")
> > Signed-off-by: Pasha Tatashin <pasha.tatashin@xxxxxxxxxx>
> > Tested-by: Dan Williams <dan.j.williams@xxxxxxxxx>
> > ---
>
> [...]
>
> In general
>
> Acked-by: David Hildenbrand <david@xxxxxxxxxx>
>
> Two nits below:
>
> >
> >  static void free_map_bootmem(struct page *memmap)
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 6f8aa4766f16..ad82c1bf0e63 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -1033,6 +1033,23 @@ unsigned long node_page_state(struct pglist_data *pgdat,
> >  }
> >  #endif
> >
> > +/*
> > + * Count number of pages "struct page" and "struct page_ext" consume.
> > + * nr_memmap_boot: # of pages allocated by boot allocator & not part of MemTotal
> > + * nr_memmap: # of pages that were allocated by buddy allocator
> > + */
> > +static atomic_long_t nr_memmap_boot, nr_memmap;
>
> I *think* the clean and portable way to do it is use ATOMIC_INIT(0) for
> both.
> [even though what you have likely works on all archs]

Yeah, it is not necessary, but I will add ATOMIC_LONG_INIT(0).

>
> > +
> > +void mod_memmap_boot(long delta)
> > +{
> > +	atomic_long_add(delta, &nr_memmap_boot);
> > +}
> > +
> > +void mod_memmap(long delta)
> > +{
> > +	atomic_long_add(delta, &nr_memmap);
> > +}
> > +
>
> Nit picking: (up to you)
>
> I'd do it similar to totalram_pages_add():
>
> memmap_pages_add()
> memmap_boot_pages_add()
>
> And call the variables something like
>
> static atomic_long_t memmap_pages_boot, memmap_pages;

Sure, I will rename them.

Pasha