On 08.08.24 23:34, Pasha Tatashin wrote:
Fix invalid access to pgdat during hot-remove operation:
ndctl users reported a GPF when trying to destroy a namespace:
$ ndctl destroy-namespace all -r all -f
Segmentation fault
dmesg:
Oops: general protection fault, probably for
non-canonical address 0xdffffc0000005650: 0000 [#1] PREEMPT SMP KASAN
PTI
KASAN: probably user-memory-access in range
[0x000000000002b280-0x000000000002b287]
CPU: 26 UID: 0 PID: 1868 Comm: ndctl Not tainted 6.11.0-rc1 #1
Hardware name: Dell Inc. PowerEdge R640/08HT8T, BIOS
2.20.1 09/13/2023
RIP: 0010:mod_node_page_state+0x2a/0x110
cxl-test users report a GPF when trying to unload the test module:
$ modrpobe -r cxl-test
dmesg
BUG: unable to handle page fault for address: 0000000000004200
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 0 UID: 0 PID: 1076 Comm: modprobe Tainted: G O N 6.11.0-rc1 #197
Tainted: [O]=OOT_MODULE, [N]=TEST
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/15
RIP: 0010:mod_node_page_state+0x6/0x90
Currently, when memory is hot-plugged or hot-removed the accounting is
done based on the assumption that memmap is allocated from the same node
as the hot-plugged/hot-removed memory, which is not always the case.
In addition, there are challenges with keeping the node id of the memory
that is being remove to the time when memmap accounting is actually
performed: since this is done after remove_pfn_range_from_zone(), and
also after remove_memory_block_devices(). Meaning that we cannot use
pgdat nor walking though memblocks to get the nid.
Given all of that, account the memmap overhead system wide instead.
For this we are going to be using global atomic counters, but given that
memmap size is rarely modified, and normally is only modified either
during early boot when there is only one CPU, or under a hotplug global
mutex lock, therefore there is no need for per-cpu optimizations.
Also, while we are here rename nr_memmap to nr_memmap_pages, and
nr_memmap_boot to nr_memmap_boot_pages to be self explanatory that the
units are in page count.
Reported-by: Yi Zhang <yi.zhang@xxxxxxxxxx>
Closes: https://lore.kernel.org/linux-cxl/CAHj4cs9Ax1=CoJkgBGP_+sNu6-6=6v=_L-ZBZY0bVLD3wUWZQg@xxxxxxxxxxxxxx
Reported-by: Alison Schofield <alison.schofield@xxxxxxxxx>
Closes: https://lore.kernel.org/linux-mm/Zq0tPd2h6alFz8XF@aschofie-mobl2/#t
Fixes: 15995a352474 ("mm: report per-page metadata information")
Signed-off-by: Pasha Tatashin <pasha.tatashin@xxxxxxxxxx>
Tested-by: Dan Williams <dan.j.williams@xxxxxxxxx>
---
[...]
In general
Acked-by: David Hildenbrand <david@xxxxxxxxxx>
Two nits below:
static void free_map_bootmem(struct page *memmap)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 6f8aa4766f16..ad82c1bf0e63 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1033,6 +1033,23 @@ unsigned long node_page_state(struct pglist_data *pgdat,
}
#endif
+/*
+ * Count number of pages "struct page" and "struct page_ext" consume.
+ * nr_memmap_boot: # of pages allocated by boot allocator & not part of MemTotal
+ * nr_memmap: # of pages that were allocated by buddy allocator
+ */
+static atomic_long_t nr_memmap_boot, nr_memmap;
I *think* the clean and portable way to do it is use ATOMIC_INIT(0) for
both. [even though what you have likely works on all archs]
+
+void mod_memmap_boot(long delta)
+{
+ atomic_long_add(delta, &nr_memmap_boot);
+}
+
+void mod_memmap(long delta)
+{
+ atomic_long_add(delta, &nr_memmap);
+}
+
Nit picking: (up to you)
I'd do it similar to totalram_pages_add():
memmap_pages_add()
memmap_boot_pages_add()
And call the variables something like
static atomic_long_t memmap_pages_boot, memmap_pages;
--
Cheers,
David / dhildenb