On 11.07.23 16:30, Aneesh Kumar K.V wrote:
David Hildenbrand <david@xxxxxxxxxx> writes:
On 16.06.23 00:00, Vishal Verma wrote:
With DAX memory regions originating from CXL memory expanders or
NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory
on a system without enough 'regular' main memory to support the memmap
for it. To avoid this, ensure that all kmem managed hotplugged memory is
added with the MHP_MEMMAP_ON_MEMORY flag to place the memmap on the
new memory region being hot added.
To do this, call add_memory() in chunks of memory_block_size_bytes() as
that is a requirement for memmap_on_memory. Additionally, use the
mhp_flag to force the memmap_on_memory checks regardless of the
respective module parameter setting.
Cc: "Rafael J. Wysocki" <rafael@xxxxxxxxxx>
Cc: Len Brown <lenb@xxxxxxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: David Hildenbrand <david@xxxxxxxxxx>
Cc: Oscar Salvador <osalvador@xxxxxxx>
Cc: Dan Williams <dan.j.williams@xxxxxxxxx>
Cc: Dave Jiang <dave.jiang@xxxxxxxxx>
Cc: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
Cc: Huang Ying <ying.huang@xxxxxxxxx>
Signed-off-by: Vishal Verma <vishal.l.verma@xxxxxxxxx>
---
drivers/dax/kmem.c | 49 ++++++++++++++++++++++++++++++++++++-------------
1 file changed, 36 insertions(+), 13 deletions(-)
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 7b36db6f1cbd..0751346193ef 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -12,6 +12,7 @@
#include <linux/mm.h>
#include <linux/mman.h>
#include <linux/memory-tiers.h>
+#include <linux/memory_hotplug.h>
#include "dax-private.h"
#include "bus.h"
@@ -105,6 +106,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
data->mgid = rc;
for (i = 0; i < dev_dax->nr_range; i++) {
+ u64 cur_start, cur_len, remaining;
struct resource *res;
struct range range;
@@ -137,21 +139,42 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
res->flags = IORESOURCE_SYSTEM_RAM;
/*
- * Ensure that future kexec'd kernels will not treat
- * this as RAM automatically.
+ * Add memory in chunks of memory_block_size_bytes() so that
+ * it is considered for MHP_MEMMAP_ON_MEMORY
+ * @range has already been aligned to memory_block_size_bytes(),
+ * so the following loop will always break it down cleanly.
*/
- rc = add_memory_driver_managed(data->mgid, range.start,
- range_len(&range), kmem_name, MHP_NID_IS_MGID);
+ cur_start = range.start;
+ cur_len = memory_block_size_bytes();
+ remaining = range_len(&range);
+ while (remaining) {
+ mhp_t mhp_flags = MHP_NID_IS_MGID;
- if (rc) {
- dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
- i, range.start, range.end);
- remove_resource(res);
- kfree(res);
- data->res[i] = NULL;
- if (mapped)
- continue;
- goto err_request_mem;
+ if (mhp_supports_memmap_on_memory(cur_len,
+ MHP_MEMMAP_ON_MEMORY))
+ mhp_flags |= MHP_MEMMAP_ON_MEMORY;
+ /*
+ * Ensure that future kexec'd kernels will not treat
+ * this as RAM automatically.
+ */
+ rc = add_memory_driver_managed(data->mgid, cur_start,
+ cur_len, kmem_name,
+ mhp_flags);
+
+ if (rc) {
+ dev_warn(dev,
+ "mapping%d: %#llx-%#llx memory add failed\n",
+ i, cur_start, cur_start + cur_len - 1);
+ remove_resource(res);
+ kfree(res);
+ data->res[i] = NULL;
+ if (mapped)
+ continue;
+ goto err_request_mem;
+ }
+
+ cur_start += cur_len;
+ remaining -= cur_len;
}
mapped++;
}
Maybe the better alternative is to teach
add_memory_resource()/try_remove_memory() to do that internally.
In the add_memory_resource() case, it might be a loop around that
memmap_on_memory + arch_add_memory code path (well, and the error path
also needs adjustment):
	/*
	 * Self hosted memmap array
	 */
	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
		if (!mhp_supports_memmap_on_memory(size)) {
			ret = -EINVAL;
			goto error;
		}
		mhp_altmap.free = PHYS_PFN(size);
		mhp_altmap.base_pfn = PHYS_PFN(start);
		params.altmap = &mhp_altmap;
	}

	/* call arch's memory hotadd */
	ret = arch_add_memory(nid, start, size, &params);
	if (ret < 0)
		goto error;
Note that we want to handle that on a per-memory-block basis, because we
don't want the vmemmap of memory block #2 to end up on memory block #1.
It all gets messy with memory onlining/offlining etc. otherwise ...
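To illustrate, here is a rough sketch (not actual code) of what such a
per-memory-block loop in add_memory_resource() could look like, reusing
the mhp_altmap/params setup quoted above; memory block device creation
and the unwinding of already-added blocks on error are elided:

	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
		u64 cur;

		if (!mhp_supports_memmap_on_memory(memory_block_size_bytes())) {
			ret = -EINVAL;
			goto error;
		}

		/*
		 * One altmap per memory block, so each block's vmemmap is
		 * placed on that block itself and not on an earlier one.
		 */
		for (cur = start; cur < start + size;
		     cur += memory_block_size_bytes()) {
			mhp_altmap.free = PHYS_PFN(memory_block_size_bytes());
			mhp_altmap.base_pfn = PHYS_PFN(cur);
			params.altmap = &mhp_altmap;

			ret = arch_add_memory(nid, cur,
					      memory_block_size_bytes(), &params);
			if (ret < 0)
				goto error;	/* would need per-block unwinding */
		}
	} else {
		/* call arch's memory hotadd */
		ret = arch_add_memory(nid, start, size, &params);
		if (ret < 0)
			goto error;
	}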
I tried to implement this inside add_memory_driver_managed() and also
within dax/kmem. IMHO doing the error handling inside dax/kmem is
better. Here is how it looks:
1. If any blocks got added before (mapped > 0), we loop through all
successful request_mem_region() ranges.
2. For each successful request_mem_region() range, if any blocks got
added, we keep the resource. If none got added, we kfree the resource
(see the sketch below).
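Roughly, the range cleanup then looks like this (sketch only;
'blocks_added[i]' is a hypothetical per-range count of memory blocks
that were added successfully before the failure):

	for (i = 0; i < dev_dax->nr_range; i++) {
		struct resource *res = data->res[i];

		if (!res)
			continue;
		if (blocks_added[i]) {
			/* Part of this range is system RAM now: keep the resource. */
			continue;
		}
		/* No block from this range was added: release and free it. */
		remove_resource(res);
		kfree(res);
		data->res[i] = NULL;
	}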
Doing this unconditional splitting outside of
add_memory_driver_managed() is undesirable for at least two reasons:
1) You end up always creating individual entries in the resource tree
(/proc/iomem) even if MHP_MEMMAP_ON_MEMORY is not effective.
2) As we call arch_add_memory() at memory block granularity (e.g., 128
MiB on x86), we might not make use of large PUDs (e.g., 1 GiB) in the
identity mapping -- even if MHP_MEMMAP_ON_MEMORY is not effective.
While you could sense for support and do the split based on that, it
would be beneficial for other users (especially DIMMs) if we did that
internally -- where we already know whether MHP_MEMMAP_ON_MEMORY can be
effective or not.
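For comparison, the "sense for support and split" variant open-coded in
dax/kmem would look roughly like this (sketch only, using the
mhp_supports_memmap_on_memory() signature quoted above; error handling
elided). Every driver would have to repeat this dance, whereas
add_memory_resource() already knows reliably whether
MHP_MEMMAP_ON_MEMORY can be effective:

	u64 cur_start = range.start;
	u64 remaining = range_len(&range);
	u64 chunk = remaining;
	mhp_t mhp_flags = MHP_NID_IS_MGID;

	if (mhp_supports_memmap_on_memory(memory_block_size_bytes())) {
		/* Only split when the memmap can actually live on the new memory. */
		chunk = memory_block_size_bytes();
		mhp_flags |= MHP_MEMMAP_ON_MEMORY;
	}

	while (remaining) {
		rc = add_memory_driver_managed(data->mgid, cur_start, chunk,
					       kmem_name, mhp_flags);
		if (rc)
			break;	/* error handling elided */
		cur_start += chunk;
		remaining -= chunk;
	}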
In general, we avoid placing important kernel data structures on slow
memory. That's one of the reasons why we decided to mostly always use
ZONE_MOVABLE for PMEM, so that exactly what this patch does would not
happen. So I'm wondering if there would be demand for an additional
toggle, because even with memmap_on_memory enabled in general, you
might not want that behavior for dax/kmem.
IMHO, this patch should be dropped from your ppc64 series, as it's an
independent change that might be valuable for other architectures as well.
--
Cheers,
David / dhildenb