On Tue, 2023-07-11 at 17:21 +0200, David Hildenbrand wrote: > On 11.07.23 16:30, Aneesh Kumar K.V wrote: > > David Hildenbrand <david@xxxxxxxxxx> writes: > > > > > > Maybe the better alternative is teach > > > add_memory_resource()/try_remove_memory() to do that internally. > > > > > > In the add_memory_resource() case, it might be a loop around that > > > memmap_on_memory + arch_add_memory code path (well, and the error path > > > also needs adjustment): > > > > > > /* > > > * Self hosted memmap array > > > */ > > > if (mhp_flags & MHP_MEMMAP_ON_MEMORY) { > > > if (!mhp_supports_memmap_on_memory(size)) { > > > ret = -EINVAL; > > > goto error; > > > } > > > mhp_altmap.free = PHYS_PFN(size); > > > mhp_altmap.base_pfn = PHYS_PFN(start); > > > params.altmap = &mhp_altmap; > > > } > > > > > > /* call arch's memory hotadd */ > > > ret = arch_add_memory(nid, start, size, ¶ms); > > > if (ret < 0) > > > goto error; > > > > > > > > > Note that we want to handle that on a per memory-block basis, because we > > > don't want the vmemmap of memory block #2 to end up on memory block #1. > > > It all gets messy with memory onlining/offlining etc otherwise ... > > > > > > > I tried to implement this inside add_memory_driver_managed() and also > > within dax/kmem. IMHO doing the error handling inside dax/kmem is > > better. Here is how it looks: > > > > 1. If any blocks got added before (mapped > 0) we loop through all successful request_mem_regions > > 2. For each succesful request_mem_regions if any blocks got added, we > > keep the resource. If none got added, we will kfree the resource > > > > Doing this unconditional splitting outside of > add_memory_driver_managed() is undesirable for at least two reasons > > 1) You end up always creating individual entries in the resource tree > (/proc/iomem) even if MHP_MEMMAP_ON_MEMORY is not effective. > 2) As we call arch_add_memory() in memory block granularity (e.g., 128 > MiB on x86), we might not make use of large PUDs (e.g., 1 GiB) in the > identify mapping -- even if MHP_MEMMAP_ON_MEMORY is not effective. > > While you could sense for support and do the split based on that, it > will be beneficial for other users (especially DIMMs) if we do that > internally -- where we already know if MHP_MEMMAP_ON_MEMORY can be > effective or not. I'm taking a shot at implementing the splitting internally in memory_hotplug.c. The caller (kmem) side does become trivial with this approach, but there's a slight complication if I don't have the module param override (patch 1 of this series). The kmem diff now looks like: diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index 898ca9505754..8be932f63f90 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) data->mgid = rc; for (i = 0; i < dev_dax->nr_range; i++) { + mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY | + MHP_SPLIT_MEMBLOCKS; struct resource *res; struct range range; @@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) * this as RAM automatically. */ rc = add_memory_driver_managed(data->mgid, range.start, - range_len(&range), kmem_name, MHP_NID_IS_MGID); + range_len(&range), kmem_name, mhp_flags); if (rc) { dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n", However this begins to fail if the memmap_on_memory modparam is not set, as add_memory_driver_managed EINVALs from the mhp_supports_memmap_on_memory() check. The way to work around this would probably include doing the mhp_supports_memmap_on_memory() check in kmem, in a loop to check for each memblock sized chunk, and that feels like a leak of the implementation details into the caller. Any suggestions on how to go about this? > > In general, we avoid placing important kernel data-structures on slow > memory. That's one of the reasons why PMEM decided to mostly always use > ZONE_MOVABLE such that exactly what this patch does would not happen. So > I'm wondering if there would be demand for an additional toggle. > > Because even with memmap_on_memory enabled in general, you might not > want to do that for dax/kmem. > > IMHO, this patch should be dropped from your ppc64 series, as it's an > independent change that might be valuable for other architectures as well. > Sure thing, I can pick this back up and Aneesh can drop this from his set.