Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory

David Hildenbrand <david@xxxxxxxxxx> · Thu, 13 Jul 2023 09:23:22 +0200

On 13.07.23 08:45, Verma, Vishal L wrote:
On Tue, 2023-07-11 at 17:21 +0200, David Hildenbrand wrote:
On 11.07.23 16:30, Aneesh Kumar K.V wrote:
David Hildenbrand <david@xxxxxxxxxx> writes:

Maybe the better alternative is teach
add_memory_resource()/try_remove_memory() to do that internally.

In the add_memory_resource() case, it might be a loop around that
memmap_on_memory + arch_add_memory code path (well, and the error path
also needs adjustment):

         /*
          * Self hosted memmap array
          */
         if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
                 if (!mhp_supports_memmap_on_memory(size)) {
                         ret = -EINVAL;
                         goto error;
                 }
                 mhp_altmap.free = PHYS_PFN(size);
                 mhp_altmap.base_pfn = PHYS_PFN(start);
                 params.altmap = &mhp_altmap;
         }

         /* call arch's memory hotadd */
         ret = arch_add_memory(nid, start, size, &params);
         if (ret < 0)
                 goto error;

Note that we want to handle that on a per memory-block basis, because we
don't want the vmemmap of memory block #2 to end up on memory block #1.
It all gets messy with memory onlining/offlining etc otherwise ...

I tried to implement this inside add_memory_driver_managed() and also
within dax/kmem. IMHO doing the error handling inside dax/kmem is
better. Here is how it looks:

1. If any blocks got added before (mapped > 0) we loop through all successful request_mem_regions
2. For each succesful request_mem_regions if any blocks got added, we
keep the resource. If none got added, we will kfree the resource

Doing this unconditional splitting outside of
add_memory_driver_managed() is undesirable for at least two reasons

1) You end up always creating individual entries in the resource tree
     (/proc/iomem) even if MHP_MEMMAP_ON_MEMORY is not effective.
2) As we call arch_add_memory() in memory block granularity (e.g., 128
     MiB on x86), we might not make use of large PUDs (e.g., 1 GiB) in the
     identify mapping -- even if MHP_MEMMAP_ON_MEMORY is not effective.

While you could sense for support and do the split based on that, it
will be beneficial for other users (especially DIMMs) if we do that
internally -- where we already know if MHP_MEMMAP_ON_MEMORY can be
effective or not.

I'm taking a shot at implementing the splitting internally in
memory_hotplug.c. The caller (kmem) side does become trivial with this
approach, but there's a slight complication if I don't have the module
param override (patch 1 of this series).

The kmem diff now looks like:

    diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
    index 898ca9505754..8be932f63f90 100644
    --- a/drivers/dax/kmem.c
    +++ b/drivers/dax/kmem.c
    @@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
            data->mgid = rc;

            for (i = 0; i < dev_dax->nr_range; i++) {
    +               mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY |
    +                                 MHP_SPLIT_MEMBLOCKS;
                    struct resource *res;
                    struct range range;

    @@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
                     * this as RAM automatically.
                     */
                    rc = add_memory_driver_managed(data->mgid, range.start,
    -                               range_len(&range), kmem_name, MHP_NID_IS_MGID);
    +                               range_len(&range), kmem_name, mhp_flags);

                    if (rc) {
                            dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",

Why do we need the MHP_SPLIT_MEMBLOCKS?

In add_memory_driver_managed(), if memmap_on_memory = 1 AND is effective for a
single memory block, you can simply split up internally, no?

Essentially in add_memory_resource() something like

if (mhp_flags & MHP_MEMMAP_ON_MEMORY &&
    mhp_supports_memmap_on_memory(memory_block_size_bytes())) {
	for (cur_start = start, cur_start < start + size;
	     cur_start += memory_block_size_bytes()) {
		mhp_altmap.free = PHYS_PFN(memory_block_size_bytes());
		mhp_altmap.base_pfn = PHYS_PFN(start);
		params.altmap = &mhp_altmap;

		ret = arch_add_memory(nid, start,
				      memory_block_size_bytes(), &params);
		if (ret < 0) ...

		ret = create_memory_block_devices(start, memory_block_size_bytes(),
						  mhp_altmap.alloc, group);
		if (ret) ...

	}
} else {
	/* old boring stuff */
}

Of course, doing it a bit cleaner, factoring out adding of mem+creating devices into
a helper so we can use it on the other path, avoiding repeating memory_block_size_bytes()
...

If any adding of memory failed, we remove what we already added. That works, because as
long as we're holding the relevant locks, memory cannot get onlined in the meantime.

Then we only have to teach remove_memory() to deal with individual blocks when finding
blocks that have an altmap.

--
Cheers,

David / dhildenb