On 08.09.20 17:33, Joao Martins wrote: > [Sorry for the late response] > > On 8/21/20 11:06 AM, David Hildenbrand wrote: >> On 03.08.20 07:03, Dan Williams wrote: >>> @@ -37,109 +45,94 @@ int dev_dax_kmem_probe(struct device *dev) >>> * could be mixed in a node with faster memory, causing >>> * unavoidable performance issues. >>> */ >>> - numa_node = dev_dax->target_node; >>> if (numa_node < 0) { >>> dev_warn(dev, "rejecting DAX region with invalid node: %d\n", >>> numa_node); >>> return -EINVAL; >>> } >>> >>> - /* Hotplug starting at the beginning of the next block: */ >>> - kmem_start = ALIGN(range->start, memory_block_size_bytes()); >>> - >>> - kmem_size = range_len(range); >>> - /* Adjust the size down to compensate for moving up kmem_start: */ >>> - kmem_size -= kmem_start - range->start; >>> - /* Align the size down to cover only complete blocks: */ >>> - kmem_size &= ~(memory_block_size_bytes() - 1); >>> - kmem_end = kmem_start + kmem_size; >>> - >>> - new_res_name = kstrdup(dev_name(dev), GFP_KERNEL); >>> - if (!new_res_name) >>> + res_name = kstrdup(dev_name(dev), GFP_KERNEL); >>> + if (!res_name) >>> return -ENOMEM; >>> >>> - /* Region is permanently reserved if hotremove fails. */ >>> - new_res = request_mem_region(kmem_start, kmem_size, new_res_name); >>> - if (!new_res) { >>> - dev_warn(dev, "could not reserve region [%pa-%pa]\n", >>> - &kmem_start, &kmem_end); >>> - kfree(new_res_name); >>> + res = request_mem_region(range.start, range_len(&range), res_name); >> >> I think our range could be empty after aligning. I assume >> request_mem_region() would check that, but maybe we could report a >> better error/warning in that case. >> > dax_kmem_range() already returns a memory-block-aligned @range but > IIUC request_mem_region() isn't checking for that. Having said that > the returned @res wouldn't be different from the passed range.start. > >>> /* >>> * Ensure that future kexec'd kernels will not treat this as RAM >>> * automatically. >>> */ >>> - rc = add_memory_driver_managed(numa_node, new_res->start, >>> - resource_size(new_res), kmem_name); >>> + rc = add_memory_driver_managed(numa_node, res->start, >>> + resource_size(res), kmem_name); >>> + >>> + res->flags |= IORESOURCE_BUSY; >> >> Hm, I don't think that's correct. Any specific reason why to mark the >> not-added, unaligned parts BUSY? E.g., walk_system_ram_range() could >> suddenly stumble over it - and e.g., similarly kexec code when trying to >> find memory for placing kexec images. I think we should leave this >> !BUSY, just as it is right now. >> > Agreed. > >>> if (rc) { >>> - release_resource(new_res); >>> - kfree(new_res); >>> - kfree(new_res_name); >>> + release_mem_region(range.start, range_len(&range)); >>> + kfree(res_name); >>> return rc; >>> } >>> - dev_dax->dax_kmem_res = new_res; >>> + >>> + dev_set_drvdata(dev, res_name); >>> >>> return 0; >>> } >>> >>> #ifdef CONFIG_MEMORY_HOTREMOVE >>> -static int dev_dax_kmem_remove(struct device *dev) >>> +static void dax_kmem_release(struct dev_dax *dev_dax) >>> { >>> - struct dev_dax *dev_dax = to_dev_dax(dev); >>> - struct resource *res = dev_dax->dax_kmem_res; >>> - resource_size_t kmem_start = res->start; >>> - resource_size_t kmem_size = resource_size(res); >>> - const char *res_name = res->name; >>> int rc; >>> + struct device *dev = &dev_dax->dev; >>> + const char *res_name = dev_get_drvdata(dev); >>> + struct range range = dax_kmem_range(dev_dax); >>> >>> /* >>> * We have one shot for removing memory, if some memory blocks were not >>> * offline prior to calling this function remove_memory() will fail, and >>> * there is no way to hotremove this memory until reboot because device >>> - * unbind will succeed even if we return failure. >>> + * unbind will proceed regardless of the remove_memory result. >>> */ >>> - rc = remove_memory(dev_dax->target_node, kmem_start, kmem_size); >>> - if (rc) { >>> - any_hotremove_failed = true; >>> - dev_err(dev, >>> - "DAX region %pR cannot be hotremoved until the next reboot\n", >>> - res); >>> - return rc; >>> + rc = remove_memory(dev_dax->target_node, range.start, range_len(&range)); >>> + if (rc == 0) { >> >> if (!rc) ? >> > Better off would be to keep the old order: > > if (rc) { > any_hotremove_failed = true; > dev_err(dev, "%#llx-%#llx cannot be hotremoved until the next reboot\n", > range.start, range.end); > return; > } > > release_mem_region(range.start, range_len(&range)); > dev_set_drvdata(dev, NULL); > kfree(res_name); > return; > > >>> + release_mem_region(range.start, range_len(&range)); >> >> remove_memory() does a release_mem_region_adjustable(). Don't you >> actually want to release the *unaligned* region you requested? >> > Isn't it what we're doing here? > (The release_mem_region_adjustable() is using the same > dax_kmem-aligned range and there's no split/adjust) > > Meaning right now (+ parent marked as !BUSY), and if I am understanding > this correctly: > > request_mem_region(range.start, range_len) > __request_region(iomem_res, range.start, range_len) -> alloc @parent > add_memory_driver_managed(parent.start, resource_size(parent)) > __request_region(parent.start, resource_size(parent)) -> alloc @child > > [...] > > remove_memory(range.start, range_len) > request_mem_region_adjustable(range.start, range_len) > __release_region(range.start, range_len) -> remove @child > > release_mem_region(range.start, range_len) > __release_region(range.start, range_len) -> doesn't remove @parent because !BUSY? > > The add/removal of this relies on !BUSY. But now I am wondering if the parent remaining > unreleased is deliberate even on CONFIG_MEMORY_HOTREMOVE=y. > > Joao > Thinking about it, if we don't set the parent resource BUSY (which is what I think is the right way of doing things), and don't want to store the parent resource pointer, we could add something like lookup_resource() - e.g., lookup_mem_resource() - , however, searching properly in the whole hierarchy (instead of only the first level), and traversing down to the last hierarchy. Then it would be as simple as remove_memory(range.start, range_len) res = lookup_mem_resource(range.start); release_resource(res); -- Thanks, David / dhildenb