On Thu, Sep 24, 2020 at 12:26 AM David Hildenbrand <david@xxxxxxxxxx> wrote: > > On 23.09.20 23:41, Dan Williams wrote: > > On Wed, Sep 23, 2020 at 1:04 AM David Hildenbrand <david@xxxxxxxxxx> wrote: > >> > >> On 08.09.20 17:33, Joao Martins wrote: > >>> [Sorry for the late response] > >>> > >>> On 8/21/20 11:06 AM, David Hildenbrand wrote: > >>>> On 03.08.20 07:03, Dan Williams wrote: > >>>>> @@ -37,109 +45,94 @@ int dev_dax_kmem_probe(struct device *dev) > >>>>> * could be mixed in a node with faster memory, causing > >>>>> * unavoidable performance issues. > >>>>> */ > >>>>> - numa_node = dev_dax->target_node; > >>>>> if (numa_node < 0) { > >>>>> dev_warn(dev, "rejecting DAX region with invalid node: %d\n", > >>>>> numa_node); > >>>>> return -EINVAL; > >>>>> } > >>>>> > >>>>> - /* Hotplug starting at the beginning of the next block: */ > >>>>> - kmem_start = ALIGN(range->start, memory_block_size_bytes()); > >>>>> - > >>>>> - kmem_size = range_len(range); > >>>>> - /* Adjust the size down to compensate for moving up kmem_start: */ > >>>>> - kmem_size -= kmem_start - range->start; > >>>>> - /* Align the size down to cover only complete blocks: */ > >>>>> - kmem_size &= ~(memory_block_size_bytes() - 1); > >>>>> - kmem_end = kmem_start + kmem_size; > >>>>> - > >>>>> - new_res_name = kstrdup(dev_name(dev), GFP_KERNEL); > >>>>> - if (!new_res_name) > >>>>> + res_name = kstrdup(dev_name(dev), GFP_KERNEL); > >>>>> + if (!res_name) > >>>>> return -ENOMEM; > >>>>> > >>>>> - /* Region is permanently reserved if hotremove fails. */ > >>>>> - new_res = request_mem_region(kmem_start, kmem_size, new_res_name); > >>>>> - if (!new_res) { > >>>>> - dev_warn(dev, "could not reserve region [%pa-%pa]\n", > >>>>> - &kmem_start, &kmem_end); > >>>>> - kfree(new_res_name); > >>>>> + res = request_mem_region(range.start, range_len(&range), res_name); > >>>> > >>>> I think our range could be empty after aligning. I assume > >>>> request_mem_region() would check that, but maybe we could report a > >>>> better error/warning in that case. > >>>> > >>> dax_kmem_range() already returns a memory-block-aligned @range but > >>> IIUC request_mem_region() isn't checking for that. Having said that > >>> the returned @res wouldn't be different from the passed range.start. > >>> > >>>>> /* > >>>>> * Ensure that future kexec'd kernels will not treat this as RAM > >>>>> * automatically. > >>>>> */ > >>>>> - rc = add_memory_driver_managed(numa_node, new_res->start, > >>>>> - resource_size(new_res), kmem_name); > >>>>> + rc = add_memory_driver_managed(numa_node, res->start, > >>>>> + resource_size(res), kmem_name); > >>>>> + > >>>>> + res->flags |= IORESOURCE_BUSY; > >>>> > >>>> Hm, I don't think that's correct. Any specific reason why to mark the > >>>> not-added, unaligned parts BUSY? E.g., walk_system_ram_range() could > >>>> suddenly stumble over it - and e.g., similarly kexec code when trying to > >>>> find memory for placing kexec images. I think we should leave this > >>>> !BUSY, just as it is right now. > >>>> > >>> Agreed. > >>> > >>>>> if (rc) { > >>>>> - release_resource(new_res); > >>>>> - kfree(new_res); > >>>>> - kfree(new_res_name); > >>>>> + release_mem_region(range.start, range_len(&range)); > >>>>> + kfree(res_name); > >>>>> return rc; > >>>>> } > >>>>> - dev_dax->dax_kmem_res = new_res; > >>>>> + > >>>>> + dev_set_drvdata(dev, res_name); > >>>>> > >>>>> return 0; > >>>>> } > >>>>> > >>>>> #ifdef CONFIG_MEMORY_HOTREMOVE > >>>>> -static int dev_dax_kmem_remove(struct device *dev) > >>>>> +static void dax_kmem_release(struct dev_dax *dev_dax) > >>>>> { > >>>>> - struct dev_dax *dev_dax = to_dev_dax(dev); > >>>>> - struct resource *res = dev_dax->dax_kmem_res; > >>>>> - resource_size_t kmem_start = res->start; > >>>>> - resource_size_t kmem_size = resource_size(res); > >>>>> - const char *res_name = res->name; > >>>>> int rc; > >>>>> + struct device *dev = &dev_dax->dev; > >>>>> + const char *res_name = dev_get_drvdata(dev); > >>>>> + struct range range = dax_kmem_range(dev_dax); > >>>>> > >>>>> /* > >>>>> * We have one shot for removing memory, if some memory blocks were not > >>>>> * offline prior to calling this function remove_memory() will fail, and > >>>>> * there is no way to hotremove this memory until reboot because device > >>>>> - * unbind will succeed even if we return failure. > >>>>> + * unbind will proceed regardless of the remove_memory result. > >>>>> */ > >>>>> - rc = remove_memory(dev_dax->target_node, kmem_start, kmem_size); > >>>>> - if (rc) { > >>>>> - any_hotremove_failed = true; > >>>>> - dev_err(dev, > >>>>> - "DAX region %pR cannot be hotremoved until the next reboot\n", > >>>>> - res); > >>>>> - return rc; > >>>>> + rc = remove_memory(dev_dax->target_node, range.start, range_len(&range)); > >>>>> + if (rc == 0) { > >>>> > >>>> if (!rc) ? > >>>> > >>> Better off would be to keep the old order: > >>> > >>> if (rc) { > >>> any_hotremove_failed = true; > >>> dev_err(dev, "%#llx-%#llx cannot be hotremoved until the next reboot\n", > >>> range.start, range.end); > >>> return; > >>> } > >>> > >>> release_mem_region(range.start, range_len(&range)); > >>> dev_set_drvdata(dev, NULL); > >>> kfree(res_name); > >>> return; > >>> > >>> > >>>>> + release_mem_region(range.start, range_len(&range)); > >>>> > >>>> remove_memory() does a release_mem_region_adjustable(). Don't you > >>>> actually want to release the *unaligned* region you requested? > >>>> > >>> Isn't it what we're doing here? > >>> (The release_mem_region_adjustable() is using the same > >>> dax_kmem-aligned range and there's no split/adjust) > >>> > >>> Meaning right now (+ parent marked as !BUSY), and if I am understanding > >>> this correctly: > >>> > >>> request_mem_region(range.start, range_len) > >>> __request_region(iomem_res, range.start, range_len) -> alloc @parent > >>> add_memory_driver_managed(parent.start, resource_size(parent)) > >>> __request_region(parent.start, resource_size(parent)) -> alloc @child > >>> > >>> [...] > >>> > >>> remove_memory(range.start, range_len) > >>> request_mem_region_adjustable(range.start, range_len) > >>> __release_region(range.start, range_len) -> remove @child > >>> > >>> release_mem_region(range.start, range_len) > >>> __release_region(range.start, range_len) -> doesn't remove @parent because !BUSY? > >>> > >>> The add/removal of this relies on !BUSY. But now I am wondering if the parent remaining > >>> unreleased is deliberate even on CONFIG_MEMORY_HOTREMOVE=y. > >>> > >>> Joao > >>> > >> > >> Thinking about it, if we don't set the parent resource BUSY (which is > >> what I think is the right way of doing things), and don't want to store > >> the parent resource pointer, we could add something like > >> lookup_resource() - e.g., lookup_mem_resource() - , however, searching > >> properly in the whole hierarchy (instead of only the first level), and > >> traversing down to the last hierarchy. Then it would be as simple as > >> > >> remove_memory(range.start, range_len) > >> res = lookup_mem_resource(range.start); > >> release_resource(res); > > > > Another thought... I notice that you've taught > > register_memory_resource() a IORESOURCE_MEM_DRIVER_MANAGED special > > case. Lets just make the assumption of add_memory_driver_managed() > > that it is the driver's responsibility to mark the range busy before > > calling, and the driver's responsibility to release the region. I.e. > > validate (rather than request) that the range is busy in > > register_memory_resource(), and teach release_memory_resource() to > > skip releasing the region when the memory is marked driver managed. > > That would let dax_kmem drop its manipulation of the 'busy' flag which > > is a layering violation no matter how many comments we put around it. > > IIUC, that won't work for virtio-mem, whereby the parent resource spans > multiple possible (future) add_memory_driver_managed() calls and is > (just like for kmem) a pure indication to which device memory ranges belong. > > For example, when exposing 2GB via a virtio-mem device with max 4GB: > > (/proc/iomem) > 240000000-33fffffff : virtio0 > 240000000-2bfffffff : System RAM (virtio_mem) > > And after hotplugging additional 2GB: > > 240000000-33fffffff : virtio0 > 240000000-33fffffff : System RAM (virtio_mem) > > So marking "virtio0" always BUSY (especially right from the start) would > be wrong. I'm not suggesting to busy the whole "virtio" range, just the portion that's about to be passed to add_memory_driver_managed(). > The assumption is that anything that's IORESOURCE_SYSTEM_RAM > and IORESOUCE_BUSY is currently added to the system as system RAM (e.g., > after add_memory() and friends, or during boot). > > I do agree that manually clearing the busy flag is ugly. What we most > probably want is request_mem_region() that performs similar checks (no > overlaps with existing BUSY resources), but doesn't set the region busy. > I can't see that working without some way to export and hold the resource lock until some agent can atomically claim the range.