Nathan Fontenot wrote:
> Update handling of SOFT RESERVE iomem resources that intersect with
> CXL region resources to remove the intersections from the SOFT RESERVE
> resources. The current approach of leaving the SOFT RESERVE
> resource as is can cause failures during hotplug replace of CXL
> devices because the resource is not available for reuse after
> teardown of the CXL device.
>
> The approach is to trim out any pieces of SOFT RESERVE resources
> that intersect CXL regions. To do this, first set aside any SOFT RESERVE
> resources that intersect with a CFMWS into a separate resource tree
> during e820__reserve_resources_late() that would have been otherwise
> added to the iomem resource tree.
>
> As CXL regions are created the cxl resource created for the new
> region is used to trim intersections from the SOFT RESERVE
> resources that were previously set aside.
>
> Once CXL device probe has completed ant remaining SOFT RESERVE resources
> remaining are added to the iomem resource tree. As each resource
> is added to the oiomem resource tree a new notifier chain is invoked
> to notify the dax driver of newly added SOFT RESERVE resources so that
> the dax driver can consume them.

Hi Nathan, this patch hits on all the mechanisms I would expect, but
upon reading it there is an opportunity to zoom out and do something
blunter than the surgical precision of the current proposal.

In other words, I appreciate the consideration of potential corner
cases, but for overall maintainability this should aim to be an
all-or-nothing approach.

Specifically, at the first sign of trouble, i.e. any CXL sub-driver
probe failure or region enumeration timeout, the entire CXL topology
should be torn down (trigger the equivalent of ->remove() on the
ACPI0017 device), and the deferred Soft Reserved ranges registered as if
cxl_acpi was not present (implement a fallback equivalent to
hmem_register_devices()).
No need to trim resources as regions arrive, just tear down everything
set up in the cxl_acpi_probe() path with devres_release_all().

So, I am thinking: export a flag from the CXL core that indicates
whether any conflict with platform-firmware established CXL regions has
occurred.

Read that flag from a cxl_acpi-driver-launched deferred workqueue that
is awaiting initial device probing to quiesce. If that flag indicates a
CXL enumeration failure then trigger devres_release_all() on the
ACPI0017 platform device and follow that up by walking the deferred Soft
Reserved resources to register raw (unparented by CXL regions) dax
devices.

Some more comments below:

> Signed-off-by: Nathan Fontenot <nathan.fontenot@xxxxxxx>
> ---
>  arch/x86/kernel/e820.c | 17 ++++-
>  drivers/cxl/core/region.c | 8 +-
>  drivers/cxl/port.c | 15 ++++
>  drivers/dax/hmem/device.c | 13 ++--
>  drivers/dax/hmem/hmem.c | 15 ++++
>  drivers/dax/hmem/hmem.h | 11 +++
>  include/linux/dax.h | 4 -
>  include/linux/ioport.h | 6 ++
>  kernel/resource.c | 155 +++++++++++++++++++++++++++++++++++++-
>  9 files changed, 229 insertions(+), 15 deletions(-)
>  create mode 100644 drivers/dax/hmem/hmem.h
>
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index 4893d30ce438..cab82e9324a5 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -1210,14 +1210,23 @@ static unsigned long __init ram_alignment(resource_size_t pos)
>
>  void __init e820__reserve_resources_late(void)
>  {
> -	int i;
>  	struct resource *res;
> +	int i;
>
> +	/*
> +	 * Prior to inserting SOFT_RESERVED resources we want to check for an
> +	 * intersection with potential CXL resources. Any SOFT_RESERVED resources
> +	 * that do intersect a potential CXL resource are set aside so they
> +	 * can be trimmed to accommodate CXL resource intersections and added to
> +	 * the iomem resource tree after the CXL drivers have completed their
> +	 * device probe.

Perhaps shorten to "see hmem_register_devices() and cxl_acpi_probe()
for deferred initialization of Soft Reserved ranges"

> +	 */
>  	res = e820_res;
> -	for (i = 0; i < e820_table->nr_entries; i++) {
> -		if (!res->parent && res->end)
> +	for (i = 0; i < e820_table->nr_entries; i++, res++) {
> +		if (res->desc == IORES_DESC_SOFT_RESERVED)
> +			insert_soft_reserve_resource(res);

I would only expect this deferral to happen when CONFIG_DEV_DAX_HMEM
and/or CONFIG_CXL_REGION is enabled. It also needs to catch Soft
Reserved deferral on other, non-e820 based, archs. So, maybe this
hackery should be done internal to insert_resource_*(). Something like:
every insert_resource() of an IORES_DESC_SOFT_RESERVED resource is
deferred until a flag is flipped, after which insertion attempts succeed
in adding them to the iomem_resource tree.

Not that I expect this problem will ever affect more than just CXL, but
it is already the case that Soft Reserved is set for more than just CXL
ranges, and who knows what other backend Soft Reserved consumer drivers
might arrive later.

When CXL or HMEM parses the deferred entries they can take
responsibility for injecting the Soft Reserved entries. That achieves
continuity of the /proc/iomem contents across kernel versions while
giving those endpoint drivers the ability to unregister those resources.
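
To make the "defer until a flag is flipped" idea concrete, here is a
rough userspace model, not kernel code. Every name in it (model_res,
model_insert_resource(), model_release_deferred(), and so on) is made up
for illustration and none of it is an existing kernel interface. It also
simplifies by injecting all deferred entries at once, where the real
thing would let the consumer driver decide what to inject.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct resource and IORES_DESC_SOFT_RESERVED. */
enum model_desc { MODEL_DESC_NONE, MODEL_DESC_SOFT_RESERVED };

struct model_res {
	unsigned long long start, end;
	enum model_desc desc;
	struct model_res *next;
};

static struct model_res *iomem_list;	/* models the live iomem tree */
static struct model_res *deferred_list;	/* Soft Reserved entries set aside */
static bool soft_reserved_allowed;	/* the flag that gets flipped */

static void list_push(struct model_res **head, struct model_res *res)
{
	res->next = *head;
	*head = res;
}

static size_t list_len(const struct model_res *head)
{
	size_t n = 0;

	for (; head; head = head->next)
		n++;
	return n;
}

/* Returns true if the resource went live, false if it was deferred. */
static bool model_insert_resource(struct model_res *res)
{
	assert(res && res->start <= res->end);

	if (res->desc == MODEL_DESC_SOFT_RESERVED && !soft_reserved_allowed) {
		list_push(&deferred_list, res);
		return false;
	}
	list_push(&iomem_list, res);
	return true;
}

/*
 * Called by whichever consumer takes ownership of the deferred entries.
 * Flips the flag, moves every deferred entry to the live list, and
 * returns how many entries were moved.
 */
static size_t model_release_deferred(void)
{
	size_t n = 0;

	soft_reserved_allowed = true;
	while (deferred_list) {
		struct model_res *res = deferred_list;

		deferred_list = res->next;
		list_push(&iomem_list, res);
		n++;
	}
	return n;
}
```

The point of gating inside the insert path is that callers on any arch
keep calling the same insert function and do not need to know about the
deferral.
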
> +		else if (!res->parent && res->end)
>  			insert_resource_expand_to_fit(&iomem_resource, res);
> -		res++;
>  	}
>
>  	/*
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 21ad5f242875..c458a6313b31 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -3226,6 +3226,12 @@ static int match_region_by_range(struct device *dev, void *data)
>  	return rc;
>  }
>
> +static int insert_region_resource(struct resource *parent, struct resource *res)
> +{
> +	trim_soft_reserve_resources(res);
> +	return insert_resource(parent, res);
> +}

Per above, let's not do dynamic trimming. It is all-or-nothing CXL
memory enumeration if the driver is trying and failing to parse any part
of the BIOS-established CXL configuration.

Yes, this could result in regressions in the other direction, but my
expectation is that the vast majority of CXL memory present at boot is
meant to be indistinguishable from DDR. In other words the current
default of "lose access to memory upon CXL enumeration failure that is
otherwise fully described by the EFI Memory Map" is the wrong default
policy.
> +
>  /* Establish an empty region covering the given HPA range */
>  static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
>  					   struct cxl_endpoint_decoder *cxled)
> @@ -3272,7 +3278,7 @@ static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
>
>  	*res = DEFINE_RES_MEM_NAMED(hpa->start, range_len(hpa),
>  				    dev_name(&cxlr->dev));
> -	rc = insert_resource(cxlrd->res, res);
> +	rc = insert_region_resource(cxlrd->res, res);
>  	if (rc) {
>  		/*
>  		 * Platform-firmware may not have split resources like "System
> diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
> index d7d5d982ce69..4461f2a80d72 100644
> --- a/drivers/cxl/port.c
> +++ b/drivers/cxl/port.c
> @@ -89,6 +89,20 @@ static int cxl_switch_port_probe(struct cxl_port *port)
>  	return -ENXIO;
>  }
>
> +static void cxl_sr_update(struct work_struct *w)
> +{
> +	merge_soft_reserve_resources();
> +}
> +
> +DECLARE_DELAYED_WORK(cxl_sr_work, cxl_sr_update);
> +
> +static void schedule_soft_reserve_update(void)
> +{
> +	int timeout = 5 * HZ;
> +
> +	mod_delayed_work(system_wq, &cxl_sr_work, timeout);
> +}

For cases where there is Soft Reserved CXL backed memory it should be
sufficient to just wait for initial device probing to complete. So I
would just have cxl_acpi_probe() call wait_for_device_probe() in a
workqueue, rather than try to guess at a timeout. If anything, waiting
for driver core deferred probing timeout seems a good time to ask "are
we missing any CXL memory ranges?".
> +
>  static int cxl_endpoint_port_probe(struct cxl_port *port)
>  {
>  	struct cxl_endpoint_dvsec_info info = { .port = port };
> @@ -140,6 +154,7 @@ static int cxl_endpoint_port_probe(struct cxl_port *port)
>  	 */
>  	device_for_each_child(&port->dev, root, discover_region);
>
> +	schedule_soft_reserve_update();
>  	return 0;
>  }
>
> diff --git a/drivers/dax/hmem/device.c b/drivers/dax/hmem/device.c
> index f9e1a76a04a9..c45791ad4858 100644
> --- a/drivers/dax/hmem/device.c
> +++ b/drivers/dax/hmem/device.c
> @@ -4,6 +4,7 @@
>  #include <linux/module.h>
>  #include <linux/dax.h>
>  #include <linux/mm.h>
> +#include "hmem.h"
>
>  static bool nohmem;
>  module_param_named(disable, nohmem, bool, 0444);
> @@ -17,6 +18,9 @@ static struct resource hmem_active = {
>  	.flags = IORESOURCE_MEM,
>  };
>
> +struct platform_device *hmem_pdev;
> +EXPORT_SYMBOL_GPL(hmem_pdev);
> +
>  int walk_hmem_resources(struct device *host, walk_hmem_fn fn)
>  {
>  	struct resource *res;
> @@ -35,7 +39,6 @@ EXPORT_SYMBOL_GPL(walk_hmem_resources);
>
>  static void __hmem_register_resource(int target_nid, struct resource *res)
>  {
> -	struct platform_device *pdev;
>  	struct resource *new;
>  	int rc;
>
> @@ -51,15 +54,15 @@ static void __hmem_register_resource(int target_nid, struct resource *res)
>  	if (platform_initialized)
>  		return;
>
> -	pdev = platform_device_alloc("hmem_platform", 0);
> -	if (!pdev) {
> +	hmem_pdev = platform_device_alloc("hmem_platform", 0);
> +	if (!hmem_pdev) {
>  		pr_err_once("failed to register device-dax hmem_platform device\n");
>  		return;
>  	}
>
> -	rc = platform_device_add(pdev);
> +	rc = platform_device_add(hmem_pdev);
>  	if (rc)
> -		platform_device_put(pdev);
> +		platform_device_put(hmem_pdev);
>  	else
>  		platform_initialized = true;

So, I don't think anyone actually cares which device parents a dax
device. It would be cleaner if cxl_acpi registered the Soft Reserved dax
devices that the hmem driver was told to skip.
That change eliminates the need for a notifier to trigger the hmem
driver to add devices after a CXL enumeration failure.

[ .. trim all the fine grained resource handling and notifier code .. ]

The end result of this effort is that the Linux CXL subsystem will
aggressively complain and refuse to run with platforms and devices that
deviate from common expectations. That gives space for Soft Reserved
generic support to fill some gaps while quirks, hacks, and workarounds
are developed to compensate for these deviations. Otherwise it has been
a constant drip of "what in the world is that platform doing?", and the
current policy of "try to depend on standard CXL enumeration" is too
fragile.
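
For reference, the all-or-nothing flow proposed at the top of this mail
can be modeled in plain userspace C as below. This is only a sketch of
the control flow, not kernel code: model_state, model_report_failure(),
model_teardown_cxl() and model_probe_settled() are hypothetical names,
the teardown function merely stands in for what devres_release_all() on
the ACPI0017 device would do, and "registering a raw dax device" is
reduced to copying a range into an array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_RANGES 8

struct model_range {
	unsigned long long start, end;
};

struct model_state {
	bool cxl_conflict;	/* models the flag exported by the CXL core */
	bool cxl_topology_up;	/* true while cxl_acpi-owned objects exist */
	struct model_range deferred[MODEL_MAX_RANGES];	/* deferred Soft Reserved */
	size_t nr_deferred;
	struct model_range raw_dax[MODEL_MAX_RANGES];	/* fallback dax devices */
	size_t nr_raw_dax;
};

/* Any sub-driver probe failure or enumeration timeout sets the flag. */
static void model_report_failure(struct model_state *s)
{
	s->cxl_conflict = true;
}

/* Stand-in for tearing down everything set up in the probe path. */
static void model_teardown_cxl(struct model_state *s)
{
	s->cxl_topology_up = false;
}

/*
 * Models the deferred work that runs once initial probing has quiesced.
 * On a clean enumeration nothing happens here (the deferred ranges are
 * left for CXL regions to claim). On any failure the whole topology is
 * torn down and every deferred range becomes a raw dax device.
 * Returns the number of raw dax devices registered by this call.
 */
static size_t model_probe_settled(struct model_state *s)
{
	size_t i, registered = 0;

	assert(s->nr_deferred <= MODEL_MAX_RANGES);

	if (!s->cxl_conflict)
		return 0;

	model_teardown_cxl(s);
	for (i = 0; i < s->nr_deferred; i++) {
		assert(s->nr_raw_dax < MODEL_MAX_RANGES);
		s->raw_dax[s->nr_raw_dax++] = s->deferred[i];
		registered++;
	}
	s->nr_deferred = 0;
	return registered;
}
```

Note there is no per-region trimming anywhere in that flow: a single
failure flips the whole platform over to the Soft Reserved fallback.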