If you remember, I did previously try an approach using
wait_for_device_probe() in the cxl_acpi driver. This didn't work because
the wait would return before the CXL devices had finished probing. From
what I saw in the wait_for_device_probe() code, it waits only for drivers
registered at the time it is called, which ends up being before the other
cxl drivers are registered. This was the reason to switch to a deferred
workqueue approach. I do agree that this can become a guessing game on how
long to wait, and it is likely to not wait long enough for a given
configuration.

I'm open to suggestions from anyone for other approaches to determining
when CXL device probe completes.

> Some more comments below:
>
>> Signed-off-by: Nathan Fontenot <nathan.fontenot@xxxxxxx>
>> ---
>>  arch/x86/kernel/e820.c    |  17 ++++-
>>  drivers/cxl/core/region.c |   8 +-
>>  drivers/cxl/port.c        |  15 ++++
>>  drivers/dax/hmem/device.c |  13 ++--
>>  drivers/dax/hmem/hmem.c   |  15 ++++
>>  drivers/dax/hmem/hmem.h   |  11 +++
>>  include/linux/dax.h       |   4 -
>>  include/linux/ioport.h    |   6 ++
>>  kernel/resource.c         | 155 +++++++++++++++++++++++++++++++++++++-
>>  9 files changed, 229 insertions(+), 15 deletions(-)
>>  create mode 100644 drivers/dax/hmem/hmem.h
>>
>> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
>> index 4893d30ce438..cab82e9324a5 100644
>> --- a/arch/x86/kernel/e820.c
>> +++ b/arch/x86/kernel/e820.c
>> @@ -1210,14 +1210,23 @@ static unsigned long __init ram_alignment(resource_size_t pos)
>>
>>  void __init e820__reserve_resources_late(void)
>>  {
>> -	int i;
>>  	struct resource *res;
>> +	int i;
>>
>> +	/*
>> +	 * Prior to inserting SOFT_RESERVED resources we want to check for an
>> +	 * intersection with potential CXL resources.
>> +	 * Any SOFT_RESERVED resources
>> +	 * that do intersect a potential CXL resource are set aside so they
>> +	 * can be trimmed to accommodate CXL resource intersections and added to
>> +	 * the iomem resource tree after the CXL drivers have completed their
>> +	 * device probe.
>
> Perhaps shorten to "see hmem_register_devices() and cxl_acpi_probe() for
> deferred initialization of Soft Reserved ranges"
>
>> +	 */
>>  	res = e820_res;
>> -	for (i = 0; i < e820_table->nr_entries; i++) {
>> -		if (!res->parent && res->end)
>> +	for (i = 0; i < e820_table->nr_entries; i++, res++) {
>> +		if (res->desc == IORES_DESC_SOFT_RESERVED)
>> +			insert_soft_reserve_resource(res);
>
> I would only expect this deferral to happen when CONFIG_DEV_DAX_HMEM
> and/or CONFIG_CXL_REGION is enabled. It also needs to catch Soft
> Reserved deferral on other, non-e820 based, archs. So, maybe this hackery
> should be done internal to insert_resource_*(). Something like all
> insert_resource() of IORES_DESC_SOFT_RESERVED is deferred until a flag
> is flipped allowing future insertion attempts to succeed in adding them
> to the ioresource_mem tree.
>

Good point on non-e820 archs. I can move the check into insert_resource()
and add checks for CONFIG_DEV_DAX_HMEM/CONFIG_CXL_REGION enablement.

> Not that I expect this problem will ever affect more than just CXL, but
> it is already the case that Soft Reserved is set for more than just CXL
> ranges, and who knows what other backend Soft Reserved consumer drivers
> might arrive later.
>
> When CXL or HMEM parses the deferred entries they can take
> responsibility for injecting the Soft Reserved entries. That achieves
> continuity of the /proc/iomem contents across kernel versions while
> giving those endpoint drivers the ability to unregister those resources.
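
To check my understanding of the "deferred until a flag is flipped"
suggestion, here is a minimal userspace sketch, not kernel code. Every name
in it (struct sr_range, sr_insert(), sr_release_deferred(),
sr_insert_allowed) is hypothetical, and the two arrays only stand in for
the iomem tree and the set-aside list. The real change would sit in the
insert_resource() path and operate on struct resource.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct resource; only what the sketch needs. */
struct sr_range {
	unsigned long long start;
	unsigned long long end;
	bool soft_reserved;	/* models desc == IORES_DESC_SOFT_RESERVED */
};

#define SR_MAX 16

static struct sr_range sr_tree[SR_MAX];		/* models the iomem tree */
static struct sr_range sr_deferred[SR_MAX];	/* set-aside Soft Reserved */
static size_t sr_tree_cnt, sr_deferred_cnt;
static bool sr_insert_allowed;			/* the flag to be flipped */

/* Returns 1 if inserted, 0 if deferred, -1 if the sketch ran out of room. */
static int sr_insert(struct sr_range r)
{
	if (r.soft_reserved && !sr_insert_allowed) {
		if (sr_deferred_cnt == SR_MAX)
			return -1;
		sr_deferred[sr_deferred_cnt++] = r;
		return 0;
	}
	if (sr_tree_cnt == SR_MAX)
		return -1;
	sr_tree[sr_tree_cnt++] = r;
	return 1;
}

/*
 * Called by whichever consumer (CXL or HMEM) takes responsibility:
 * flip the flag, then inject the deferred entries. Returns the number
 * of entries injected.
 */
static size_t sr_release_deferred(void)
{
	size_t i, n = 0;

	sr_insert_allowed = true;
	for (i = 0; i < sr_deferred_cnt; i++)
		if (sr_insert(sr_deferred[i]) == 1)
			n++;
	sr_deferred_cnt = 0;
	return n;
}
```

In this model ordinary ranges are inserted immediately, Soft Reserved
ranges are parked until sr_release_deferred() runs, and any Soft Reserved
insert after that point goes straight in.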
>
>> +		else if (!res->parent && res->end)
>>  			insert_resource_expand_to_fit(&iomem_resource, res);
>> -		res++;
>>  	}
>>
>>  	/*
>> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
>> index 21ad5f242875..c458a6313b31 100644
>> --- a/drivers/cxl/core/region.c
>> +++ b/drivers/cxl/core/region.c
>> @@ -3226,6 +3226,12 @@ static int match_region_by_range(struct device *dev, void *data)
>>  	return rc;
>>  }
>>
>> +static int insert_region_resource(struct resource *parent, struct resource *res)
>> +{
>> +	trim_soft_reserve_resources(res);
>> +	return insert_resource(parent, res);
>> +}
>
> Per above, let's not do dynamic trimming, it's all or nothing CXL memory
> enumeration if the driver is trying and failing to parse any part of the
> BIOS-established CXL configuration.

That can be done. I felt it was easier to trim the SR resources as CXL
regions were created, instead of going back after all device probing had
completed, finding all the CXL regions, and trimming then.

>
> Yes, this could result in regressions in the other direction, but my
> expectation is that the vast majority of CXL memory present at boot is
> meant to be indistinguishable from DDR. In other words the current
> default of "lose access to memory upon CXL enumeration failure that is
> otherwise fully described by the EFI Memory Map" is the wrong default
> policy.
>
>> +
>>  /* Establish an empty region covering the given HPA range */
>>  static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
>>  					   struct cxl_endpoint_decoder *cxled)
>> @@ -3272,7 +3278,7 @@ static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
>>
>>  	*res = DEFINE_RES_MEM_NAMED(hpa->start, range_len(hpa),
>>  				    dev_name(&cxlr->dev));
>> -	rc = insert_resource(cxlrd->res, res);
>> +	rc = insert_region_resource(cxlrd->res, res);
>>  	if (rc) {
>>  		/*
>>  		 * Platform-firmware may not have split resources like "System
>> diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
>> index d7d5d982ce69..4461f2a80d72 100644
>> --- a/drivers/cxl/port.c
>> +++ b/drivers/cxl/port.c
>> @@ -89,6 +89,20 @@ static int cxl_switch_port_probe(struct cxl_port *port)
>>  	return -ENXIO;
>>  }
>>
>> +static void cxl_sr_update(struct work_struct *w)
>> +{
>> +	merge_soft_reserve_resources();
>> +}
>> +
>> +DECLARE_DELAYED_WORK(cxl_sr_work, cxl_sr_update);
>> +
>> +static void schedule_soft_reserve_update(void)
>> +{
>> +	int timeout = 5 * HZ;
>> +
>> +	mod_delayed_work(system_wq, &cxl_sr_work, timeout);
>> +}
>
> For cases where there is Soft Reserved CXL backed memory it should be
> sufficient to just wait for initial device probing to complete. So I
> would just have cxl_acpi_probe() call wait_for_device_probe() in a
> workqueue, rather than try to guess at a timeout. If anything, waiting
> for driver core deferred probing timeout seems a good time to ask "are
> we missing any CXL memory ranges?".
>
>> +
>>  static int cxl_endpoint_port_probe(struct cxl_port *port)
>>  {
>>  	struct cxl_endpoint_dvsec_info info = { .port = port };
>> @@ -140,6 +154,7 @@ static int cxl_endpoint_port_probe(struct cxl_port *port)
>>  	 */
>>  	device_for_each_child(&port->dev, root, discover_region);
>>
>> +	schedule_soft_reserve_update();
>>  	return 0;
>>  }
>>
>> diff --git a/drivers/dax/hmem/device.c b/drivers/dax/hmem/device.c
>> index f9e1a76a04a9..c45791ad4858 100644
>> --- a/drivers/dax/hmem/device.c
>> +++ b/drivers/dax/hmem/device.c
>> @@ -4,6 +4,7 @@
>>  #include <linux/module.h>
>>  #include <linux/dax.h>
>>  #include <linux/mm.h>
>> +#include "hmem.h"
>>
>>  static bool nohmem;
>>  module_param_named(disable, nohmem, bool, 0444);
>> @@ -17,6 +18,9 @@ static struct resource hmem_active = {
>>  	.flags = IORESOURCE_MEM,
>>  };
>>
>> +struct platform_device *hmem_pdev;
>> +EXPORT_SYMBOL_GPL(hmem_pdev);
>> +
>>  int walk_hmem_resources(struct device *host, walk_hmem_fn fn)
>>  {
>>  	struct resource *res;
>> @@ -35,7 +39,6 @@ EXPORT_SYMBOL_GPL(walk_hmem_resources);
>>
>>  static void __hmem_register_resource(int target_nid, struct resource *res)
>>  {
>> -	struct platform_device *pdev;
>>  	struct resource *new;
>>  	int rc;
>>
>> @@ -51,15 +54,15 @@ static void __hmem_register_resource(int target_nid, struct resource *res)
>>  	if (platform_initialized)
>>  		return;
>>
>> -	pdev = platform_device_alloc("hmem_platform", 0);
>> -	if (!pdev) {
>> +	hmem_pdev = platform_device_alloc("hmem_platform", 0);
>> +	if (!hmem_pdev) {
>>  		pr_err_once("failed to register device-dax hmem_platform device\n");
>>  		return;
>>  	}
>>
>> -	rc = platform_device_add(pdev);
>> +	rc = platform_device_add(hmem_pdev);
>>  	if (rc)
>> -		platform_device_put(pdev);
>> +		platform_device_put(hmem_pdev);
>>  	else
>>  		platform_initialized = true;
>
> So, I don't think anyone actually cares which device parents a dax
> device. It would be cleaner if cxl_acpi registered the Soft Reserved dax
> devices that the hmem driver was told to skip.
>
> That change eliminates the need for a notifier to trigger the hmem
> driver to add devices after a CXL enumeration failure.

OK, I do like this better than adding a notification chain to
kernel/resource.c for what felt like a one-time notification.

-Nathan

>
> [ .. trim all the fine grained resource handling and notifier code .. ]
>
> The end result of this effort is that the Linux CXL subsystem will
> aggressively complain and refuse to run with platforms and devices that
> deviate from common expectations. That gives space for Soft Reserved
> generic support to fill some gaps while quirks, hacks, and workarounds
> are developed to compensate for these deviations. Otherwise it has been
> a constant drip of "what in the world is that platform doing?", and the
> current policy of "try to depend on standard CXL enumeration" is too
> fragile.
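
For the suggestion above that cxl_acpi register the Soft Reserved dax
devices the hmem driver was told to skip, this is how I read the handoff,
written as a small userspace model rather than kernel code. The names
(struct span, span_intersects(), hmem_skips(), cxl_fallback_count()) are
hypothetical, and treating "CXL enumeration succeeded" as a single boolean
is my assumption about the all-or-nothing policy, not something the patch
implements today.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Inclusive address range; hypothetical stand-in for struct resource. */
struct span {
	unsigned long long start;
	unsigned long long end;
};

static bool span_intersects(struct span a, struct span b)
{
	return a.start <= b.end && b.start <= a.end;
}

/* hmem leaves a Soft Reserved range alone if it touches any CXL window. */
static bool hmem_skips(struct span sr, const struct span *cxl_win, size_t nwin)
{
	size_t i;

	for (i = 0; i < nwin; i++)
		if (span_intersects(sr, cxl_win[i]))
			return true;
	return false;
}

/*
 * All or nothing: if CXL enumeration succeeded, the CXL side has no
 * fallback work. If it failed, every range hmem skipped is registered
 * as a dax device by the CXL side. Returns how many ranges that is.
 */
static size_t cxl_fallback_count(const struct span *sr, size_t nsr,
				 const struct span *cxl_win, size_t nwin,
				 bool cxl_enumeration_ok)
{
	size_t i, n = 0;

	if (cxl_enumeration_ok)
		return 0;
	for (i = 0; i < nsr; i++)
		if (hmem_skips(sr[i], cxl_win, nwin))
			n++;
	return n;
}
```

With this split no notifier is needed: hmem makes its skip decision once,
and the CXL side already knows which ranges it owes a fallback for.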