Re: [PATCH] cxl: Update Soft Reserved resources upon region creation

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Fontenot, Nathan wrote:
> 
> 
> On 12/11/2024 4:30 PM, Dan Williams wrote:
> > Nathan Fontenot wrote:
> >> Update handling of SOFT RESERVE iomem resources that intersect with
> >> CXL region resources to remove the intersections from the SOFT RESERVE
> >> resources. The current approach of leaving the SOFT RESERVE
> >> resource as is can cause failures during hotplug replace of CXL
> >> devices because the resource is not available for reuse after
> >> teardown of the CXL device.
> >>
> >> The approach is to trim out any pieces of SOFT RESERVE resources
> >> that intersect CXL regions. To do this, first set aside any SOFT RESERVE
> >> resources that intersect with a CFMWS into a separate resource tree
> >> during e820__reserve_resources_late() that would have been otherwise
> >> added to the iomem resource tree.
> >>
> >> As CXL regions are created the cxl resource created for the new
> >> region is used to trim intersections from the SOFT RESERVE
> >> resources that were previously set aside.
> >>
> >> Once CXL device probe has completed ant remaining SOFT RESERVE resources
> >> remaining are added to the iomem resource tree. As each resource
> >> is added to the oiomem resource tree a new notifier chain is invoked
> >> to notify the dax driver of newly added SOFT RESERVE resources so that
> >> the dax driver can consume them.
> > 
> > Hi Nathan, this patch hit on all the mechanisms I would expect, but upon
> > reading it there is an opportunity to zoom out and do something blunter
> > than the surgical precision of this current proposal.
> > 
> > In other words, I appreciate the consideration of potential corner
> > cases, but for overall maintainability this should aim to be an all or
> > nothing approach.
> > 
> > Specifically, at the first sign of trouble, any CXL sub-driver probe
> > failure or region enumeration timeout, that the entire CXL topology be
> > torn down (trigger the equivalent of ->remove() on the ACPI0017 device),
> > and the deferred Soft Reserved ranges registered as if cxl_acpi was not
> > present (implement a fallback equivalent to hmem_register_devices()).
> > 
> > No need to trim resources as regions arrive, just tear down everything
> > setup in the cxl_acpi_probe() path with devres_release_all().
> > 
> > So, I am thinking export a flag from the CXL core that indicates whether
> > any conflict with platform-firmware established CXL regions has
> > occurred.
> > 
> > Read that flag from an cxl_acpi-driver-launched deferred workqueue that
> > is awaiting initial device probing to quiesce. If that flag indicates a
> > CXL enumeration failure then trigger devres_release_all() on the
> > ACPI0017 platform device and follow that up by walking the deferred Soft
> > Reserve resources to register raw (unparented by CXL regions) dax
> > devices.
> > 
> 
> From my initial reading of this are you suggesting we tear down all
> CXL devices for any single failure? That seems a bit extreme. If I'm
> reading that wrong let me know.

I would say "comprehensive" rather than extreme. It could be softened a
bit to say that any incomplete or unhandled soft-reserved ranges get
returned to pure dax operation.

Otherwise the fact that this patch as is did not resolve the recent
regression report [1] is a sign that it needs to be more aggressive.

So if a softened approach can be done with low complexity, lets do it,
but the status quo is already extreme: end users are losing access to
their memory because the CXL subsystem fails to interpret the
configuration.

I.e. the minimum requirement for me is that all Soft Reserved ranges end
up as dax-devices.

[1]: http://lore.kernel.org/d8d2c310-2021-431f-adbe-71ad0a17896a@xxxxxxx

> If you remember I did try an approach previously using a wait_for_device_probe()
> in the cxl_acpi driver. This didn't work because the wait would return before
> probe would complete of the CXL devices. From what I saw in the
> wait_for_device_probe() code is that it waits for drivers registered at the
> time it is called which ends up being before the other cxl drivers are registered.

Right, I think that's solvable, especially now that some of the init
fixes have gone in [2] where cxl_acpi knows that the cxl_port driver has
had a chance to load.

It might require a gate like:

    cxl_acpi_flush_probe(...) {
    
        wait_event(<cxl_pci && cxl_mem have signaled that they have loaded>)
    
        wait_for_device_probe()
        ....
    }

[2]: http://lore.kernel.org/172964779333.81806.8852577918216421011.stgit@xxxxxxxxxxxxxxxxxxxxxxxxx

> This was the reason to switch a deferred workqueue approach. I do agree that
> this can become a guessing game on how long to wait and is likely to not wait
> long enough for a given configuration.
> 
> I'm open to suggestions for other approaches from anyone on determining
> when CXL device probe completes.
> 
> > Some more comments below:
> > 
> >> Signed-off-by: Nathan Fontenot <nathan.fontenot@xxxxxxx>
> >> ---
> >>  arch/x86/kernel/e820.c    |  17 ++++-
> >>  drivers/cxl/core/region.c |   8 +-
> >>  drivers/cxl/port.c        |  15 ++++
> >>  drivers/dax/hmem/device.c |  13 ++--
> >>  drivers/dax/hmem/hmem.c   |  15 ++++
> >>  drivers/dax/hmem/hmem.h   |  11 +++
> >>  include/linux/dax.h       |   4 -
> >>  include/linux/ioport.h    |   6 ++
> >>  kernel/resource.c         | 155 +++++++++++++++++++++++++++++++++++++-
> >>  9 files changed, 229 insertions(+), 15 deletions(-)
> >>  create mode 100644 drivers/dax/hmem/hmem.h
> >>
> >> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> >> index 4893d30ce438..cab82e9324a5 100644
> >> --- a/arch/x86/kernel/e820.c
> >> +++ b/arch/x86/kernel/e820.c
> >> @@ -1210,14 +1210,23 @@ static unsigned long __init ram_alignment(resource_size_t pos)
> >>  
> >>  void __init e820__reserve_resources_late(void)
> >>  {
> >> -	int i;
> >>  	struct resource *res;
> >> +	int i;
> >>  
> >> +	/*
> >> +	 * Prior to inserting SOFT_RESERVED resources we want to check for an
> >> +	 * intersection with potential CXL resources. Any SOFT_RESERVED resources
> >> +	 * that do intersect a potential CXL resource are set aside so they
> >> +	 * can be trimmed to accommodate CXL resource intersections and added to
> >> +	 * the iomem resource tree after the CXL drivers have completed their
> >> +	 * device probe.
> > 
> > Perhaps shorten to "see hmem_register_devices() and cxl_acpi_probe() for
> > deferred initialization of Soft Reserved ranges"
> > 
> >> +	 */
> >>  	res = e820_res;
> >> -	for (i = 0; i < e820_table->nr_entries; i++) {
> >> -		if (!res->parent && res->end)
> >> +	for (i = 0; i < e820_table->nr_entries; i++, res++) {
> >> +		if (res->desc == IORES_DESC_SOFT_RESERVED)
> >> +			insert_soft_reserve_resource(res);
> > 
> > I would only expect this deferral to happen when CONFIG_DEV_DAX_HMEM
> > and/or CONFIG_CXL_REGION  is enabled. It also needs to catch Soft
> > Reserved deferral on other, non-e820 based, archs. So, maybe this hackery
> > should be done internal to insert_resource_*(). Something like all
> > insert_resource() of IORES_DESC_SOFT_RESERVED is deferred until a flag
> > is flipped allowing future insertion attempts to succeed in adding them
> > to the ioresource_mem tree.
> >
> 
> Good point on non-e820 archs. I can move the check insert_resource() and
> add checks for CONFIG_DEV_DAX_HMEM/CONFIG_CXL_REGION enablement.  

Perhaps it can be more general with a new:

config SOFT_RESERVE_CONSUMER
	bool

...which drivers, that scan the soft-reserve resources for ranges that
belong to them like DEV_DAX_HMEM and CXL_REGION, select. 

I am assuming that we want to define a point in the kernel init flow
where insert_resource() stops deferring Soft Reserved to this lookaside
tree, just not sure where that should be.

> 
> > Not that I expect this problem will ever effect more than just CXL, but
> > it is already the case that Soft Reserved is set for more than just CXL
> > ranges, and who know what other backend Soft Reserved consumer drivers
> > might arrive later.
> > 
> > When CXL or HMEM parses the deferred entries they can take
> > responsibility for injecting the Soft Reserved entries. That achieves
> > continuity of the /proc/iomem contents across kernel versions while
> > giving those endpoint drivers the ability to unregister those resources.
> > 
> >> +		else if (!res->parent && res->end)
> >>  			insert_resource_expand_to_fit(&iomem_resource, res);
> >> -		res++;
> >>  	}
> >>  
> >>  	/*
> >> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> >> index 21ad5f242875..c458a6313b31 100644
> >> --- a/drivers/cxl/core/region.c
> >> +++ b/drivers/cxl/core/region.c
> >> @@ -3226,6 +3226,12 @@ static int match_region_by_range(struct device *dev, void *data)
> >>  	return rc;
> >>  }
> >>  
> >> +static int insert_region_resource(struct resource *parent, struct resource *res)
> >> +{
> >> +	trim_soft_reserve_resources(res);
> >> +	return insert_resource(parent, res);
> >> +}
> > 
> > Per above, lets not do dynamic trimming, it's all or nothing CXL memory
> > enumeration if the driver is trying and failing to parse any part of the
> > BIOS-established CXL configuration.
> 
> That can be done. I felt it was easier to trim the SR resources as CXL regions
> were created instead of going back and finding all the CXL regions after all
> device probe completed and trimming them.

The problem is that at resource creation time in construct_region() it
is unknown whether that region will actually complete assembly. That is
why I was thinking that committing to CXL taking over the range needs to
be after that "all devices present at boot have probably finished
probing" event.

The easiest is undo everything on failure since that does not require
new code, but we could do the softer, "trim everything that got to the
point of succeeding cxl_region_probe(), and fallback to undecorated
dax-devices for the rest".




[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux