On 2/20/25 8:09 PM, Alison Schofield wrote:
> On Fri, Jan 17, 2025 at 10:28:31AM -0700, Dave Jiang wrote:
>> The current cxl region size only indicates the size of the CXL memory
>> region without accounting for the extended linear cache size. Retrieve the
>> cache size from HMAT and append that to the cxl region size for the cxl
>> region range that matches the SRAT range that has extended linear cache
>> enabled.
>>
>> The SRAT defines the whole memory range that includes the extended linear
>> cache and the CXL memory region. The new HMAT ECN/ECR to the Memory Side
>> Cache Information Structure defines the size of the extended linear cache
>> and matches the SRAT Memory Affinity Structure by the memory
>> proximity domain. Add a helper to match the cxl range to the SRAT memory
>> range in order to retrieve the cache size.
>>
>> There are several places that check the cxl region range against the
>> decoder range. Use the new helper to check between the two ranges and
>> account for the new cache size.
>
> This reads like we are inflating the region size by the cache size, and
> then changing region setup code to account for the inflation. So I'm going
> to question whether we need to do that inflation.
>
> When the new region param p->cache_size is calculated, it is added directly
> to p->res, and that leads to much of the other work in region.c.
>
> Could p->cache_size be used as an addend when needed, like:
> - Add it to the insert_resource() in construct_region().
> - Add it to the sysfs shows for region resource start and resource size.
>
> Then when we get to dpa-to-hpa address translation, the p->res start
> doesn't need adjusting either. As it is now, it's the cache start,
> and I think it should be the cxl resource start.
>
> The touchpoints may grow in the direction I'm suggesting, making
> it a poorer choice than what is here now.
> Maybe it's time for something like a cxl_resource and a non_cxl_resource
> that add together to make the region_resource.
>
> I haven't been following this patch set all along, just started looking
> yesterday, so I'm prepared to be way off base. I figure blurting it out
> at this point is the faster path forward.
>
> More comments related below...
>
>
>> diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
> snip
>
>> diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
> snip
>
>> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
>> index b98b1ccffd1c..2d8699a86b24 100644
>> --- a/drivers/cxl/core/region.c
>> +++ b/drivers/cxl/core/region.c
>> @@ -824,6 +824,21 @@ static int match_free_decoder(struct device *dev, void *data)
>>  	return 1;
>>  }
>>
>> +static bool region_res_match_cxl_range(struct cxl_region_params *p,
>> +				       struct range *range)
>> +{
>> +	if (!p->res)
>> +		return false;
>> +
>> +	/*
>> +	 * If an extended linear cache region then the CXL range is assumed
>> +	 * to be fronted by the DRAM range in current known implementation.
>> +	 * This assumption will be made until a variant implementation exists.
>> +	 */
>> +	return p->res->start + p->cache_size == range->start &&
>> +	       p->res->end == range->end;
>> +}
>> +
>>  static int match_auto_decoder(struct device *dev, void *data)
>>  {
>>  	struct cxl_region_params *p = data;
>> @@ -836,7 +851,7 @@ static int match_auto_decoder(struct device *dev, void *data)
>>  	cxld = to_cxl_decoder(dev);
>>  	r = &cxld->hpa_range;
>>
>> -	if (p->res && p->res->start == r->start && p->res->end == r->end)
>> +	if (region_res_match_cxl_range(p, r))
>>  		return 1;
>
> If we don't change p->res directly, this isn't needed.

It does get changed, so it's needed. A lot of these changes were made after
tripping setup failures during testing and debugging.
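To make the fronting assumption concrete, here's a small user-space C sketch
(toy structs, not the kernel's types) modeling what region_res_match_cxl_range()
checks: with an extended linear cache, p->res covers [cache][cxl], so the
decoder's HPA range is expected to start cache_size bytes into the region
resource. Illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins for the kernel structs (illustrative only). */
struct range { uint64_t start, end; };
struct region_params { struct range res; uint64_t cache_size; };

/*
 * Model of region_res_match_cxl_range(): the region resource is
 * assumed to be [DRAM cache][CXL range], so the decoder range must
 * match the region resource offset by cache_size at the start and
 * exactly at the end.
 */
static bool region_res_match_cxl_range(const struct region_params *p,
				       const struct range *r)
{
	return p->res.start + p->cache_size == r->start &&
	       p->res.end == r->end;
}
```

With cache_size == 0 this degenerates to the old exact-match comparison, which
is why a single helper can replace the open-coded checks.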
>
>>  	return 0;
>> @@ -1424,8 +1439,7 @@ static int cxl_port_setup_targets(struct cxl_port *port,
>>  	if (test_bit(CXL_REGION_F_AUTO, &cxlr->flags)) {
>>  		if (cxld->interleave_ways != iw ||
>>  		    cxld->interleave_granularity != ig ||
>> -		    cxld->hpa_range.start != p->res->start ||
>> -		    cxld->hpa_range.end != p->res->end ||
>> +		    !region_res_match_cxl_range(p, &cxld->hpa_range) ||
>
> similar
>
>>  		    ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)) {
>>  			dev_err(&cxlr->dev,
>>  				"%s:%s %s expected iw: %d ig: %d %pr\n",
>> @@ -1949,7 +1963,7 @@ static int cxl_region_attach(struct cxl_region *cxlr,
>>  		return -ENXIO;
>>  	}
>>
>> -	if (resource_size(cxled->dpa_res) * p->interleave_ways !=
>> +	if (resource_size(cxled->dpa_res) * p->interleave_ways + p->cache_size !=
>>  	    resource_size(p->res)) {
>
> similar
>
>>  		dev_dbg(&cxlr->dev,
>>  			"%s:%s: decoder-size-%#llx * ways-%d != region-size-%#llx\n",
>> @@ -3221,6 +3235,45 @@ static int match_region_by_range(struct device *dev, void *data)
>>  	return rc;
>>  }
>>
>> +static int cxl_extended_linear_cache_resize(struct cxl_region *cxlr,
>> +					    struct resource *res)
>> +{
>> +	struct cxl_region_params *p = &cxlr->params;
>> +	int nid = phys_to_target_node(res->start);
>> +	resource_size_t size, cache_size;
>> +	int rc;
>> +
>> +	size = resource_size(res);
>> +	if (!size)
>> +		return -EINVAL;
>> +
>> +	rc = cxl_acpi_get_extended_linear_cache_size(res, nid, &cache_size);
>> +	if (rc)
>> +		return rc;
>> +
>> +	if (!cache_size)
>> +		return 0;
>> +
>> +	if (size != cache_size) {
>> +		dev_warn(&cxlr->dev, "Extended Linear Cache is not 1:1, unsupported!");
>> +		return -EOPNOTSUPP;
>> +	}
>> +
>> +	/*
>> +	 * Move the start of the range to where the cache range starts. The
>> +	 * implementation assumes that the cache range is in front of the
>> +	 * CXL range. This is not dictated by the HMAT spec but is how the
>> +	 * current known implementation is configured.
>> +	 *
>> +	 * The cache range is expected to be within the CFMWS.
>> +	 * The adjusted
>> +	 * res->start should not be less than cxlrd->res->start.
>
> Check for 'cache range is expected to be within the CFMWS'?

Will add.

>
>> +	 */
>> +	res->start -= cache_size;
>> +	p->cache_size = cache_size;
>> +
>> +	return 0;
>> +}
>> +
>>  /* Establish an empty region covering the given HPA range */
>>  static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
>>  					   struct cxl_endpoint_decoder *cxled)
>> @@ -3267,6 +3320,18 @@ static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
>>
>>  	*res = DEFINE_RES_MEM_NAMED(hpa->start, range_len(hpa),
>>  				    dev_name(&cxlr->dev));
>> +
>> +	rc = cxl_extended_linear_cache_resize(cxlr, res);
>> +	if (rc) {
>> +		/*
>> +		 * Failing to support extended linear cache region resize does not
>> +		 * prevent the region from functioning. Only causes cxl list showing
>> +		 * incorrect region size.
>
> Also cxlr_hpa_cache_alias() lookups will fail for cxl events, so no
> hpa_alias in trace events.

Right. But it needs to report the near memory alias vs. the CXL address.
hpa_alias is used interchangeably and is not necessarily specific to near or
far memory.

>
>> +		 */
>> +		dev_warn(cxlmd->dev.parent,
>> +			 "Failed to support extended linear cache.\n");
>
> Maybe more specifics of what is/isn't present.

It's just a general catch-all for whatever failures occur while retrieving the
cache size and calculating the start address.

>
>> +	}
>> +
>>  	rc = insert_resource(cxlrd->res, res);
>
> Cut off in this diff is the "p->res = res" assignment that follows,
> which then makes all the previous changes regarding matching decoder
> ranges necessary.

Yes.

>
>
>>  	if (rc) {
>>  		/*
>> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> snip
>
>> diff --git a/include/linux/acpi.h b/include/linux/acpi.h
> snip
>
>> diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild
> snip
>
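For reference, the arithmetic that cxl_extended_linear_cache_resize() performs
can be modeled in a stand-alone user-space sketch (toy types, illustrative
only): the cache must mirror the CXL range 1:1, and on success the resource
start is pulled back by cache_size so the region resource spans [cache][cxl].

```c
#include <errno.h>
#include <stdint.h>

/* Toy resource; stands in for struct resource (illustrative only). */
struct res { uint64_t start, end; };

/*
 * Model of the resize step: cache_size comes from HMAT (here passed
 * in directly). Zero cache means no extended linear cache; a cache
 * that is not 1:1 with the CXL range is rejected; otherwise the
 * start moves down by cache_size and the cache size is recorded.
 */
static int elc_resize(struct res *r, uint64_t cache_size,
		      uint64_t *out_cache_size)
{
	uint64_t size = r->end - r->start + 1;

	if (!cache_size)
		return 0;		/* no extended linear cache */

	if (size != cache_size)
		return -EOPNOTSUPP;	/* only 1:1 is supported */

	r->start -= cache_size;		/* cache fronts the CXL range */
	*out_cache_size = cache_size;
	return 0;
}
```

This also illustrates Alison's point: once r->start has been moved, every
consumer of the resource sees the cache-inclusive range, which is what forces
the decoder-matching changes earlier in the patch.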