Re: [PATCH v3 2/2] cxl: Calculate region bandwidth of targets with shared upstream link

Dave Jiang <dave.jiang@xxxxxxxxx> · Thu, 6 Jun 2024 16:39:24 -0700

On 6/5/24 7:19 AM, Jonathan Cameron wrote:
> On Wed, 29 May 2024 14:38:58 -0700
> Dave Jiang <dave.jiang@xxxxxxxxx> wrote:
> 
>> The current bandwidth calculation aggregates all the targets. This simple
>> method does not take into account where multiple targets sharing under
>> a switch where the aggregated bandwidth can be less or greater than the
>> upstream link of the switch.
>>
>> To accurately account for the shared upstream uplink cases, a new update
>> function is introduced by walking from the leaves to the root of the
>> hierachy and adjust the bandwidth in the process. This process is done when
>> all the targets for a region is present but before the final values are
> are present
> 
>> send to the HMAT handling code cached access_coordinate targets.
>>
>> The original perf calculation path was kept to calculate the configurations
>> without switches and also keep the latency data as that does not require
>> special calculation. The shared upstream link calculation is done as a
>> second pass when all the endpoints have arrived.
>>
>> Suggested-by: Jonathan Cameron <jonathan.cameron@xxxxxxxxxx>
>> Link: https://lore.kernel.org/linux-cxl/20240501152503.00002e60@xxxxxxxxxx/
>> Signed-off-by: Dave Jiang <dave.jiang@xxxxxxxxx>
>>
>> ---
>> v3:
>> - Redo calculation based on Jonathan's suggestion. (Jonathan)
>> ---
>>  drivers/cxl/core/cdat.c   | 380 ++++++++++++++++++++++++++++++++++++--
> 
> Hi Dave,
> 
> The one case I couldn't quite figure out is that of multiple
> RP on a given RC.
> 
> Imagine bw as follows
> 
>       Host
>         | 3 from GP/HMAT
>    _____|_____
>   RP         RP
>   2|          |2
>  __|__     ___|__
> |1    |1  1|     |1
> EP   EP    EP    EP
> 
> I'm not sure following through the code that we get
> 3 (reported aggregate bw to the host bridge) rather than 4 (min elsewhere)
> 
> I might be missing why this case works...
> 
> Various minor comments inline.
> 
> Jonathan
> 
> 
>>  drivers/cxl/core/core.h   |   4 +-
>>  drivers/cxl/core/pci.c    |  21 +++
>>  drivers/cxl/core/port.c   |  26 +++
>>  drivers/cxl/core/region.c |   4 +
>>  drivers/cxl/cxl.h         |   1 +
>>  6 files changed, 418 insertions(+), 18 deletions(-)
>>
>> diff --git a/drivers/cxl/core/cdat.c b/drivers/cxl/core/cdat.c
>> index fea214340d4b..e77b416b0fe0 100644
>> --- a/drivers/cxl/core/cdat.c
>> +++ b/drivers/cxl/core/cdat.c
>> @@ -548,32 +548,378 @@ void cxl_coordinates_combine(struct access_coordinate *out,
>>  
>>  MODULE_IMPORT_NS(CXL);
>>  
>> -void cxl_region_perf_data_calculate(struct cxl_region *cxlr,
>> -				    struct cxl_endpoint_decoder *cxled)
>> +static void cxl_bandwidth_add(struct access_coordinate *coord,
>> +			      struct access_coordinate *c1,
>> +			      struct access_coordinate *c2)
>>  {
>> -	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
>> -	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>> -	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlds);
>> -	struct range dpa = {
>> -			.start = cxled->dpa_res->start,
>> -			.end = cxled->dpa_res->end,
>> -	};
>> -	struct cxl_dpa_perf *perf;
>> +	for (int i = 0; i < ACCESS_COORDINATE_MAX; i++) {
>> +		coord[i].read_bandwidth = c1[i].read_bandwidth +
>> +					  c2[i].read_bandwidth;
>> +		coord[i].write_bandwidth = c1[i].write_bandwidth +
>> +					   c2[i].write_bandwidth;
>> +	}
>> +}
>>  
>> -	switch (cxlr->mode) {
>> +struct cxl_perf_ctx {
>> +	struct access_coordinate coord[ACCESS_COORDINATE_MAX];
>> +	struct cxl_port *port;
>> +};
> 
> ...
> 
>> +static bool dpa_perf_contains(struct cxl_dpa_perf *perf,
>> +			      struct resource *dpa_res)
>> +{
>> +	struct range dpa = {
>> +			.start = dpa_res->start,
>> +			.end = dpa_res->end,
> over indented.

will fix

> 
>> +	};
>>  
>>  	if (!range_contains(&perf->dpa_range, &dpa))
>> +		return false;
>> +
>> +	return true;
>> +}
>> +
> 
>> +static int cxl_endpoint_gather_coordinates(struct cxl_region *cxlr,
>> +					   struct cxl_endpoint_decoder *cxled,
>> +					   struct xarray *usp_xa)
>> +{
>> +	struct cxl_port *endpoint = to_cxl_port(cxled->cxld.dev.parent);
>> +	struct access_coordinate pci_coord[ACCESS_COORDINATE_MAX];
>> +	struct access_coordinate sw_coord[ACCESS_COORDINATE_MAX];
>> +	struct access_coordinate ep_coord[ACCESS_COORDINATE_MAX];
>> +	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
>> +	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>> +	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlds);
>> +	struct pci_dev *pdev = to_pci_dev(cxlds->dev);
>> +	struct cxl_perf_ctx *c __free(kfree) = NULL;
> 
> Linus isn't fond of the fact free ordering is odd in these cases.
> Doesn't matter here, but it might if we refactor in future.
> You can reduce the scope anyway to make that a moot point.

I'll move it down. And other locations as well.
> 
>> +	struct cxl_perf_ctx *perf_ctx;
>> +	struct cxl_port *parent_port;
>> +	struct cxl_dpa_perf *perf;
>> +	unsigned long index;
>> +	void *ptr;
>> +	int rc;
>> +
>> +	if (cxlds->rcd)
>> +		return -ENODEV;
>> +
>> +	parent_port = to_cxl_port(endpoint->dev.parent);
>> +	/* No CXL switch upstream then skip out. */
>> +	if (is_cxl_root(parent_port))
>> +		return -ENODEV;
>> +
>> +	perf = cxl_memdev_get_dpa_perf(mds, cxlr->mode);
>> +	if (IS_ERR(perf))
>> +		return PTR_ERR(perf);
>> +
>> +	if (!dpa_perf_contains(perf, cxled->dpa_res))
>> +		return -EINVAL;
>> +
>> +	/*
>> +	 * The index for the xarray is the upstream port device of the upstream
>> +	 * CXL switch.
>> +	 */
>> +	index = (unsigned long)(unsigned long *)parent_port->uport_dev;
> 
> Random side note for another day. That seems rather inefficient as xarrays
> love densely clustered indexes and this will be aligned so bottom bunch
> of bits are 0. Ah well - improving that would make for messy code or
> some nice macros.
> 
> Why do you need the (unsigned long *) cast?

Will drop.
> 
> 
>> +	perf_ctx = xa_load(usp_xa, index);
>> +	if (!perf_ctx) {
>> +		c = kzalloc(sizeof(*perf_ctx), GFP_KERNEL);
> 
> scope of c is just in here I think, so drag the definition and destructor down here?
> 
>> +		if (!c)
>> +			return -ENOMEM;
>> +		ptr = xa_store(usp_xa, index, c, GFP_KERNEL);
>> +		if (xa_is_err(ptr))
>> +			return xa_err(ptr);
>> +		perf_ctx = no_free_ptr(c);
>> +	}
>> +
>> +	/* Upstream link bandwidth */
> Too may upstreams in here. I'd be more specific.
> 	/* Direct upstream link from EP bandwidth */

Will replace.

>> +	rc = cxl_pci_get_bandwidth(pdev, pci_coord);
>> +	if (rc < 0)
>> +		return rc;
>> +
>> +	/* Min of upstream link bandwidth and Endpoint CDAT bandwidth from DSLBIS */
>> +	cxl_coordinates_combine(ep_coord, pci_coord, perf->cdat_coord);
>> +
>> +	/*
>> +	 * Retrieve the switch SSLBIS for switch downstream port associated with
>> +	 * the endpoint bandwidth.
>> +	 */
>> +	rc = cxl_port_get_switch_dport_bandwidth(endpoint, sw_coord);
>> +	if (rc)
>> +		return rc;
>> +
>> +	/* Min of the earlier coordinates with the switch SSLBIS bandwidth */
>> +	cxl_coordinates_combine(ep_coord, ep_coord, sw_coord);
>> +
>> +	/*
>> +	 * Aggregate the computed bandwidth with the current aggregated bandwidth
>> +	 * of the endpoints with the same upstream switch port.
> 
> "switch upstream port"

Will replace.
> 
> Arguable the DSP could be the upstream switch port (from the EP)
> 
> 
>> +	 */
>> +	cxl_bandwidth_add(perf_ctx->coord, perf_ctx->coord, ep_coord);
>> +	perf_ctx->port = parent_port;
>> +
>> +	return 0;
>> +}
>> +
>> +static struct xarray *cxl_switch_iterate_coordinates(struct xarray *input_xa)
>> +{
>> +	struct xarray *res_xa __free(kfree) = kzalloc(sizeof(*res_xa), GFP_KERNEL);
>> +	struct access_coordinate coords[ACCESS_COORDINATE_MAX];
>> +	struct cxl_perf_ctx *n __free(kfree) = NULL;
> 
> Same comment as below. Reduce the scope of this and you'll have the destructor
> and constructor nicely paired.
> 
>> +	struct cxl_perf_ctx *ctx, *us_ctx;
>> +	unsigned long index, key;
>> +	bool is_root = false;
>> +	void *ptr;
>> +	int rc;
>> +
>> +	if (!res_xa)
>> +		return ERR_PTR(-ENOMEM);
>> +	xa_init(res_xa);
>> +
>> +	xa_for_each(input_xa, index, ctx) {
>> +		struct device *dev = (struct device *)(unsigned long *)index;
> 
> Why cast via an (unsigned long *) ?

Will drop
> 
>> +		struct cxl_port *parent_port, *port;
>> +		struct cxl_dport *dport;
>> +		struct pci_dev *pdev;
>> +
>> +		port = ctx->port;
>> +		parent_port = to_cxl_port(port->dev.parent);
>> +		dport = port->parent_dport;
>> +		/*
>> +		 * If parent port is CXL root port, then take the min of the parent
>> +		 * dport bandwidth, which comes from the generic target via ACPI
>> +		 * tables (SRAT & HMAT), and the bandwidth in the xarray. Then
>> +		 * update the bandwidth in the xarray.
>> +		 */
>> +		if (is_cxl_root(parent_port)) {
>> +			ctx->port = parent_port;
>> +			cxl_coordinates_combine(ctx->coord, ctx->coord,
>> +						dport->coord);
> 
> I'm a bit lost in all the levels of iteration so may have missed it.
> 
> Do we assume that GP BW (which is the root bridge) is shared across multiple root
> ports on that host bridge if they are both part of the interleave set?

Do we need to count the number of RPs under a HB and do min(aggregated_RPs_BW, (GP_BW / no of RPs) * affiliated_RPs_in_region)?

> 
>> +			is_root = true;
>> +			continue;
>> +		}
>> +
>> +		/* Below is the calculation for another switch upstream */
>> +
>> +		/*
>> +		 * If the device isn't an upstream PCIe port, there's something
>> +		 * wrong with the topology.
>> +		 */
>> +		if (!dev_is_pci(dev))
>> +			return ERR_PTR(-EINVAL);
>> +
>> +		/* Retrieve the upstream link bandwidth */
>> +		pdev = to_pci_dev(dev);
>> +		rc = cxl_pci_get_bandwidth(pdev, coords);
>> +		if (rc)
>> +			return ERR_PTR(-ENXIO);
>> +
>> +		/*
>> +		 * Take the min of downstream bandwidth and the upstream link
>> +		 * bandwidth.
>> +		 */
>> +		cxl_coordinates_combine(coords, coords, ctx->coord);
>> +		/*
>> +		 * Take the min of the calculated bandwdith and the upstream
>> +		 * switch SSLBIS bandwidth.
>> +		 */
>> +		cxl_coordinates_combine(coords, coords, dport->coord);
>> +
>> +		/*
>> +		 * Create an xarray entry with the key of the upstream
>> +		 * port of the upstream switch.
>> +		 */
>> +		key = (unsigned long)(unsigned long *)parent_port->uport_dev;
>> +		us_ctx = xa_load(res_xa, key);
>> +		if (!us_ctx) {
> 
> as above, define n here.
> 
>> +			n = kzalloc(sizeof(*n), GFP_KERNEL);
>> +			if (!n)
>> +				return ERR_PTR(-ENOMEM);
>> +			ptr = xa_store(res_xa, key, n, GFP_KERNEL);
>> +			if (xa_is_err(ptr))
>> +				return ERR_PTR(xa_err(ptr));
>> +			us_ctx = no_free_ptr(n);
>> +		}
>> +
>> +		/*
>> +		 * Aggregate the bandwidth based on the upstream switch.
>> +		 */
>> +		cxl_bandwidth_add(us_ctx->coord, us_ctx->coord, coords);
>> +		us_ctx->port = parent_port;
>> +	}
>> +
>> +	if (is_root)
>> +		return NULL;
>> +
>> +	return_ptr(res_xa);
>> +}
>> +
>> +static void cxl_region_update_access_coordinate(struct cxl_region *cxlr,
>> +						struct xarray *input_xa)
>> +{
>> +	struct access_coordinate coord[ACCESS_COORDINATE_MAX];
>> +	struct cxl_perf_ctx *ctx;
>> +	unsigned long index;
>> +
>> +	memset(coord, 0, sizeof(coord));
>> +	xa_for_each(input_xa, index, ctx)
>> +		cxl_bandwidth_add(coord, coord, ctx->coord);
>> +
>> +	for (int i = 0; i < ACCESS_COORDINATE_MAX; i++) {
>> +		cxlr->coord[i].read_bandwidth = coord[i].read_bandwidth;
>> +		cxlr->coord[i].write_bandwidth = coord[i].write_bandwidth;
>> +	}
>> +}
>> +
>> +static void free_perf_xa(struct xarray *xa)
>> +{
>> +	struct cxl_perf_ctx *ctx;
>> +	unsigned long index;
>> +
>> +	if (!xa)
>> +		return;
>> +
>> +	xa_for_each(xa, index, ctx)
>> +		kfree(ctx);
>> +	xa_destroy(xa);
>> +	kfree(xa);
>> +}
>> +
>> +/*
>> + * cxl_region_shared_upstream_perf_update - Recalculate the access coordinates
>> + * @cxl_region: the cxl region to recalculate
>> + *
>> + * For certain region construction with endpoints behind CXL switches,
>> + * there is the possibility of the total bandwdith for all the endpoints
>> + * behind a switch between less or more than the switch upstream link. The
> 
> switch being less or more

will update

> 
>> + * algorithm assumes the configuration is a symmetric topology that puts
> 
> topology as that maximizes performance.

will update
> 
>> + * max performance into consideration. For a region topology without switch
>> + * involvement, the code exits without doing computation.
>> + *
> I suspect this restriction might bite us one day, but meh, it's complex enough for
> now and clearly stated what the limitation is!

The restriction is now gone with having to address shared RPs w/o switches.

> 
>> + * An example hierachy:
>> + *
>> + *                 CFMWS 0
>> + *                   |
>> + *          _________|_________
>> + *         |                   |
>> + *   GP0/HB0/ACPI0017-0  GP1/HB1/ACPI0017-1
>> + *     |          |        |           |
>> + *    RP0        RP1      RP2         RP3
>> + *     |          |        |           |
>> + *   SW 0       SW 1     SW 2        SW 3
>> + *   |   |      |   |    |   |       |   |
>> + *  EP0 EP1    EP2 EP3  EP4  EP5    EP6 EP7
>> + *
>> + * Computation for the example hierachy:
>> + *
>> + * Min (GP0 to CPU BW,
>> + *      Min(SW 0 Upstream Link to RP0 BW,
>> + *          Min(SW0SSLBIS for SW0DSP0 (EP0), EP0 DSLBIS, EP0 Upstream Link) +
>> + *          Min(SW0SSLBIS for SW0DSP1 (EP1), EP1 DSLBIS, EP1 Upstream link)) +
>> + *      Min(SW 1 Upstream Link to RP1 BW,
>> + *          Min(SW1SSLBIS for SW1DSP0 (EP2), EP2 DSLBIS, EP2 Upstream Link) +
>> + *          Min(SW1SSLBIS for SW1DSP1 (EP3), EP3 DSLBIS, EP3 Upstream link))) +
>> + * Min (GP1 to CPU BW,
>> + *      Min(SW 2 Upstream Link to RP2 BW,
>> + *          Min(SW2SSLBIS for SW2DSP0 (EP4), EP4 DSLBIS, EP4 Upstream Link) +
>> + *          Min(SW2SSLBIS for SW2DSP1 (EP5), EP5 DSLBIS, EP5 Upstream link)) +
>> + *      Min(SW 3 Upstream Link to RP3 BW,
>> + *          Min(SW3SSLBIS for SW3DSP0 (EP6), EP6 DSLBIS, EP6 Upstream Link) +
>> + *          Min(SW3SSLBIS for SW3DSP1 (EP7), EP7 DSLBIS, EP7 Upstream link))))
>> + */
>> +void cxl_region_shared_upstream_perf_update(struct cxl_region *cxlr)
>> +{
>> +	struct xarray *usp_xa, *iter_xa, *working_xa;
>> +	int rc;
>> +
>> +	lockdep_assert_held(&cxl_dpa_rwsem);
>> +
>> +	usp_xa = kzalloc(sizeof(*usp_xa), GFP_KERNEL);
>> +	if (!usp_xa)
>> +		return;
>> +
>> +	xa_init(usp_xa);
>> +
>> +	/*
>> +	 * Collect aggregated endpoint bandwidth and store the bandwidth in
>> +	 * an xarray indexed by the upstream port of the switch. The bandwidth
>> +	 * is aggregated per switch. Each endpoint consists of the minumum of
>> +	 * bandwidth from DSLBIS from the endpoint CDAT, the endpoint upstream
>> +	 * link bandwidth, and the bandwidth from the SSLBIS of the switch
>> +	 * CDAT for the downstream port that's associated with the endpoint.
> 
> CDAT for the switch upstream port to the downstream port that's ...

Will update.

> 
> 
> Should have p2p downstream to downstream numbers in there too I think and we
> don't yet care about those.
> 
>> +	 */
>> +	for (int i = 0; i < cxlr->params.nr_targets; i++) {
>> +		struct cxl_endpoint_decoder *cxled = cxlr->params.targets[i];
>> +
>> +		rc = cxl_endpoint_gather_coordinates(cxlr, cxled, usp_xa);
>> +		if (rc) {
>> +			free_perf_xa(usp_xa);
>> +			return;
>> +		}
>> +	}
>> +
>> +	/*
>> +	 * Iterate through the components in the xarray and aggregate any
>> +	 * component that share the same upstream switch. The iteration
> 
> Nitpick but worth making it clear this is the VCS of a switch or avoid
> it by referring to sharing the upstream link from the switch.

Will update.

> 
>> +	 * takes consideration of multi-level switch hierachy.
>> +	 *
>> +	 * When cxl_switch_iterate_coordinates() detect the next level
>> +	 * upstream is a root port, it updates the bandwidth in the
>> +	 * xarray by taking the min of the provided bandwidth and
>> +	 * the bandwidth from the generic port. The function
>> +	 * returns NULL to indicate that the input xarray has been
>> +	 * updated.
>> +	 */
>> +	iter_xa = usp_xa;
>> +	usp_xa = NULL;
>> +	do {
>> +		working_xa = cxl_switch_iterate_coordinates(iter_xa);
>> +		if (IS_ERR(working_xa)) {
> 
> Maybe worth a traditional end of function cleanup and goto as there
> is always one of these to free. You could use __free() magic but
> that becomes less obvious when iterating so up to you.
> (there are examples in tree such as in property.h that do iteration
> with __free() magic but I'd not take them as good precedence given
> the author (me) :) )
> 
> err:
> 	free_perf_xa(iter_xa);

Will change to that.

> 
>> +			free_perf_xa(iter_xa);
>> +			return;
>> +		}
>> +		if (working_xa) {
>> +			free_perf_xa(iter_xa);
>> +			iter_xa = working_xa;
>> +		}
>> +	} while (working_xa);
>> +
>> +	/*
>> +	 * Aggregate the bandwidths in the xarray and update the region
>> +	 * bandwidths with the newly calculated bandwidths.
>> +	 */
>> +	cxl_region_update_access_coordinate(cxlr, iter_xa);
>> +	free_perf_xa(iter_xa);
>> +}
>> +
>> +void cxl_region_perf_data_calculate(struct cxl_region *cxlr,
>> +				    struct cxl_endpoint_decoder *cxled)
>> +{
>> +	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
>> +	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>> +	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlds);
> 
> Trivial and obviously this is in original code but you could take opportunity for
> 
> 	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);

Will update.

> 
> as cxlds isn't accessed in here otherwise.
> Nor is cxlmd, but folding that in as well feels too far to go.
> 
> 
>> +	struct cxl_dpa_perf *perf;
>> +
>> +	perf = cxl_memdev_get_dpa_perf(mds, cxlr->mode);
>> +	if (IS_ERR(perf))
>> +		return;
>> +
>> +	lockdep_assert_held(&cxl_dpa_rwsem);
>> +
>> +	if (!dpa_perf_contains(perf, cxled->dpa_res))
>>  		return;
>>  
>>  	for (int i = 0; i < ACCESS_COORDINATE_MAX; i++) {
> 
> 
> 
> 
> 
>> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
>> index 887ed6e358fb..2c7509a9a943 100644
>> --- a/drivers/cxl/core/port.c
>> +++ b/drivers/cxl/core/port.c
>> @@ -2259,6 +2259,32 @@ int cxl_endpoint_get_perf_coordinates(struct cxl_port *port,
>>  }
>>  EXPORT_SYMBOL_NS_GPL(cxl_endpoint_get_perf_coordinates, CXL);
>>  
>> +int cxl_port_get_switch_dport_bandwidth(struct cxl_port *port,
>> +					struct access_coordinate *c)
>> +{
>> +	struct cxl_dport *dport;
>> +	struct cxl_port *iter;
> 
> The iter naming seems odd as no iterating going on..
> It's only there I think to check we aren't on a root port?
> 
> So
> 
> 	struct cxl_dport *dport = port->parent_dport;
> 	
> 	/* Check this port is connected to a switch DSP and not an RP */
> 	if (parent_port_is_cxl_root(to_cxl_port(port->dev.parent))
> 		return -ENODEV;
> 
> 	if (!coord...
> 
>  as before

Will simplify.

> 
> 
>> +
>> +	iter = port;
>> +	dport = port->parent_dport;
>> +	iter = to_cxl_port(iter->dev.parent);
>> +
>> +	if (parent_port_is_cxl_root(iter))
>> +		return -ENODEV;
>> +
>> +	if (!coordinates_valid(dport->coord)) {
>> +		dev_warn(dport->dport_dev, "XXX %s invalid coords\n", __func__);
> 
> XXX?  I guess a marker you were using to find it during debuging that
> wants to go away now.

Will remove. Leaked in when I was debugging. 
> 
> 
>> +		return -EINVAL;
>> +	}
>> +
>> +	for (int i = 0; i < ACCESS_COORDINATE_MAX; i++) {
>> +		c[i].read_bandwidth = dport->coord[i].read_bandwidth;
>> +		c[i].write_bandwidth = dport->coord[i].write_bandwidth;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>