On 6/20/24 3:45 AM, Jonathan Cameron wrote: > On Tue, 18 Jun 2024 16:16:41 -0700 > Dave Jiang <dave.jiang@xxxxxxxxx> wrote: > >> The current bandwidth calculation aggregates all the targets. This simple >> method does not take into account where multiple targets sharing under >> a switch where the aggregated bandwidth can be less or greater than the >> upstream link of the switch. >> >> To accurately account for the shared upstream uplink cases, a new update >> function is introduced by walking from the leaves to the root of the >> hierarchy and adjust the bandwidth in the process. This process is done >> when all the targets for a region are present but before the final values >> are send to the HMAT handling code cached access_coordinate targets. >> >> The original perf calculation path was kept to calculate the latency >> performance data that does not require the shared link consideration. >> The shared upstream link calculation is done as a second pass when all >> the endpoints have arrived. >> >> Suggested-by: Jonathan Cameron <jonathan.cameron@xxxxxxxxxx> >> Link: https://lore.kernel.org/linux-cxl/20240501152503.00002e60@xxxxxxxxxx/ >> Signed-off-by: Dave Jiang <dave.jiang@xxxxxxxxx> > > A few trivial things inline. I haven't tested this yet in anger > (as for 'reasons' my test machine is running an emulation of s390 emulating > x86 this morning and so is a bit busy). > > I'll try and get at least some tests done in the near future and the > additional control patches for doing so posted for the QEMU emulation. > Feel free to poke me if I've not given a tested by the end of next week. > > In meantime, lgtm. > Reviewed-by: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> > >> >> --- >> v5: >> - Fix cxl_memdev_get_dpa_perf() default return. (Jonathan) >> - Direct return dpa_perf_contains(). (Jonathan) >> - Fix incorrect bandwidth member reference. (Jonathan) >> - Direct return error for pcie_link_speed_mbps(). (Jonathan) >> - Remove stray edit. (Jonathan) >> - Adjust calculated bandwidth of RPs sharing same host bridge. (Jonathan) >> - Fix uninit var gp_is_root. (kernel test robot) >> --- >> drivers/cxl/core/cdat.c | 411 ++++++++++++++++++++++++++++++++++++-- >> drivers/cxl/core/core.h | 4 +- >> drivers/cxl/core/pci.c | 23 +++ >> drivers/cxl/core/port.c | 20 ++ >> drivers/cxl/core/region.c | 4 + >> drivers/cxl/cxl.h | 1 + >> 6 files changed, 446 insertions(+), 17 deletions(-) >> >> diff --git a/drivers/cxl/core/cdat.c b/drivers/cxl/core/cdat.c >> index fea214340d4b..72f86bc99750 100644 >> --- a/drivers/cxl/core/cdat.c >> +++ b/drivers/cxl/core/cdat.c > > ... > >> +static struct cxl_dpa_perf *cxl_memdev_get_dpa_perf(struct cxl_memdev_state *mds, >> + enum cxl_decoder_mode mode) >> +{ >> + switch (mode) { >> case CXL_DECODER_RAM: >> - perf = &mds->ram_perf; >> - break; >> + return &mds->ram_perf; >> case CXL_DECODER_PMEM: >> - perf = &mds->pmem_perf; >> - break; >> + return &mds->pmem_perf; >> default: >> + return ERR_PTR(-EINVAL); >> + } >> + > > Can't get here so delete the below return. I was worried that the compiler may complain. But pleasantly surprised that the newer compilers are smart enough to figure this out. > >> + return ERR_PTR(-EINVAL); >> +} > >> +static int cxl_endpoint_gather_coordinates(struct cxl_region *cxlr, >> + struct cxl_endpoint_decoder *cxled, >> + struct xarray *usp_xa) >> +{ >> + struct cxl_port *endpoint = to_cxl_port(cxled->cxld.dev.parent); >> + struct access_coordinate pci_coord[ACCESS_COORDINATE_MAX]; >> + struct access_coordinate sw_coord[ACCESS_COORDINATE_MAX]; >> + struct access_coordinate ep_coord[ACCESS_COORDINATE_MAX]; >> + struct cxl_memdev *cxlmd = cxled_to_memdev(cxled); >> + struct cxl_dev_state *cxlds = cxlmd->cxlds; >> + struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlds); >> + struct pci_dev *pdev = to_pci_dev(cxlds->dev); >> + struct cxl_port *parent_port, *gp_port; >> + struct cxl_perf_ctx *perf_ctx; >> + struct cxl_dpa_perf *perf; >> + bool gp_is_root = false; >> + unsigned long index; >> + void *ptr; >> + int rc; >> + >> + if (cxlds->rcd) >> + return -ENODEV; >> + >> + parent_port = to_cxl_port(endpoint->dev.parent); >> + gp_port = to_cxl_port(parent_port->dev.parent); >> + if (is_cxl_root(gp_port)) >> + gp_is_root = true; > > gp_is_root = is_cxl_root(gp_port); > and drop the initialization at declaration. > Or just use is_cxl_root(gp_port) at the check below. > > Could set all these at declaration though I guess a few lines > would be a bit long. I'll clean those up. > >> + >> + perf = cxl_memdev_get_dpa_perf(mds, cxlr->mode); >> + if (IS_ERR(perf)) >> + return PTR_ERR(perf); >> + >> + if (!dpa_perf_contains(perf, cxled->dpa_res)) >> + return -EINVAL; >> + >> + /* >> + * The index for the xarray is the upstream port device of the upstream >> + * CXL switch. >> + */ >> + index = (unsigned long)parent_port->uport_dev; >> + perf_ctx = xa_load(usp_xa, index); >> + if (!perf_ctx) { >> + struct cxl_perf_ctx *c __free(kfree) = >> + kzalloc(sizeof(*perf_ctx), GFP_KERNEL); >> + >> + if (!c) >> + return -ENOMEM; >> + ptr = xa_store(usp_xa, index, c, GFP_KERNEL); >> + if (xa_is_err(ptr)) >> + return xa_err(ptr); >> + perf_ctx = no_free_ptr(c); >> + } >> + >> + /* Direct upstream link from EP bandwidth */ >> + rc = cxl_pci_get_bandwidth(pdev, pci_coord); >> + if (rc < 0) >> + return rc; >> + >> + /* >> + * Min of upstream link bandwidth and Endpoint CDAT bandwidth from >> + * DSLBIS. >> + */ >> + cxl_coordinates_combine(ep_coord, pci_coord, perf->cdat_coord); >> + >> + /* >> + * If grandparent port is root, then there's no switch involved and >> + * the endpoint is connected to a root port. >> + */ >> + if (!gp_is_root) { >> + /* >> + * Retrieve the switch SSLBIS for switch downstream port >> + * associated with the endpoint bandwidth. >> + */ >> + rc = cxl_port_get_switch_dport_bandwidth(endpoint, sw_coord); >> + if (rc) >> + return rc; >> + >> + /* >> + * Min of the earlier coordinates with the switch SSLBIS >> + * bandwidth >> + */ >> + cxl_coordinates_combine(ep_coord, ep_coord, sw_coord); >> + } >> + >> + /* >> + * Aggregate the computed bandwidth with the current aggregated bandwidth >> + * of the endpoints with the same switch upstream port. >> + */ >> + cxl_bandwidth_add(perf_ctx->coord, perf_ctx->coord, ep_coord); >> + perf_ctx->port = parent_port; >> + >> + return 0; >> +} >> + >> +static struct xarray *cxl_switch_iterate_coordinates(struct xarray *input_xa, >> + bool *parent_is_root) >> +{ >> + struct xarray *res_xa __free(kfree) = kzalloc(sizeof(*res_xa), GFP_KERNEL); >> + struct access_coordinate coords[ACCESS_COORDINATE_MAX]; >> + struct cxl_perf_ctx *ctx, *us_ctx; >> + unsigned long index, us_index; >> + void *ptr; >> + int rc; >> + >> + if (!res_xa) >> + return ERR_PTR(-ENOMEM); >> + xa_init(res_xa); >> + >> + *parent_is_root = false; >> + xa_for_each(input_xa, index, ctx) { >> + struct cxl_port *parent_port, *port, *gp_port; >> + struct device *dev = (struct device *)index; >> + struct cxl_dport *dport; >> + struct pci_dev *pdev; >> + bool gp_is_root; >> + >> + gp_is_root = false; >> + port = ctx->port; >> + parent_port = to_cxl_port(port->dev.parent); > > Maybe set those at declaration? Yep will clean up those. > >> + if (is_cxl_root(parent_port)) { >> + *parent_is_root = true; >> + } else { >> + gp_port = to_cxl_port(parent_port->dev.parent); >> + gp_is_root = is_cxl_root(gp_port); >> + } > > From a pure code organization point of view, this could be moved > below the upstream port section to where these are first used. > However I can see some advantage in gathering all the 'who am I' > stuff in one place, even if it costs a few lines of code. > So I'm fine with this if you prefer it. I'll move them and get it cleaned up. > >> + >> + dport = port->parent_dport; >> + >> + /* >> + * Create an xarray entry with the key of the upstream >> + * port of the upstream switch. >> + */ >> + us_index = (unsigned long)parent_port->uport_dev; >> + us_ctx = xa_load(res_xa, us_index); >> + if (!us_ctx) { >> + struct cxl_perf_ctx *n __free(kfree) = >> + kzalloc(sizeof(*n), GFP_KERNEL); >> + >> + if (!n) >> + return ERR_PTR(-ENOMEM); >> + >> + ptr = xa_store(res_xa, us_index, n, GFP_KERNEL); >> + if (xa_is_err(ptr)) >> + return ERR_PTR(xa_err(ptr)); >> + us_ctx = no_free_ptr(n); >> + } >> + >> + us_ctx->port = parent_port; >> + >> + /* >> + * Take the min of the downstream aggregated bandwdith and the > > bandwidth thanks will fix > >> + * GP provided bandwidth if parent is CXL root. >> + */ >> + if (*parent_is_root) { >> + cxl_coordinates_combine(us_ctx->coord, ctx->coord, dport->coord); >> + continue; >> + } >> + >> + /* Below is the calculation for another switch upstream */ >> + if (!gp_is_root) { >> + /* >> + * If the device isn't an upstream PCIe port, there's something >> + * wrong with the topology. >> + */ >> + if (!dev_is_pci(dev)) >> + return ERR_PTR(-EINVAL); >> + >> + /* Retrieve the upstream link bandwidth */ >> + pdev = to_pci_dev(dev); >> + rc = cxl_pci_get_bandwidth(pdev, coords); > > might as well combine these 2 lines and drop the pdev local variable > (which otherwise could have been scoped to this if() block ok will change DJ > > rc = cxl_pci_get_bandwidth(to_pci_dev(dev), coords); > >> + if (rc) >> + return ERR_PTR(-ENXIO); >> + >> + /* >> + * Take the min of downstream bandwidth and the upstream link >> + * bandwidth. >> + */ >> + cxl_coordinates_combine(coords, coords, ctx->coord); >> + >> + /* >> + * Take the min of the calculated bandwdith and the upstream >> + * switch SSLBIS bandwidth. >> + */ >> + cxl_coordinates_combine(coords, coords, dport->coord); >> + >> + /* >> + * Aggregate the calculated bandwidth common to an upstream >> + * switch. >> + */ >> + cxl_bandwidth_add(us_ctx->coord, us_ctx->coord, coords); >> + } else { >> + /* >> + * Aggregate the bandwidth common to an upstream switch. >> + */ >> + cxl_bandwidth_add(us_ctx->coord, us_ctx->coord, >> + ctx->coord); >> + } >> + } >> + >> + return_ptr(res_xa); >> +} >> + >> +static void cxl_region_update_access_coordinate(struct cxl_region *cxlr, >> + struct xarray *input_xa) >> +{ >> + struct access_coordinate coord[ACCESS_COORDINATE_MAX]; >> + struct cxl_perf_ctx *ctx; >> + unsigned long index; >> + >> + memset(coord, 0, sizeof(coord)); >> + xa_for_each(input_xa, index, ctx) >> + cxl_bandwidth_add(coord, coord, ctx->coord); >> + >> + for (int i = 0; i < ACCESS_COORDINATE_MAX; i++) { >> + cxlr->coord[i].read_bandwidth = coord[i].read_bandwidth; >> + cxlr->coord[i].write_bandwidth = coord[i].write_bandwidth; >> + } >> +} >> + >> +static void free_perf_xa(struct xarray *xa) >> +{ >> + struct cxl_perf_ctx *ctx; >> + unsigned long index; >> + >> + if (!xa) >> + return; >> + >> + xa_for_each(xa, index, ctx) >> + kfree(ctx); >> + xa_destroy(xa); >> + kfree(xa); >> +} >> + >> +/* >> + * cxl_region_shared_upstream_perf_update - Recalculate the access coordinates >> + * @cxl_region: the cxl region to recalculate >> + * >> + * For certain region construction with endpoints behind CXL switches, >> + * there is the possibility of the total bandwdith for all the endpoints >> + * behind a switch being less or more than the switch upstream link. The >> + * algorithm assumes the configuration is a symmetric topology as that >> + * maximizes performance. >> + * >> + * There can be multiple switches under a RP. There can be multiple RPs under >> + * a HB. >> + * >> + * An example hierarchy: >> + * >> + * CFMWS 0 >> + * | >> + * _________|_________ >> + * | | >> + * ACPI0017-0 ACPI0017-1 >> + * GP0/HB0/ACPI0016-0 GP1/HB1/ACPI0016-1 >> + * | | | | >> + * RP0 RP1 RP2 RP3 >> + * | | | | >> + * SW 0 SW 1 SW 2 SW 3 >> + * | | | | | | | | >> + * EP0 EP1 EP2 EP3 EP4 EP5 EP6 EP7 >> + * >> + * Computation for the example hierarchy: >> + * >> + * Min (GP0 to CPU BW, >> + * Min(SW 0 Upstream Link to RP0 BW, >> + * Min(SW0SSLBIS for SW0DSP0 (EP0), EP0 DSLBIS, EP0 Upstream Link) + >> + * Min(SW0SSLBIS for SW0DSP1 (EP1), EP1 DSLBIS, EP1 Upstream link)) + >> + * Min(SW 1 Upstream Link to RP1 BW, >> + * Min(SW1SSLBIS for SW1DSP0 (EP2), EP2 DSLBIS, EP2 Upstream Link) + >> + * Min(SW1SSLBIS for SW1DSP1 (EP3), EP3 DSLBIS, EP3 Upstream link))) + >> + * Min (GP1 to CPU BW, >> + * Min(SW 2 Upstream Link to RP2 BW, >> + * Min(SW2SSLBIS for SW2DSP0 (EP4), EP4 DSLBIS, EP4 Upstream Link) + >> + * Min(SW2SSLBIS for SW2DSP1 (EP5), EP5 DSLBIS, EP5 Upstream link)) + >> + * Min(SW 3 Upstream Link to RP3 BW, >> + * Min(SW3SSLBIS for SW3DSP0 (EP6), EP6 DSLBIS, EP6 Upstream Link) + >> + * Min(SW3SSLBIS for SW3DSP1 (EP7), EP7 DSLBIS, EP7 Upstream link)))) >> + */ >> +void cxl_region_shared_upstream_perf_update(struct cxl_region *cxlr) >> +{ >> + struct xarray *usp_xa, *iter_xa, *working_xa; >> + bool is_root; >> + int rc; >> + >> + lockdep_assert_held(&cxl_dpa_rwsem); >> + >> + usp_xa = kzalloc(sizeof(*usp_xa), GFP_KERNEL); >> + if (!usp_xa) >> return; >> + >> + xa_init(usp_xa); >> + >> + /* >> + * Collect aggregated endpoint bandwidth and store the bandwidth in >> + * an xarray indexed by the upstream port of the switch or RP. The >> + * bandwidth is aggregated per switch. Each endpoint consists of the >> + * minimum of bandwidth from DSLBIS from the endpoint CDAT, the endpoint >> + * upstream link bandwidth, and the bandwidth from the SSLBIS of the >> + * switch CDAT for the switch upstream port to the downstream port that's >> + * associated with the endpoint. If the device is directly connected to >> + * a RP, then no SSLBIS is involved. >> + */ >> + for (int i = 0; i < cxlr->params.nr_targets; i++) { >> + struct cxl_endpoint_decoder *cxled = cxlr->params.targets[i]; >> + >> + rc = cxl_endpoint_gather_coordinates(cxlr, cxled, usp_xa); >> + if (rc) { >> + free_perf_xa(usp_xa); >> + return; >> + } >> } >> >> + iter_xa = usp_xa; >> + usp_xa = NULL; >> + /* >> + * Iterate through the components in the xarray and aggregate any >> + * component that share the same upstream link from the switch. >> + * The iteration takes consideration of multi-level switch >> + * hierarchy. >> + * >> + * When cxl_switch_iterate_coordinates() detect the grandparent >> + * upstream is a cxl root, it aggregates the bandwidth under >> + * the host bridge. A RP does not have SSLBIS nor does it have >> + * upstream PCIe link. >> + * >> + * When cxl_switch_iterate_coordinates() detect the parent upstream >> + * is a cxl root, it takes the min() of the aggregated RP perfs and >> + * the bandwidth from the Generic Port (GP). 'is_root' is set at this >> + * point as well. >> + */ >> + do { >> + working_xa = cxl_switch_iterate_coordinates(iter_xa, &is_root); >> + if (IS_ERR(working_xa)) >> + goto out; >> + free_perf_xa(iter_xa); >> + iter_xa = working_xa; >> + } while (!is_root); >> + >> + /* >> + * Aggregate the bandwidths in the xarray (for all the HBs) and update >> + * the region bandwidths with the newly calculated bandwidths. >> + */ >> + cxl_region_update_access_coordinate(cxlr, iter_xa); >> + >> +out: >> + free_perf_xa(iter_xa); >> +} >