On Wed, 3 Jul 2024 17:58:35 -0700 Dan Williams <dan.j.williams@xxxxxxxxx> wrote:

> Dave Jiang wrote:
> > The current bandwidth calculation aggregates all the targets. This simple
> > method does not take into account cases where multiple targets share
> > a switch, where the aggregated bandwidth can be less or greater than the
> > upstream link of the switch.
> >
> > To accurately account for the shared upstream link cases, a new update
> > function is introduced that walks from the leaves to the root of the
> > hierarchy, adjusting the bandwidth in the process. This is done
> > when all the targets for a region are present but before the final values
> > are sent to the HMAT handling code that caches the access_coordinate targets.
> >
> > The original perf calculation path was kept to calculate the latency
> > performance data, which does not require the shared link consideration.
> > The shared upstream link calculation is done as a second pass when all
> > the endpoints have arrived.
>
> The complication of this algorithm really wants some Documentation for
> regression testing it. Can you include some "how to test this" or how it
> was tested notes?

Hi Dave, Dan,

FWIW I finally managed to get a flexible QEMU setup for testing this and it
looks almost perfect wrt the reported BW. Note I can't do 2-layer switches
without fixing some other assumptions in the CXL QEMU emulation (shortcuts
we took a long time ago), and I'm hoping your maths is good for anything
that isn't 1 or 2 devices at each level.

I've got one case that isn't giving the right answer. Imagine a system with
a CPU die between two IO dies (each in their own NUMA nodes - typically
because there are multiple CPU dies and we care about the latencies /
bandwidths from each to the host bridges). If we always interleaved then we
would just make one magic GP node, but we might not, for reasons of what
else is below those ports. Equal distance, opposite direction on the
interconnect, so separate BW.
Topology wise this isn't a fairy tale btw, so we should make it work. Can
also easily end up with this if people are doing cross socket interleave.

 _____________  ____________  ____________
|             ||            ||            |
|   IO Die    ||  CPU Die   ||   IO Die   |
|             ||            ||            |
|    HB A     ||            ||    HB B    |
|_____________||____________||____________|
       |                           |
      etc.

ACPI / Discoverable components:

            _____________
           |             |
           |  CPU node 0 |
           |_____________|
              |       |
     _________|___   _|___________
    |             | |             |
    |  GP Node 1  | |  GP Node 2  |
    |             | |             |
    |    HB A     | |    HB B     |
    |_____________| |_____________|
           |               |
          RPX             RPY
           |               |
        Type3J          Type3K

Where the minimum BW should be the sum of the CPU node 0 to GP Node 1 and
GP Node 2 bandwidths, I'm currently seeing it reported as just one of those
(so half what is expected). I've checked the ACPI tables and they're all
correct. I'm not 100% sure why yet, but I suspect it's the bit under the
comment

	/*
	 * Take the min of the downstream aggregated bandwidth and the
	 * GP provided bandwidth if the parent is CXL Root.
	 */

In the case of multiple GP nodes being involved, we need to aggregate
across them, and I don't think that is currently done.

Tests run. CPU and GI (GI nearer to GP, so access 0 and access 1 are
slightly different).

GP / 1HB / 2RP / 2 direct connect type 3
- Minimum BW at GP.
- Minimum BW as sum of links.
- Minimum BW as sum of read / write at devices.

2GP in one NUMA node / 2HB / 1RP per HB / 1 type 3 per RP (should be same
as above)
- Minimum BW at GP.

2GP in separate NUMA nodes - rest as previous
- Minimum BW at GP (should be double the previous one as there is no
  sharing of the HMAT-described part - but it's not :()

GP / 1HB / 1RP / shared link / SW USP / 2 SW DSP / separate link / type 3
- Minimum BW at GP.
- Minimum BW on the shared link.
- Minimum BW on the sum of the SSLBIS values for the switch.
- Minimum BW as sum of separate links.
- Minimum BW on the type 3.
I'll post the patches that add x-speed and x-width controls to the various
downstream ends of links (QEMU link negotiation is minimalist, to put it
lightly, and if your USP is not capable it will merrily let it appear to
train with the DSP end faster than the USP end) and also set all the USPs
to support up to 64G / 16x.

To actually poke the corners I'm hacking the CDAT tables, as we don't have
properties to control those bandwidths yet. I guess that needs to be part
of a series intended to allow general testing of this functionality. I've
only been focusing on BW in these tests, given that's what matters here.

The algorithm you have to make this work is complex, so I'd definitely be
keen to hear if you think it could be simplified - it took me a while to
figure out what was going on in this series.

Jonathan

p.s. Anyone like writing tests? There are lots more tests we could write
for this. I'll include some docs in the cover letter for the QEMU RFC.