On Wed, 3 Jul 2024 17:58:35 -0700 Dan Williams <dan.j.williams@xxxxxxxxx> wrote:

> Dave Jiang wrote:
> > The current bandwidth calculation aggregates all the targets. This simple
> > method does not take into account cases where multiple targets share
> > a switch, where the aggregated bandwidth can be less or greater than the
> > upstream link of the switch.
> >
> > To accurately account for the shared upstream link cases, a new update
> > function is introduced that walks from the leaves to the root of the
> > hierarchy, adjusting the bandwidth in the process. This is done
> > when all the targets for a region are present but before the final values
> > are sent to the HMAT handling code that caches the access_coordinate targets.
> >
> > The original perf calculation path was kept to calculate the latency
> > performance data, which does not require the shared link consideration.
> > The shared upstream link calculation is done as a second pass when all
> > the endpoints have arrived.
>
> The complication of this algorithm really wants some Documentation for
> regression testing it. Can you include some "how to test this" or how it
> was tested notes?

Hi Dave, Dan,

FWIW I finally managed to get a flexible QEMU setup for testing this and it
looks almost perfect wrt the reported BW. Note I can't do 2-layer switches
without fixing some other assumptions in the CXL QEMU emulation (shortcuts
we took a long time ago), and I'm hoping your maths is good for anything
that isn't 1 or 2 devices at each level.

I've got one case that isn't giving the right answer. Imagine a system with
a CPU die between two IO dies (each in their own NUMA nodes - typically
because there are multiple CPU dies and we care about the latencies /
bandwidths from each to the host bridges). If we always interleaved then we
would just make one magic GP node, but we might not, for reasons of what
else is below those ports. Equal distance, opposite direction on the
interconnect, so separate BW.
Topology wise this isn't a fairy tale btw, so we should make it work. Can
also easily end up with this if people are doing cross socket interleave.

 _____________  ____________  ____________
|             ||            ||            |
|   IO Die    ||  CPU Die   ||   IO Die   |
|             ||            ||            |
|    HB A     ||            ||    HB B    |
|_____________||____________||____________|
       |                           |
      etc.

ACPI / Discoverable components:

            _____________
           |             |
           |  CPU node 0 |
           |_____________|
              |       |
     _________|___   _|___________
    |             | |             |
    |  GP Node 1  | |  GP Node 2  |
    |             | |             |
    |    HB A     | |    HB B     |
    |_____________| |_____________|
           |               |
          RPX             RPY
           |               |
        Type3J          Type3K

Where the minimum BW should be the sum of the CPU node 0 to GP Node 1 and
GP Node 2 bandwidths, I'm currently seeing it reported as just one of those
(so half what is expected). I've checked the ACPI tables and they're all
correct. I'm not 100% sure why yet, but I suspect it's the bit under the
comment

	/*
	 * Take the min of the downstream aggregated bandwidth and the
	 * GP provided bandwidth if the parent is CXL Root.
	 */

In the case of multiple GP nodes being involved, we need to aggregate
across them, and I don't think that is currently done.

Tests run. CPU and GI (GI nearer to GP, so access 0 and access 1 are
slightly different).

GP / 1HB / 2RP / 2 direct connect type 3
- Minimum BW at GP.
- Minimum BW as sum of links.
- Minimum BW as sum of read / write at devices.

2GP in one NUMA node / 2HB / 1RP per HB / 1 type 3 per RP (should be same
as above)
- Minimum BW at GP.

2GP in separate NUMA nodes - rest as previous
- Minimum BW at GP (should be double the previous one as there is no
  sharing of the HMAT-described part - but it's not :()

GP / 1HB / 1RP / shared link / SW USP / 2 SW DSP / separate link / type 3
- Minimum BW at GP.
- Minimum BW on the shared link.
- Minimum BW on the sum of the SSLBIS values for the switch.
- Minimum BW as sum of separate links.
- Minimum BW on the type 3.
I'll post the patches that add x-speed and x-width controls to the various
downstream ends of links (QEMU link negotiation is minimalist, to put it
lightly, and if your USP is not capable it will merrily let it appear to
train with the DSP end faster than the USP end) and also set all the USPs
to support up to 64G / 16x.

To actually poke the corners I'm hacking the CDAT tables, as we don't have
properties to control those bandwidths yet. I guess that needs to be part
of a series intended to allow general testing of this functionality. I've
only been focusing on BW in these tests, given that's what matters here.

The algorithm you have to make this work is complex, so I'd definitely be
keen to hear if you think it could be simplified - it took me a while to
figure out what was going on in this series.

Jonathan

p.s. Anyone like writing tests? There are lots more tests we could write
for this. I'll include some docs in the cover letter for the QEMU RFC.