On Thu, Feb 11, 2021 at 1:44 AM Jonathan Cameron
<Jonathan.Cameron@xxxxxxxxxx> wrote:
>
> On Wed, 10 Feb 2021 08:24:51 -0800
> Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
>
> > On Wed, Feb 10, 2021 at 3:24 AM Jonathan Cameron
> > <Jonathan.Cameron@xxxxxxxxxx> wrote:
> > >
> > > On Tue, 9 Feb 2021 19:55:05 -0800
> > > Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
> > >
> > > > While the platform BIOS is able to describe the performance
> > > > characteristics of CXL memory that is present at boot, it is unable
> > > > to statically enumerate the performance of CXL memory hot-inserted
> > > > post-boot. The OS can enumerate most of the characteristics from
> > > > link registers and CDAT, but the performance from the CPU to the
> > > > host bridge, for example, is not enumerated by PCIe or CXL.
> > > > Introduce an ACPI mechanism for this purpose. Critically, this is
> > > > achieved with a small tweak to how the existing Generic Initiator
> > > > proximity domain is utilized in the HMAT.
> > >
> > > Hi Dan,
> > >
> > > Agree there is a hole here, but I think the proposed solution has
> > > some issues for backwards compatibility.
> > >
> > > Just to clarify, I believe CDAT from root ports is sufficient for
> > > the other direction (GI on CXL, memory in host). I wondered
> > > initially if this was a two-way issue, but after a reread I think
> > > that case is fine with the root port providing CDAT, or potentially
> > > by treating the root port as a GI (though that runs into the same
> > > naming / representation issue as below and I think would need some
> > > clarifying text in the UEFI GI description).
> > >
> > > http://uefi.org/sites/default/files/resources/Coherent%20Device%20Attribute%20Table_1.01.pdf
> > >
> > > For the case you are dealing with here we could potentially add
> > > something to CDAT as an alternative to changing SRAT, but it would
> > > be more complex, so your approach makes more sense to me.
> >
> > CDAT seems the wrong mechanism because it identifies target
> > performance once you're at the front door of the device, not
> > performance relative to a given initiator.
>
> I'd argue you could make CDAT a more symmetric representation, but it
> would end up replicating a lot of info already in HMAT. Didn't say it
> was a good idea!

CDAT describes points, HMAT describes edges on the performance graph;
it would be confusing if CDAT tried to supplant HMAT.

> That's an odd situation in that the BIOS sort of 'half' manages it.
> We probably need some supplementary docs around this topic, as the OS
> would need to be aware of that possibility and explicitly check for it
> before doing its normal build based on CDAT + what you are proposing
> here. Maybe code is enough, but given this is cross-OS stuff I'd argue
> it probably isn't.
>
> I guess we could revisit this draft UEFI white paper and add a bunch
> of examples around this use case:
> https://github.com/hisilicon/acpi-numa-whitepaper

Thanks for the reference, I'll take a look.

> > > >
> > > > # Impact of the Change
> > > >
> > > > The existing Generic Initiator Affinity Structure (ACPI 6.4
> > > > Section 5.2.16.6) already contains all the fields necessary to
> > > > enumerate a generic target proximity domain. All that is missing
> > > > is the interpretation of that proximity domain optionally as a
> > > > target identifier in the HMAT.
> > > >
> > > > Given that the OS still needs to dynamically enumerate and
> > > > instantiate the memory ranges behind the host bridge, the
> > > > assumption is that operating systems that do not support native
> > > > CXL enumeration will ignore this data in the HMAT, while
> > > > environments aware of native CXL enumeration will use this
> > > > fragment of the performance path to calculate the performance
> > > > characteristics.
> > >
> > > I don't think it is true that an OS not supporting native CXL will
> > > ignore the data.
> >
> > True, I should have chosen more careful words like s/ignore/not
> > regress upon seeing/
>
> It's a sticky corner and I suspect it is likely to come up in the ACPI
> WG - what is being proposed here isn't backwards compatible

It seems our definitions of backwards compatible are divergent. Please
correct me if I'm wrong, but I understand your position to be "any
perceptible OS behavior change breaks backwards compatibility",
whereas I'm advocating that backwards compatibility is about not
regressing real-world use cases. That said, I do need to go mock this
up in QEMU and verify how much disturbance it causes.

> even if the impacts in Linux are small.

I'd note the kernel would grind to a halt if the criterion for
"backwards compatible" were zero perceptible behavior change.

> Mostly it's infrastructure bring-up that won't get used (fallback
> lists and similar for a node which will never be specified in
> allocations) and some confusing userspace ABI (which is more than a
> little confusing already).

Fallback lists are established relative to online nodes. These generic
targets are never onlined as memory.

> > > Linux will create a small amount of infrastructure to reflect them
> > > (more or less the same as for a memoryless node) and they will
> > > also appear in places like access0 as a possible initiator of
> > > transactions. It's small stuff, but I'd rather the impact on
> > > legacy was zero.
> >
> > I'm failing to see that small collision as fatal to the proposal.
> > The HMAT parsing had a significant bug for multiple kernel releases
> > and no one noticed. This quirk is minor in comparison.
>
> True, there is a lag in HMAT adoption - though for ACPI tables, not
> that long (only a couple of years :)
>
> > > So my gut feeling here is we shouldn't reuse the generic
> > > initiator, but should invent something new. It would look similar
> > > to a GI, but with a different ID - to ensure a legacy OS ignores
> > > it.
> >
> > A new id introduces more problems than it solves. Setting aside the
> > ACPICA thrash, it does not allow a clean identity mapping of a point
> > in a system topology being both initiator and target. The SRAT does
> > not need more data structures to convey this information. At most I
> > would advocate for an OSC bit for the OS to opt into allowing this
> > new usage in the HMAT, but that still feels like overkill absent a
> > clear regression in legacy environments.
>
> OSC for this case doesn't work. You can't necessarily evaluate it
> early enough in the boot - in Linux the node setup happens before AML
> parsing comes up. HMAT is evaluated a lot later, but SRAT is too
> early. Plus, in theory, another OS is allowed to evaluate HMAT before
> OSC is available.

The Linux node setup for online memory is before OSC parsing, but
there's nothing to "online" with a GI/GT entry. Also, if this were a
problem it would already be impacting the OS today, because this
proposal only changes HMAT, not SRAT. Lastly, there *is* an OSC bit
for GI, so either that's vestigial and needs to be removed, or OSC is
relevant for this case.
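
To make that concrete, a rough sketch of how the gate could look on
the Linux side. hmat_gt_allowed() and osc_sb_confirmed_caps are
made-up names for illustration; OSC_SB_GENERIC_INITIATOR_SUPPORT is
the GI capability bit that already exists in include/linux/acpi.h:

#include <linux/acpi.h>

/* Hypothetical: capabilities confirmed during \_SB._OSC negotiation */
extern u32 osc_sb_confirmed_caps;

static bool hmat_gt_allowed(void)
{
	/*
	 * HMAT parsing runs from a device_initcall(), well after _OSC
	 * negotiation, so the result is available here even though
	 * SRAT itself is parsed before AML comes up.
	 */
	return osc_sb_confirmed_caps & OSC_SB_GENERIC_INITIATOR_SUPPORT;
}

The HMAT parser would then skip generic-target registration when that
returns false, leaving legacy behavior untouched.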

>
> > The fact that hardly anyone is using HMAT (as indicated by the bug
> > I mentioned) gives me confidence that perfection is more "enemy of
> > the good" than required here.
>
> How about taking this another way:
>
> 1) Assume that the costs of 'false' GI nodes on a legacy system as a
>    result of this are minor - so just live with it. (probably true,
>    but as ever, need to confirm with other OSes)
>
> 2) Try to remove the cost of pointless infrastructure on 'aware'
>    kernels. Add a flag to the GI entry to say it's a bridge and not
>    expected, in and of itself, to represent an initiator or a target.
>    In Linux we then don't create the node infrastructure etc. or
>    assign any devices to the non-existent NUMA node.
>
> The information is still there to combine with device info (CDAT)
> etc. and build what we eventually want in the way of a representation
> of the topology that Linux can use.
>
> Now we just have the 'small' problem of figuring out how to actually
> implement hotplugging of NUMA nodes.

I think it's tiny. Just pad the "possible" nodes past what SRAT
enumerates.
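
i.e. something like the below at SRAT parse time. CXL_SPARE_NODES and
reserve_spare_cxl_nodes() are invented for illustration; the nodemask
helpers are the existing kernel interfaces:

#include <linux/init.h>
#include <linux/nodemask.h>

/* Hypothetical headroom for post-boot CXL proximity domains */
#define CXL_SPARE_NODES 8

/*
 * Run after SRAT parsing has populated node_possible_map: mark a few
 * extra node ids as possible so per-node data is allocated up front
 * and a hot-added CXL memory range can claim one later without
 * needing true NUMA-node hotplug.
 */
static void __init reserve_spare_cxl_nodes(void)
{
	int nid, spare = CXL_SPARE_NODES;

	for (nid = 0; nid < MAX_NUMNODES && spare; nid++) {
		if (node_possible(nid))
			continue;
		node_set(nid, node_possible_map);
		spare--;
	}
}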