On Fri, Feb 12, 2021 at 4:26 AM Jonathan Cameron
<Jonathan.Cameron@xxxxxxxxxx> wrote:
>
> On Thu, 11 Feb 2021 09:06:51 -0800
> Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
>
> > On Thu, Feb 11, 2021 at 1:44 AM Jonathan Cameron
> > <Jonathan.Cameron@xxxxxxxxxx> wrote:
> > >
> > > On Wed, 10 Feb 2021 08:24:51 -0800
> > > Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
> > >
> > > > On Wed, Feb 10, 2021 at 3:24 AM Jonathan Cameron
> > > > <Jonathan.Cameron@xxxxxxxxxx> wrote:
> > > > >
> > > > > On Tue, 9 Feb 2021 19:55:05 -0800
> > > > > Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
> > > > >
> > > > > > While the platform BIOS is able to describe the performance
> > > > > > characteristics of CXL memory that is present at boot, it is unable to
> > > > > > statically enumerate the performance of CXL memory hot inserted
> > > > > > post-boot. The OS can enumerate most of the characteristics from link
> > > > > > registers and CDAT, but the performance from the CPU to the host
> > > > > > bridge, for example, is not enumerated by PCIe or CXL. Introduce an
> > > > > > ACPI mechanism for this purpose. Critically this is achieved with a
> > > > > > small tweak to how the existing Generic Initiator proximity domain is
> > > > > > utilized in the HMAT.
> > > > >
> > > > > Hi Dan,
> > > > >
> > > > > Agree there is a hole here, but I think the proposed solution has some
> > > > > issues for backwards compatibility.
> > > > >
> > > > > Just to clarify, I believe CDAT from root ports is sufficient for the
> > > > > other direction (GI on CXL, memory in host). I wondered initially if
> > > > > this was a two way issue, but after a reread, I think that is fine
> > > > > with the root port providing CDAT, or potentially treating the root
> > > > > port as a GI (though that runs into the same naming / representation issue
> > > > > as below and I think would need some clarifying text in the UEFI GI description)
> > > > >
> > > > > http://uefi.org/sites/default/files/resources/Coherent%20Device%20Attribute%20Table_1.01.pdf
> > > > >
> > > > > For the case you are dealing with here potentially we 'could' add something
> > > > > to CDAT as an alternative to changing SRAT, but it would be more complex
> > > > > so your approach here makes more sense to me.
> > > >
> > > > CDAT seems the wrong mechanism because it identifies target
> > > > performance once you're at the front door of the device, not
> > > > performance relative to a given initiator.
> > >
> > > I'd argue you could make CDAT a more symmetric representation, but it would
> > > end up replicating a lot of info already in HMAT. Didn't say it was a good
> > > idea!
> >
> > CDAT describes points, HMAT describes edges on the performance graph;
> > it would be confusing if CDAT tried to supplant HMAT.
>
> Entirely agreed. Note I'm not disagreeing with the approach here at all, simply
> trying to point out where I think you'll get questions taking this to ACPI.

Understood.

> > >
> > > That's an odd situation where it sort of 'half' manages it in the BIOS.
> > > We probably need some supplementary additional docs around this topic
> > > as the OS would need to be aware of that possibility and explicitly check
> > > for it before doing its normal build based on CDAT + what you are proposing
> > > here. Maybe code is enough, but given this is cross-OS stuff I'd argue
> > > it probably isn't.
> > >
> > > I guess we could revisit this draft UEFI white paper perhaps and add a bunch
> > > of examples around this usecase: https://github.com/hisilicon/acpi-numa-whitepaper
> >
> > Thanks for the reference, I'll take a look.
> >
> > > > > > # Impact of the Change
> > > > > > The existing Generic Initiator Affinity Structure (ACPI 6.4 Section
> > > > > > 5.2.16.6) already contains all the fields necessary to enumerate a
> > > > > > generic target proximity domain. All that is missing is the
> > > > > > interpretation of that proximity domain optionally as a target
> > > > > > identifier in the HMAT.
> > > > > >
> > > > > > Given that the OS still needs to dynamically enumerate and instantiate
> > > > > > the memory ranges behind the host bridge, the assumption is that
> > > > > > operating systems that do not support native CXL enumeration will ignore
> > > > > > this data in the HMAT, while CXL native enumeration aware environments
> > > > > > will use this fragment of the performance path to calculate the
> > > > > > performance characteristics.
> > > > >
> > > > > I don't think it is true that an OS not supporting native CXL will ignore
> > > > > the data.
> > > >
> > > > True, I should have chosen more careful words like s/ignore/not
> > > > regress upon seeing/
> > >
> > > It's a sticky corner and I suspect likely to come up in the ACPI WG - what is
> > > being proposed here isn't backwards compatible
> >
> > It seems our definitions of backwards compatible are divergent. Please
> > correct me if I'm wrong, but I understand your position to be "any
> > perceptible OS behavior change breaks backwards compatibility",
> > whereas I'm advocating that backwards compatibility is about not
> > regressing real world use cases. That said, I do need to go mock this
> > up in QEMU and verify how much disturbance it causes.
> > >
> > > even if the impacts in Linux are small.
> >
> > I'd note the kernel would grind to a halt if the criterion for
> > "backwards compatible" was zero perceptible behavior change.
>
> Key thing here is the difference between Linux backwards compatibility and
> ACPI backwards compatibility. Linux impacts are easily understood because
> we can go look. ACPI change impacts in general are rather trickier to
> define.
>
> Note currently I'm fine with this change, though I think perhaps it could
> be improved - simply raising that others may not be.

I appreciate that, and as you are the originator of GI support I hold your
feedback in high regard. Yes, the criterion that it needs to be a functional
regression vs just a behavior change may be a difficult argument to carry
with others.

> > >
> > > Mostly it's infrastructure bring-up that won't get used
> > > (fallback lists and similar for a node which will never be specified in
> > > allocations) and some confusing userspace ABI (which is more than a little
> > > confusing already).
> >
> > Fallback lists are established relative to online nodes. These generic
> > targets are not onlined as memory.
>
> This doesn't really matter beyond the point that there is stuff done for
> a GI-only node currently that doesn't make any sense for this new usecase.
>
> Still, you got me digging...
>
> I may have the wrong term there, but GI nodes have a zonelist generated
> that reflects where an allocation requested on the GI node will actually
> go.
>
> devm_kzalloc() for example for a device in one of these nodes has to know
> where to allocate memory from. The same is true for other resources that
> don't exist in the GI node.
>
> Checking this worked as expected led me down a rabbit hole. I knew the end
> to end test did what I expected but wasn't certain why...
>
> Anyhow, at least for my test case on x86 doing an 8k devm_kzalloc() it works
> as follows.
>
> Some way down the nest of calls we get a call to
>
> __alloc_pages_nodemask()
>   -> prepare_alloc_pages() // this finds the zonelist
>     -> node_zonelist(gi_node, ...)
>       -> NODE_DATA(gi_node)->node_zonelists + gfp_zonelist(flags)
>          i.e. node_data[gi_node]->node_zonelists + gfp_zonelist(flags)
>
> The node_data entry for the gi_node has its own zonelist which is used in
> this path, the first entry of which is an appropriate zone on the nearest
> node which actually has some memory.
>
> The build is referenced by a comment in x86/mm/numa.c under
> init_memory_less_node(): "All zonelists will be built later in
> start_kernel() after per cpu areas are initialized".
>
> build_zonelists() does the expected building based on SLIT distances.
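For reference, the lookup you traced bottoms out in these two helpers from
include/linux/gfp.h (quoting them from memory, so the exact shape may differ
a little between kernel versions):

/* index into node_zonelists[]: fallback list unless __GFP_THISNODE */
static inline int gfp_zonelist(gfp_t flags)
{
#ifdef CONFIG_NUMA
	if (unlikely(flags & __GFP_THISNODE))
		return ZONELIST_NOFALLBACK;
#endif
	return ZONELIST_FALLBACK;
}

static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
{
	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
}

So a memoryless GI node only "works" as an allocation target because
build_zonelists() gave it a SLIT-distance-ordered fallback list pointing at
nodes that do have memory.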
Ugh, I had been holding onto the mental model that GI support was limited to
the HMAT sysfs representation. What is the use case for init_gi_nodes()? I
would expect GI information to only be useful for performance calculations,
and that can be done without touching the node infrastructure.

At first glance that metadata manipulation looks like a mistake to me; I wish
I had paid more attention as that went in... Regardless, this does indeed
change my view of Generic Target being a minor impact. Why does GI need
anything more than acpi_map_pxm_to_node() to have a node number assigned?
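To make the question concrete, this is roughly all I expected the native CXL
enumeration path to need from ACPI NUMA: the existing pxm-to-node mapping,
with the resulting node id used purely as a key into the HMAT data. The
cxl_generic_target_to_node() wrapper below is made up for illustration and
untested; acpi_map_pxm_to_node() is the existing helper in
drivers/acpi/numa/srat.c:

#include <linux/acpi.h>
#include <linux/numa.h>

/*
 * Hypothetical sketch: resolve a generic target proximity domain to a
 * node id so the CPU-to-host-bridge hop can be looked up in the HMAT,
 * without onlining the node or building zonelists for it.
 */
static int cxl_generic_target_to_node(int pxm)
{
	int node;

	/* existing SRAT helper: assigns (or reuses) a node id for this pxm */
	node = acpi_map_pxm_to_node(pxm);
	if (node == NUMA_NO_NODE)
		return NUMA_NO_NODE;

	/*
	 * No node_set_online(), no init_gi_nodes()-style registration;
	 * the node id is only a key for the performance calculation.
	 */
	return node;
}

That is the entire node-infrastructure footprint I had pictured for a
generic target.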