On Tue, Feb 16, 2021 at 3:08 AM Jonathan Cameron
<Jonathan.Cameron@xxxxxxxxxx> wrote:
[..]
> > Why does GI need anything more than acpi_map_pxm_to_node() to have a
> > node number assigned?
>
> It might have been possible (with limitations) to do it by making
> multiple proximity domains map to a single numa node, along with some
> additional functionality to allow it to retrieve the real node for
> aware drivers, but seeing as we already had the memoryless node
> infrastructure in place, it fitted more naturally into that scheme.
> GI's introduction to the ACPI spec, and indeed the kernel, was
> originally driven by the needs of CCIX (before CXL was public), with
> CCIX's symmetric view of initiators (CPU or other), plus a few other
> existing situations where we'd been papering over the topology for
> years and paying a cost in custom load balancing in drivers etc. That
> more symmetric view meant that the natural approach was to treat these
> as memoryless nodes.
>
> The full handling of nodes is needed to deal with situations like the
> following contrived setup. With a few interconnect links I haven't
> bothered drawing, there are existing systems where a portion of the
> topology looks like this:
>
>
>   RAM                       RAM        RAM
>    |                         |          |
> --------    ----------    --------   --------
> |  a   |    |   b    |    |  c   |   |  d   |
> | CPUs |----| PCI RC |----| CPUs |---| CPUs |
> |      |    |        |    |      |   |      |
> --------    ----------    --------   --------
>                 |
>               PCI EP
>
> We need the GI representation to allow an "aware" driver to understand
> that the PCI EP is an equal distance from the CPUs and RAM on (a) and
> (c) (and that using allocations from (d) is a bad idea). This would be
> the same as a driver running on a PCI RC attached to a memoryless CPU
> node (you would hope no one would build one of those, but I've seen
> them occasionally). Such an aware driver carefully places both memory
> and processing threads / interrupts etc. to balance the load.

That's an explanation for why GI exists, not an explanation for why a
GI needs to be anything more than a translated Linux numa node number
and an API to look up distance.

> In pre-GI days, we could just drop (b) into (a) or (c) and not worry
> about it, but that comes with a large performance cost (20% plus on
> network throughput on some of our more crazy systems, due to it
> appearing that balancing memory load across (a) and (c) doesn't make
> sense). Also, if we happened to drop it into (c), then once we run out
> of space on (c) we'll start using (d), which is a bad idea.
>
> With GI nodes, you need an unaware PCI driver to work well, and such a
> driver will use allocations linked to the particular NUMA node it is
> in. The kernel needs to know a reasonable place to shunt them to, and
> in more complex topologies the zone list may not correspond to that of
> any other node.

The kernel "needs"? No, it doesn't. Look at the "target_node" handling
for PMEM. Those nodes are offline, the distance can be determined, and
only when they become memory does the node become online. The only
point where I can see GI needing anything more than the equivalent of
"target_node" is when the scheduler can submit jobs to GI initiators
like a CPU. Otherwise, GI is just a seed for a node number plus numa
distance.
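To make that concrete, here is a minimal sketch, not an existing kernel
interface, of what "node number plus distance" could look like for an
aware driver: gi_pick_alloc_node() is a hypothetical helper, and it
assumes the GI's node id was already obtained via something like
acpi_map_pxm_to_node() on the GI's proximity domain.

/*
 * Hypothetical sketch, not an existing kernel interface: if a GI were
 * only a node id plus SLIT distance (along the lines of PMEM's
 * target_node handling), an aware driver could still find the nearest
 * memory node like this.
 */
#include <linux/kernel.h>
#include <linux/nodemask.h>
#include <linux/numa.h>
#include <linux/topology.h>

static int gi_pick_alloc_node(int gi_node)
{
        int nid, best = NUMA_NO_NODE, best_dist = INT_MAX;

        /* Walk nodes that actually have memory and pick the closest one */
        for_each_node_state(nid, N_MEMORY) {
                int dist = node_distance(gi_node, nid);

                if (dist < best_dist) {
                        best_dist = dist;
                        best = nid;
                }
        }

        return best;
}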
> In a CCIX world, for example, a GI can sit between a pair of Home
> Agents with memory, and the host on the other side of them. We had a
> lot of fun working through these cases back when drawing up the ACPI
> changes to support them.

:)

Yes, I can imagine several interesting ACPI cases, but still struggling
to justify the GI zone list metadata.
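For contrast, a rough sketch of the "unaware driver" case argued above;
example_probe() is hypothetical and not from any real driver. The point
is only that the allocation below works by falling back through the
(memoryless) GI node's zonelist, which is exactly the metadata under
debate here.

#include <linux/device.h>
#include <linux/gfp.h>
#include <linux/pci.h>
#include <linux/slab.h>

/*
 * Sketch only: a NUMA-unaware PCI driver's probe path. If the RC sits
 * in a memoryless GI node, this allocation succeeds only because the
 * page allocator falls back through that node's zonelist, ideally to
 * (a)/(c) in the topology drawn above rather than (d).
 */
static int example_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
        int nid = dev_to_node(&pdev->dev);
        void *buf;

        /* "Node local" allocation from the driver's point of view */
        buf = kzalloc_node(4096, GFP_KERNEL, nid);
        if (!buf)
                return -ENOMEM;

        /* ... normal device setup would go here ... */

        kfree(buf);
        return 0;
}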