On Tue, Feb 16, 2021 at 10:08 AM Jonathan Cameron
<Jonathan.Cameron@xxxxxxxxxx> wrote:
>
> On Tue, 16 Feb 2021 08:29:01 -0800
> Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
>
> > On Tue, Feb 16, 2021 at 3:08 AM Jonathan Cameron
> > <Jonathan.Cameron@xxxxxxxxxx> wrote:
> > [..]
> > > > Why does GI need anything more than acpi_map_pxm_to_node() to have a
> > > > node number assigned?
> > >
> > > It might have been possible (with limitations) to do it by making
> > > multiple proximity domains map to a single numa node, along with some
> > > additional functionality to allow aware drivers to retrieve the real
> > > node, but seeing as we already had the memoryless node infrastructure
> > > in place, it fitted more naturally into that scheme. The introduction
> > > of GI to the ACPI spec, and indeed to the kernel, was originally
> > > driven by the needs of CCIX (before CXL was public), with CCIX's
> > > symmetric view of initiators (CPU or other), plus a few other existing
> > > situations where we'd been papering over the topology for years and
> > > paying a cost in custom load balancing in drivers etc. That more
> > > symmetric view meant that the natural approach was to treat these as
> > > memoryless nodes.
> > >
> > > The full handling of nodes is needed to deal with situations like the
> > > following contrived setup. With a few interconnect links I haven't
> > > bothered drawing, there are existing systems where a portion of the
> > > topology looks like this:
> > >
> > >    RAM                        RAM         RAM
> > >     |                          |           |
> > >  --------    ----------    --------    --------
> > >  |  a   |    |   b    |    |  c   |    |  d   |
> > >  | CPUs |----| PCI RC |----| CPUs |----| CPUs |
> > >  |      |    |        |    |      |    |      |
> > >  --------    ----------    --------    --------
> > >                  |
> > >               PCI EP
> > >
> > > We need the GI representation to allow an "aware" driver to understand
> > > that the PCI EP is an equal distance from the CPUs and RAM on (a) and
> > > (c) (and that using allocations from (d) is a bad idea). This would be
> > > the same as a driver running on a PCI RC attached to a memoryless CPU
> > > node (you would hope no one would build one of those, but I've seen
> > > them occasionally). Such an aware driver carefully places both memory
> > > and processing threads / interrupts etc to balance the load.
> >
> > That's an explanation for why GI exists, not an explanation for why a
> > GI needs to be anything more than translated to a Linux numa node
> > number and an api to look up distance.
>
> Why should a random driver need to know it needs to do something special?
>
> Random drivers don't look up distance, they just allocate memory based
> on their current numa_node. devm_kzalloc() does this under the hood
> (an optimization that rather took me by surprise at the time).
> Sure we could add a bunch of new infrastructure to solve that problem,
> but why not use what is already there?
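
Right, and to make sure we're describing the same mechanism: my reading
of the devres code is that the "unaware" driver path boils down to
roughly the sketch below (a paraphrase for discussion, not the actual
implementation):

#include <linux/device.h>
#include <linux/slab.h>

/*
 * Approximately what devm_kzalloc() does under the hood: the allocation
 * is steered to dev_to_node(dev), i.e. dev->numa_node. For a device
 * enumerated below a Generic Initiator proximity domain that is the GI
 * node, so whatever node number is assigned there has to be safe to
 * hand to the page allocator.
 */
static void *unaware_driver_alloc(struct device *dev, size_t size)
{
	return kzalloc_node(size, GFP_KERNEL, dev_to_node(dev));
}

...so the disagreement is about what node number should live in
dev->numa_node, not about whether drivers consult it.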

> > > In pre-GI days, we could just drop (b) into (a) or (c) and not worry
> > > about it, but that comes with a large performance cost (20% plus on
> > > network throughput on some of our more crazy systems, due to it
> > > appearing that balancing memory load across (a) and (c) doesn't make
> > > sense). Also, if we happened to drop it into (c), then once we run out
> > > of space on (c) we'll start using (d), which is a bad idea.
> > >
> > > With GI nodes, you need an unaware PCI driver to work well, and they
> > > will use allocations linked to the particular NUMA node that they are
> > > in. The kernel needs to know a reasonable place to shunt them to, and
> > > in more complex topologies the zone list may not correspond to that of
> > > any other node.
> >
> > The kernel "needs", no it doesn't. Look at the "target_node" handling
> > for PMEM. Those nodes are offline, the distance can be determined, and
> > only when they become memory does the node become online.
>
> Indeed, custom code for specific cases can work just fine (we've carried
> plenty of it in the past to get best performance from systems), but for
> GIs the intent was that they would just work. We don't want to have to
> go and change stuff in PCI drivers every time we plug a new card into
> such a system.
>
> > The only point at which I can see GI needing anything more than the
> > equivalent of "target_node" is when the scheduler can submit jobs to GI
> > initiators like a CPU. Otherwise, GI is just a seed for a node number
> > plus numa distance.
>
> That would be true if Linux didn't already make heavy use of numa_node
> for driver allocations. We could carry a parallel value of
> 'real_numa_node' or something like that, but you can't safely use
> numa_node without the node being online and zone lists present.
> Another way of looking at it is that the zone list is a cache solving
> the question of where to allocate memory, which you could also solve
> using the node number and distances (at the cost of custom handling).
>
> It is of course advantageous to do cleverer things for particular
> drivers, but the vast majority need to just work.
>
> > > In a CCIX world, for example, a GI can sit between a pair of Home
> > > Agents with memory and the host on the other side of them. We had a
> > > lot of fun working through these cases back when drawing up the ACPI
> > > changes to support them. :)
> >
> > Yes, I can imagine several interesting ACPI cases, but still struggling
> > to justify the GI zone list metadata.
>
> It works. It solves the problem. It's very little extra code and it
> exercises zero paths not already exercised by memoryless nodes. We
> certainly wouldn't have invented something as complex as zone lists if
> we couldn't leverage what was already there, of course.
>
> So I have the opposite viewpoint. I can't see why the minor overhead of
> zone list metadata for GIs isn't a sensible choice vs the cost of
> maintaining something entirely different. This only changes with the
> intent to use them to represent something different.

What I am missing is what zone-list metadata offers beyond just assigning
the device numa node to the closest online memory node and letting the
HMAT sysfs representation enumerate the next level?

For example, the persistent memory enabling assigns the closest online
memory node for the pmem device. That achieves the traditional behavior
of the device driver allocating from "local" memory by default. However,
the HMAT sysfs representation indicates the numa node that the pmem
itself would represent were it to be online. So the question is why does
GI need more than that?

To me a GI is "offline" in terms of Linux node representation because
numactl can't target it, "closest online" is good enough for a GI device
driver, but if userspace needs the next level of detail about the
performance properties, that's what HMEM sysfs is providing.
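
Put differently, here is the kind of thing I have in mind (just a sketch
for discussion, the helper name is invented and this is not from any
posted patch): keep the GI proximity domain as a node id for distance /
HMAT purposes, but point the device at the nearest node that actually
has memory, the same way the pmem "target_node" handling resolves to an
online node:

#include <linux/limits.h>
#include <linux/nodemask.h>
#include <linux/numa.h>
#include <linux/topology.h>

/* Hypothetical helper: closest node with memory to a given GI node. */
static int gi_closest_memory_node(int gi_node)
{
	int node, best = NUMA_NO_NODE, best_dist = INT_MAX;

	for_each_node_state(node, N_MEMORY) {
		int dist = node_distance(gi_node, node);

		if (dist < best_dist) {
			best_dist = dist;
			best = node;
		}
	}
	return best;
}

Device setup would then do set_dev_node(dev, gi_closest_memory_node(gi)),
devm_kzalloc() keeps doing the right thing for unaware drivers, the GI
node itself never needs to be online with its own zonelists, and aware
drivers / userspace still get the distance and HMAT data.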