On Tue, Feb 16, 2021 at 3:08 AM Jonathan Cameron
<Jonathan.Cameron@xxxxxxxxxx> wrote:
[..]
> > Why does GI need anything more than acpi_map_pxm_to_node() to have a
> > node number assigned?
>
> It might have been possible (with limitations) to do it by making
> multiple proximity domains map to a single numa node, along with some
> additional functionality to allow it to retrieve the real node for
> aware drivers, but seeing as we already had the memoryless node
> infrastructure in place, it fitted more naturally into that scheme.
> GI's introduction to the ACPI spec, and indeed the kernel, was
> originally driven by the needs of CCIX (before CXL was public), with
> CCIX's symmetric view of initiators (CPU or other), plus a few other
> existing situations where we'd been papering over the topology for
> years and paying a cost in custom load balancing in drivers etc. That
> more symmetric view meant that the natural approach was to treat these
> as memoryless nodes.
>
> The full handling of nodes is needed to deal with situations like the
> following contrived setup. With a few interconnect links I haven't
> bothered drawing, there are existing systems where a portion of the
> topology looks like this:
>
>
>   RAM                       RAM        RAM
>    |                         |          |
> --------    ----------    --------   --------
> |  a   |    |   b    |    |  c   |   |  d   |
> | CPUs |----| PCI RC |----| CPUs |---| CPUs |
> |      |    |        |    |      |   |      |
> --------    ----------    --------   --------
>                 |
>               PCI EP
>
> We need the GI representation to allow an "aware" driver to understand
> that the PCI EP is an equal distance from the CPUs and RAM on (a) and
> (c) (and that using allocations from (d) is a bad idea). This would be
> the same as a driver running on a PCI RC attached to a memoryless CPU
> node (you would hope no one would build one of those, but I've seen
> them occasionally). Such an aware driver carefully places both memory
> and processing threads / interrupts etc. to balance the load.

That's an explanation for why GI exists, not an explanation for why a
GI needs to be anything more than a translated Linux numa node number
and an API to look up distance.

> In pre-GI days, we could just drop (b) into (a) or (c) and not worry
> about it, but that comes with a large performance cost (20% plus on
> network throughput on some of our more crazy systems, due to it
> appearing that balancing memory load across (a) and (c) doesn't make
> sense). Also, if we happened to drop it into (c), then once we run out
> of space on (c) we'll start using (d), which is a bad idea.
>
> With GI nodes, you need an unaware PCI driver to work well, and such a
> driver will use allocations linked to the particular NUMA node it is
> in. The kernel needs to know a reasonable place to shunt them to, and
> in more complex topologies the zone list may not correspond to that of
> any other node.

The kernel "needs"? No, it doesn't. Look at the "target_node" handling
for PMEM. Those nodes are offline, the distance can be determined, and
only when they become memory does the node become online. The only
point where I can see GI needing anything more than the equivalent of
"target_node" is when the scheduler can submit jobs to GI initiators
like a CPU. Otherwise, GI is just a seed for a node number plus numa
distance.
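To make that concrete, here is a minimal sketch, not an existing kernel
interface, of what "node number plus distance" could look like for an
aware driver: gi_pick_alloc_node() is a hypothetical helper, and it
assumes the GI's node id was already obtained via something like
acpi_map_pxm_to_node() on the GI's proximity domain.

/*
 * Hypothetical sketch, not an existing kernel interface: if a GI were
 * only a node id plus SLIT distance (along the lines of PMEM's
 * target_node handling), an aware driver could still find the nearest
 * memory node like this.
 */
#include <linux/kernel.h>
#include <linux/nodemask.h>
#include <linux/numa.h>
#include <linux/topology.h>

static int gi_pick_alloc_node(int gi_node)
{
        int nid, best = NUMA_NO_NODE, best_dist = INT_MAX;

        /* Walk nodes that actually have memory and pick the closest one */
        for_each_node_state(nid, N_MEMORY) {
                int dist = node_distance(gi_node, nid);

                if (dist < best_dist) {
                        best_dist = dist;
                        best = nid;
                }
        }

        return best;
}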
> In a CCIX world, for example, a GI can sit between a pair of Home
> Agents with memory, and the host on the other side of them. We had a
> lot of fun working through these cases back when drawing up the ACPI
> changes to support them.

:)

Yes, I can imagine several interesting ACPI cases, but still struggling
to justify the GI zone list metadata.
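For contrast, a rough sketch of the "unaware driver" case argued above;
example_probe() is hypothetical and not from any real driver. The point
is only that the allocation below works by falling back through the
(memoryless) GI node's zonelist, which is exactly the metadata under
debate here.

#include <linux/device.h>
#include <linux/gfp.h>
#include <linux/pci.h>
#include <linux/slab.h>

/*
 * Sketch only: a NUMA-unaware PCI driver's probe path. If the RC sits
 * in a memoryless GI node, this allocation succeeds only because the
 * page allocator falls back through that node's zonelist, ideally to
 * (a)/(c) in the topology drawn above rather than (d).
 */
static int example_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
        int nid = dev_to_node(&pdev->dev);
        void *buf;

        /* "Node local" allocation from the driver's point of view */
        buf = kzalloc_node(4096, GFP_KERNEL, nid);
        if (!buf)
                return -ENOMEM;

        /* ... normal device setup would go here ... */

        kfree(buf);
        return 0;
}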