On Fri, Feb 12, 2021 at 4:26 AM Jonathan Cameron
<Jonathan.Cameron@xxxxxxxxxx> wrote:
>
> On Thu, 11 Feb 2021 09:06:51 -0800
> Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
>
> > On Thu, Feb 11, 2021 at 1:44 AM Jonathan Cameron
> > <Jonathan.Cameron@xxxxxxxxxx> wrote:
> > >
> > > On Wed, 10 Feb 2021 08:24:51 -0800
> > > Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
> > >
> > > > On Wed, Feb 10, 2021 at 3:24 AM Jonathan Cameron
> > > > <Jonathan.Cameron@xxxxxxxxxx> wrote:
> > > > >
> > > > > On Tue, 9 Feb 2021 19:55:05 -0800
> > > > > Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
> > > > >
> > > > > > While the platform BIOS is able to describe the performance
> > > > > > characteristics of CXL memory that is present at boot, it is unable to
> > > > > > statically enumerate the performance of CXL memory hot inserted
> > > > > > post-boot. The OS can enumerate most of the characteristics from link
> > > > > > registers and CDAT, but the performance from the CPU to the host
> > > > > > bridge, for example, is not enumerated by PCIe or CXL. Introduce an
> > > > > > ACPI mechanism for this purpose. Critically this is achieved with a
> > > > > > small tweak to how the existing Generic Initiator proximity domain is
> > > > > > utilized in the HMAT.
> > > > >
> > > > > Hi Dan,
> > > > >
> > > > > Agree there is a hole here, but I think the proposed solution has some
> > > > > issues for backwards compatibility.
> > > > >
> > > > > Just to clarify, I believe CDAT from root ports is sufficient for the
> > > > > other direction (GI on CXL, memory in host). I wondered initially if
> > > > > this was a two way issue, but after a reread, I think that is fine
> > > > > with the root port providing CDAT, or potentially treating the root
> > > > > port as a GI (though that runs into the same naming / representation issue
> > > > > as below and I think would need some clarifying text in the UEFI GI description)
> > > > >
> > > > > http://uefi.org/sites/default/files/resources/Coherent%20Device%20Attribute%20Table_1.01.pdf
> > > > >
> > > > > For the case you are dealing with here potentially we 'could' add something
> > > > > to CDAT as an alternative to changing SRAT, but it would be more complex
> > > > > so your approach here makes more sense to me.
> > > >
> > > > CDAT seems the wrong mechanism because it identifies target
> > > > performance once you're at the front door of the device, not
> > > > performance relative to a given initiator.
> > >
> > > I'd argue you could make CDAT a more symmetric representation, but it would
> > > end up replicating a lot of info already in HMAT. Didn't say it was a good
> > > idea!
> >
> > CDAT describes points, HMAT describes edges on the performance graph;
> > it would be confusing if CDAT tried to supplant HMAT.
>
> Entirely agreed. Note I'm not disagreeing with the approach here at all, simply
> trying to point out where I think you'll get questions taking this to ACPI.

Understood.

> > >
> > > That's an odd situation where it sort of 'half' manages it in the BIOS.
> > > We probably need some supplementary additional docs around this topic
> > > as the OS would need to be aware of that possibility and explicitly check
> > > for it before doing its normal build based on CDAT + what you are proposing
> > > here. Maybe code is enough, but given this is cross-OS stuff I'd argue
> > > it probably isn't.
> > >
> > > I guess we could revisit this draft UEFI white paper perhaps and add a bunch
> > > of examples around this usecase: https://github.com/hisilicon/acpi-numa-whitepaper
> >
> > Thanks for the reference, I'll take a look.
> >
> > > > > > # Impact of the Change
> > > > > > The existing Generic Initiator Affinity Structure (ACPI 6.4 Section
> > > > > > 5.2.16.6) already contains all the fields necessary to enumerate a
> > > > > > generic target proximity domain. All that is missing is the
> > > > > > interpretation of that proximity domain optionally as a target
> > > > > > identifier in the HMAT.
> > > > > >
> > > > > > Given that the OS still needs to dynamically enumerate and instantiate
> > > > > > the memory ranges behind the host bridge, the assumption is that
> > > > > > operating systems that do not support native CXL enumeration will ignore
> > > > > > this data in the HMAT, while CXL native enumeration aware environments
> > > > > > will use this fragment of the performance path to calculate the
> > > > > > performance characteristics.
> > > > >
> > > > > I don't think it is true that an OS not supporting native CXL will ignore
> > > > > the data.
> > > >
> > > > True, I should have chosen more careful words like s/ignore/not
> > > > regress upon seeing/
> > >
> > > It's a sticky corner and I suspect likely to come up in the ACPI WG - what is
> > > being proposed here isn't backwards compatible
> >
> > It seems our definitions of backwards compatible are divergent. Please
> > correct me if I'm wrong, but I understand your position to be "any
> > perceptible OS behavior change breaks backwards compatibility",
> > whereas I'm advocating that backwards compatibility is about not
> > regressing real world use cases. That said, I do need to go mock this
> > up in QEMU and verify how much disturbance it causes.
> > >
> > > even if the impacts in Linux are small.
> >
> > I'd note the kernel would grind to a halt if the criterion for
> > "backwards compatible" was zero perceptible behavior change.
>
> Key thing here is the difference between Linux backwards compatibility and
> ACPI backwards compatibility. Linux impacts are easily understood because
> we can go look. ACPI change impacts in general are rather trickier to
> define.
>
> Note currently I'm fine with this change, though I think perhaps it could
> be improved - simply raising that others may not be.

I appreciate that, and as you are the originator of GI support I hold your
feedback in high regard. Yes, the criterion that it needs to be a functional
regression vs just a behavior change may be a difficult argument to carry
with others.

> > >
> > > Mostly it's infrastructure bring-up that won't get used
> > > (fallback lists and similar for a node which will never be specified in
> > > allocations) and some confusing userspace ABI (which is more than a little
> > > confusing already).
> >
> > Fallback lists are established relative to online nodes. These generic
> > targets are not onlined as memory.
>
> This doesn't really matter beyond the point that there is stuff done for
> a GI-only node currently that doesn't make any sense for this new usecase.
>
> Still, you got me digging...
>
> I may have the wrong term there, but GI nodes have a zonelist generated
> that reflects where an allocation requested on the GI node will actually
> go.
>
> devm_kzalloc() for example for a device in one of these nodes has to know
> where to allocate memory from. The same is true for other resources that
> don't exist in the GI node.
>
> Checking this worked as expected led me down a rabbit hole. I knew the end
> to end test did what I expected but wasn't certain why...
>
> Anyhow, at least for my test case on x86 doing an 8k devm_kzalloc() it works
> as follows.
>
> Some way down the nest of calls we get a call to
>
> __alloc_pages_nodemask()
>   -> prepare_alloc_pages() // this finds the zonelist
>     -> node_zonelist(gi_node, ...)
>       -> NODE_DATA(gi_node)->node_zonelists + gfp_zonelist(flags)
>          i.e. node_data[gi_node]->node_zonelists + gfp_zonelist(flags)
>
> The node_data entry for the gi_node has its own zonelist which is used in
> this path, the first entry of which is an appropriate zone on the nearest
> node which actually has some memory.
>
> The build is referenced by a comment in x86/mm/numa.c under
> init_memory_less_node(): "All zonelists will be built later in
> start_kernel() after per cpu areas are initialized".
>
> build_zonelists() does the expected building based on SLIT distances.
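For reference, the lookup you traced bottoms out in these two helpers from
include/linux/gfp.h (quoting them from memory, so the exact shape may differ
a little between kernel versions):

/* index into node_zonelists[]: fallback list unless __GFP_THISNODE */
static inline int gfp_zonelist(gfp_t flags)
{
#ifdef CONFIG_NUMA
	if (unlikely(flags & __GFP_THISNODE))
		return ZONELIST_NOFALLBACK;
#endif
	return ZONELIST_FALLBACK;
}

static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
{
	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
}

So a memoryless GI node only "works" as an allocation target because
build_zonelists() gave it a SLIT-distance-ordered fallback list pointing at
nodes that do have memory.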
Ugh, I had been holding onto the mental model that GI support was limited to
the HMAT sysfs representation. What is the use case for init_gi_nodes()? I
would expect GI information to only be useful for performance calculations,
and that can be done without touching the node infrastructure.

At first glance that metadata manipulation looks like a mistake to me; I wish
I had paid more attention as that went in... Regardless, this does indeed
change my view of Generic Target being a minor impact. Why does GI need
anything more than acpi_map_pxm_to_node() to have a node number assigned?
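To make the question concrete, this is roughly all I expected the native CXL
enumeration path to need from ACPI NUMA: the existing pxm-to-node mapping,
with the resulting node id used purely as a key into the HMAT data. The
cxl_generic_target_to_node() wrapper below is made up for illustration and
untested; acpi_map_pxm_to_node() is the existing helper in
drivers/acpi/numa/srat.c:

#include <linux/acpi.h>
#include <linux/numa.h>

/*
 * Hypothetical sketch: resolve a generic target proximity domain to a
 * node id so the CPU-to-host-bridge hop can be looked up in the HMAT,
 * without onlining the node or building zonelists for it.
 */
static int cxl_generic_target_to_node(int pxm)
{
	int node;

	/* existing SRAT helper: assigns (or reuses) a node id for this pxm */
	node = acpi_map_pxm_to_node(pxm);
	if (node == NUMA_NO_NODE)
		return NUMA_NO_NODE;

	/*
	 * No node_set_online(), no init_gi_nodes()-style registration;
	 * the node id is only a key for the performance calculation.
	 */
	return node;
}

That is the entire node-infrastructure footprint I had pictured for a
generic target.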