Re: [PATCH v3] ACPI: NUMA: Add a node and memblk for each CFMWS not in SRAT

Dan Williams <dan.j.williams@xxxxxxxxx> · Thu, 21 Oct 2021 19:01:17 -0700

On Thu, Oct 21, 2021 at 8:57 AM Vikram Sethi <vsethi@xxxxxxxxxx> wrote:
>
>
>
> > -----Original Message-----
> > From: Alison Schofield <alison.schofield@xxxxxxxxx>
> > Sent: Wednesday, October 20, 2021 8:00 PM
> > To: Vikram Sethi <vsethi@xxxxxxxxxx>
> > Cc: Rafael J. Wysocki <rafael@xxxxxxxxxx>; Len Brown <lenb@xxxxxxxxxx>;
> > Vishal Verma <vishal.l.verma@xxxxxxxxx>; Ira Weiny <ira.weiny@xxxxxxxxx>;
> > Ben Widawsky <ben.widawsky@xxxxxxxxx>; Dan Williams
> > <dan.j.williams@xxxxxxxxx>; linux-cxl@xxxxxxxxxxxxxxx; linux-
> > acpi@xxxxxxxxxxxxxxx
> > Subject: Re: [PATCH v3] ACPI: NUMA: Add a node and memblk for each
> > CFMWS not in SRAT
> >
> snip
>
> > > >
> > > > Consumers can use phys_to_target_node() to discover the NUMA node.
> > >
> > > Does this patch work for CXL type 2 memory which is not in SRAT? A
> > > type 2 driver can find its HDM BASE physical address from its CXL
> > > registers and figure out its NUMA node id by calling phys_to_target_node?
> >
> > Yes. This adds the nodes for the case where the BIOS doesn't fully describe
> > everything in CFMWS in the SRAT. And, yes, that is how the NUMA node can
> > be discovered.
> >
> > > Or is type 2 HDM currently being skipped altogether?
> >
> > Not sure what you mean by 'being skipped altogether'? The BIOS may
> > describe (all or none or some) of CXL Memory in the SRAT. In the case where
> > BIOS describes it all, NUMA nodes will already exist, and no new nodes will
> > be added here.
> >
> My question about skipping type2 wasn't directly related to your patch,
> but more of a question about current upstream support for probe/configuration
> of type 2 accelerator devices memory, irrespective of whether FW shows type 2
> memory in SRAT.

SRAT only has Type-2 ranges if the platform firmware maps the device's
memory into the EFI memory map (which includes populating the ACPI
SRAT / SLIT / HMAT). I expect that situation to be negotiated on a
case-by-case basis between Type-2 device vendors and platform firmware
vendors. There is no requirement that any CXL memory, Type-2 or
Type-3, be mapped by platform firmware. Per the CDAT specification,
platform firmware is able to map CXL memory into the EFI memory map at
boot, but nothing obliges it to do so.

My expectation is that Linux will need to handle the full gamut of
possibilities here, i.e. all / some / none of the CXL Type-3 devices
present at boot mapped into the EFI memory map, and all / some / none
of the CXL Type-2 devices mapped into the EFI memory map.

> The desired outcome is that the kernel CXL driver recognizes such type 2 HDM,
> and assigns it a NUMA node such that the type 2 driver

Note that there's no driver involved at this point. Alison's patch is
just augmenting the ACPI declared NUMA nodes at boot so that the
core-mm is not surprised by undeclared NUMA nodes at
add_memory_driver_managed() time.
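
To make the "augmenting" step concrete, here is a minimal userspace
sketch of the idea, not the patch itself. Everything in it is an
assumption for illustration: the model_* names, the fixed-size table,
and the simplification that only the window's base address is checked
against the existing memblks. It only shows that a CFMWS window the
SRAT already covers keeps its node, while an undescribed window gets a
fresh one that a later phys_to_target_node()-style lookup can find.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUMA_NO_NODE (-1)
#define MAX_MEMBLKS  16

/* Userspace stand-in for a NUMA memblk: [start, end) belongs to nid. */
struct memblk {
	uint64_t start, end;
	int nid;
};

/* Hypothetical model state: known memblks plus the next unused node id. */
struct numa_model {
	struct memblk blk[MAX_MEMBLKS];
	size_t nr;
	int next_nid;
};

/* Model of a phys_to_target_node()-style lookup. */
static int model_phys_to_target_node(const struct numa_model *m, uint64_t addr)
{
	for (size_t i = 0; i < m->nr; i++)
		if (addr >= m->blk[i].start && addr < m->blk[i].end)
			return m->blk[i].nid;
	return NUMA_NO_NODE;
}

/* Record that [start, end) belongs to nid; returns 0 on success. */
static int model_add_memblk(struct numa_model *m, int nid,
			    uint64_t start, uint64_t end)
{
	assert(start < end);
	if (m->nr == MAX_MEMBLKS)
		return -1;
	m->blk[m->nr++] = (struct memblk){ start, end, nid };
	return 0;
}

/*
 * Model of the per-CFMWS step: if the window base is already described
 * (e.g. by SRAT), reuse that node; otherwise create a new node and a
 * memblk spanning the window. Returns the node id for the window.
 */
static int model_parse_cfmws(struct numa_model *m, uint64_t base, uint64_t size)
{
	int nid = model_phys_to_target_node(m, base);

	if (nid != NUMA_NO_NODE)
		return nid;

	nid = m->next_nid;
	if (model_add_memblk(m, nid, base, base + size))
		return NUMA_NO_NODE;
	m->next_nid++;
	return nid;
}
```

Seeding the model with one SRAT-described range and then feeding it
two windows, one inside that range and one outside it, adds exactly
one new node.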

> can later add/online this memory,
> via add_memory_driver_managed which requires a NUMA node ID (which driver can
> discover after your patch by calling phys_to_target_node).

Yes, with this patch there are at least enough nodes for
add_memory_driver_managed() to have a reasonable answer for a NUMA
node for Type-2 memory. However, as Jonathan and I were discussing,
this minimum enabling may prove insufficient if, for example, you had
one CFMWS entry for all Type-2 memory in the system, but multiple
disparate accelerators that want to each do
add_memory_driver_managed(). In that scenario all of those
accelerators, which might want to have a target-node per
target-device, will all share one target-node. That said, unless and
until it becomes clear that system architectures require Linux to
define multiple nodes per CFMWS, I am happy to kick that can down the
road. Also, platform firmware can solve this problem by subdividing
Type-2 memory with multiple QTG ids so that multiple target devices can
each be assigned to a different CFMWS entry sandbox, i.e. the more
degrees of freedom platform firmware declares in the CFMWS, the less
pressure there is on the OS to grow a dynamic NUMA node definition
capability.
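
As a rough illustration of that trade-off, the sketch below compares
one CFMWS window covering two accelerators with a firmware-subdivided
layout. All addresses, sizes, node ids, and names (struct window,
window_node(), ACCEL_*_HDM) are made up for the example, and QTG ids
are not modeled at all; the only point is that the node a device gets
is decided by which window its HDM base falls in.

```c
#include <stddef.h>
#include <stdint.h>

#define NUMA_NO_NODE (-1)

/* Hypothetical model of a CFMWS window and the node assigned to it. */
struct window {
	uint64_t base, size;
	int nid;
};

/* Return the node of the window containing hdm_base, if any. */
static int window_node(const struct window *w, size_t n, uint64_t hdm_base)
{
	for (size_t i = 0; i < n; i++)
		if (hdm_base >= w[i].base && hdm_base - w[i].base < w[i].size)
			return w[i].nid;
	return NUMA_NO_NODE;
}

/* Made-up HDM base addresses for two accelerators. */
#define ACCEL_A_HDM 0x100000000ULL
#define ACCEL_B_HDM 0x140000000ULL

/* One window for all Type-2 memory: both devices land on node 2. */
static const struct window one_window[] = {
	{ 0x100000000ULL, 0x80000000ULL, 2 },
};

/* Firmware splits the same range in two: one node per device. */
static const struct window split_windows[] = {
	{ 0x100000000ULL, 0x40000000ULL, 2 },
	{ 0x140000000ULL, 0x40000000ULL, 3 },
};
```

With one_window both lookups return node 2; with split_windows the
same two addresses resolve to nodes 2 and 3 without the OS having to
invent nodes on its own.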

> Would the current upstream code for HDM work as described above,

The current upstream code that enumerates Type-2 is the cxl_acpi
driver, which enumerates platform CXL capabilities.

> and if so, does it
> rely on CDAT DSEMTS structure showing a specific value for EFI memory type? i.e would it
> work if that field in DSEMTS was either EFI_CONVENTIONAL_MEMORY with EFI_MEMORY_SP,
> or EFI_RESERVED_MEMORY?

If platform firmware maps the HDM, the expectation is that it will use
the CDAT to determine the EFI memory type. If platform firmware
declines to map the device and lets Linux map it, then that's de facto
"reserved" memory and the driver (generic CXL Type-3 or vendor-specific
CXL Type-2) gets to do insert_resource() with whatever Linux type it
deems appropriate, i.e. EFI is out of the picture in this scenario.