RE: [PATCH v3] ACPI: NUMA: Add a node and memblk for each CFMWS not in SRAT

Vikram Sethi <vsethi@xxxxxxxxxx> · Fri, 22 Oct 2021 21:58:12 +0000

Hi Dan, 
Thanks for the detailed response. Some comments/questions inline. 

> -----Original Message-----
> From: Dan Williams <dan.j.williams@xxxxxxxxx>
> Sent: Thursday, October 21, 2021 9:01 PM
> To: Vikram Sethi <vsethi@xxxxxxxxxx>
> Cc: Alison Schofield <alison.schofield@xxxxxxxxx>; Rafael J. Wysocki
> <rafael@xxxxxxxxxx>; Len Brown <lenb@xxxxxxxxxx>; Vishal Verma
> <vishal.l.verma@xxxxxxxxx>; Ira Weiny <ira.weiny@xxxxxxxxx>; Ben
> Widawsky <ben.widawsky@xxxxxxxxx>; linux-cxl@xxxxxxxxxxxxxxx; linux-
> acpi@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH v3] ACPI: NUMA: Add a node and memblk for each
> CFMWS not in SRAT
> 
> On Thu, Oct 21, 2021 at 8:57 AM Vikram Sethi <vsethi@xxxxxxxxxx> wrote:
> >
> > > -----Original Message-----
> > > From: Alison Schofield <alison.schofield@xxxxxxxxx>
> > > Sent: Wednesday, October 20, 2021 8:00 PM
> > > To: Vikram Sethi <vsethi@xxxxxxxxxx>
> > > Cc: Rafael J. Wysocki <rafael@xxxxxxxxxx>; Len Brown
> > > <lenb@xxxxxxxxxx>; Vishal Verma <vishal.l.verma@xxxxxxxxx>; Ira
> > > Weiny <ira.weiny@xxxxxxxxx>; Ben Widawsky
> <ben.widawsky@xxxxxxxxx>;
> > > Dan Williams <dan.j.williams@xxxxxxxxx>; linux-cxl@xxxxxxxxxxxxxxx;
> > > linux- acpi@xxxxxxxxxxxxxxx
> > > Subject: Re: [PATCH v3] ACPI: NUMA: Add a node and memblk for each
> > > CFMWS not in SRAT
> > >
> > snip
> >
> > > > >
> > > > > Consumers can use phys_to_target_node() to discover the NUMA
> node.
> > > >
> > > > Does this patch work for CXL type 2 memory which is not in SRAT? A
> > > > type 2 driver can find its HDM BASE physical address from its CXL
> > > > registers and figure out its NUMA node id by calling
> phys_to_target_node?
> > >
> > > Yes. This adds the nodes for the case where the BIOS doesn't fully
> > > describe everything in CFMWS in the SRAT. And, yes, that is how the
> > > NUMA node can be discovered.
> > >
> > > > Or is type 2 HDM currently being skipped altogether?
> > >
> > > Not sure what you mean by 'being skipped altogether'? The BIOS may
> > > describe (all or none or some) of CXL Memory in the SRAT. In the
> > > case where BIOS describes it all, NUMA nodes will already exist, and
> > > no new nodes will be added here.
> > >
> > My question about skipping type2 wasn't directly related to your
> > patch, but more of a question about current upstream support for
> > probe/configuration of type 2 accelerator devices memory, irrespective
> > of whether FW shows type 2 memory in SRAT.
> 
> SRAT only has Type-2 ranges if the platform firmware maps the device's
> memory into the EFI memory map (includes ACPI SRAT / SLIT / HMAT
> population). I expect that situation to be negotiated on a case by case basis
> between Type-2 device vendors and platform firmware vendors. 

Agreed
> There is no requirement that any CXL memory, type-2 or type-3, is mapped by platform
> firmware. Per the CDAT specification, platform firmware is capable to map
> CXL into the EFI memory map at boot, but there is no requirement for it to do
> so.
> 
> My expectation is that Linux will need to handle the full gamut of possibilities
> here, i.e. all / some / none of the CXL Type-3 devices present at boot
> mapped into the EFI memory map, and all / some / none of the CXL Type-2
> devices mapped into the EFI memory map.
> 
Agreed. IIUC, if FW has initialized HDM base/decoder HPA in device, shown 
device HDM HPA in EFI Memory Map as reserved based on CDAT, and shown 
PXM for Device memory in SRAT, kernel can validate no conflicts in FW HDM HPA
assignments vs CFMWS, and then map the PXM from SRAT to a node ID.
The CFMWS DSM seems unnecessary for this case to get the NUMA assignments,
 right? Type 2 Driver calls phys_to_target_node() and then add_memory_driver_managed.

> > The desired outcome is that the kernel CXL driver recognizes such type
> > 2 HDM, and assigns it a NUMA node such that the type 2 driver
> 
> Note that there's no driver involved at this point. Alison's patch is just
> augmenting the ACPI declared NUMA nodes at boot so that the core-mm is
> not surprised by undeclared NUMA nodes at
> add_memory_driver_managed() time.
> 
> > can later add/online this memory,
> > via add_memory_driver_managed which requires a NUMA node ID (which
> > driver can discover after your patch by calling phys_to_target_node).
> 
> Yes, with this patch there are at least enough nodes for
> add_memory_driver_managed() to have a reasonable answer for a NUMA
> node for Type-2 memory. However, as Jonathan and I were discussing, this
> minimum enabling may prove insufficient if, for example, you had one
> CFMWS entry for all Type-2 memory in the system, but multiple disparate
> accelerators that want to each do add_memory_driver_managed(). 

CEDT CFMWS ECN says "The CFMWS structure describes zero or more Host Physical 
Address (HPA) windows associated with *each CXL Host Bridge*". 
So are you concerned about multiple type 2 accelerators under the same Host bridge? 
IIRC CXL 2.0 only allows 1 type 2 device under a host bridge, but perhaps that was 
under 1 "root port", and you are pointing out that a system that has multiple 
root ports under 1 CXL host bridge can have multiple CXL type 2 accelerators under it. 
Future revisions of the spec could always relax current restrictions even under 1 
root port, so I do see the problem about multiple type 2 devices under 1 CFMWS 
window wanting their own NUMA nodes. Apologies if I'm mangling the 
terminology for host bridge vs root port, but trying to correlate with PCIe terminology 
of RC/RP.

> In that scenario all of those accelerators, which might want to have a target-node
> per target-device, will all share one target-node. That said, unless and until it
> becomes clear that system architectures require Linux to define multiple
> nodes per CFMWS, I am happy to kick that can down the road. Also, platform
> firmware can solve this problem by subdividing
> Type-2 with multiple QTG ids so that multiple target devices can each be
> assigned to a different CFMWS entry sandbox, i.e. with more degrees of
> freedom declared by platform firmware in the CFMWS it relieves pressure on
> the OS to need a dynamic NUMA node definition capability.
> 
Let's say there are 2 type 2 accelerators with the same latency/bandwidth 
properties under a given CXL host bridge. When the DSM is executed on this host 
bridge with the latency/BW as input, IIUC you're saying FW could return 2 possible 
QTG IDs, let's say 4 and 5? And for that host bridge, FW also create 2 CFMWS 
windows of HPA for type 2, with one showing QTG ID =4, and the other showing 
QTG ID=5, with each having at least enough HPA space to map 1 of those devices?

> > Would the current upstream code for HDM work as described above,
> 
> Current upstream code that enumerates Type-2 is the cxl_acpi driver that
> enumerates platform CXL capabilities.
> 
> > and if so, does it
> > rely on CDAT DSEMTS structure showing a specific value for EFI memory
> > type? i.e would it work if that field in DSEMTS was either
> > EFI_CONVENTIONAL_MEMORY with EFI_MEMORY_SP, or
> EFI_RESERVED_MEMORY?
> 
> If platform firmware maps the HDM the expectation is that it will use the
> CDAT to determine the EFI memory type. If platform firmware declines to
> map the device and lets Linux map it then that's de-facto "reserved" memory
> and the driver (generic CXL-Type-3 / or vendor specific CXL-Type-2) gets to
> do insert_resource() with whatever Linux type it deems appropriate, i.e. EFI
> is out of the picture in this scenario.