On Mon, Oct 18, 2021 at 2:25 AM Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> wrote:
>
> On Fri, 15 Oct 2021 11:58:36 -0700
> Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
>
> > On Fri, Oct 15, 2021 at 10:00 AM Jonathan Cameron
> > <Jonathan.Cameron@xxxxxxxxxx> wrote:
> > >
> > > On Fri, 8 Oct 2021 18:53:39 -0700
> > > <alison.schofield@xxxxxxxxx> wrote:
> > >
> > > > From: Alison Schofield <alison.schofield@xxxxxxxxx>
> > > >
> > > > During NUMA init, CXL memory defined in the SRAT Memory Affinity
> > > > subtable may be assigned to a NUMA node. Since there is no
> > > > requirement that the SRAT be comprehensive for CXL memory another
> > > > mechanism is needed to assign NUMA nodes to CXL memory not identified
> > > > in the SRAT.
> > > >
> > > > Use the CXL Fixed Memory Window Structures (CFMWS) of the ACPI CXL
> > > > Early Discovery Table (CEDT) to find all CXL memory ranges. Create a
> > > > NUMA node for each range that is not already assigned to a NUMA node.
> > > > Add a memblk attaching its host physical address range to the node.
> > > >
> > > > Note that these ranges may not actually map any memory at boot time.
> > > > They may describe persistent capacity or may be present to enable
> > > > hot-plug.
> > > >
> > > > Consumers can use phys_to_target_node() to discover the NUMA node.
> > > >
> > > > Signed-off-by: Alison Schofield <alison.schofield@xxxxxxxxx>
> > > Hi Alison,
> > >
> > > I'm not sure that a CFMWS entry should map to a single NUMA node...
> > >
> > > Each entry corresponds to a contiguous HPA range into which CXL devices
> > > below a set of ports (if interleaved) or one port should be mapped.
> > >
> > > That could be multiple devices, each with its own performance characteristics,
> > > or potentially a mix of persistent and volatile memory on a system with limited
> > > qtg groups.
> > >
> > > Maybe it's the best we can do though given information available
> > > before any devices are present.
> > >
> >
> > Regardless of the performance of the individual devices they can only
> > map to one of the available CFMWS entries. So the maximum number of
> > degrees of freedom is one node per CFMWS. Now if you have only one
> > entry to pick from, but have interleave sets with widely different
> > performance characteristics to online it becomes a policy decision
> > about whether to force map those interleave sets into the same node,
> > and that policy can be maintained outside the kernel.
> >
> > The alternative is to rework NUMA nodes to be something that can be
> > declared dynamically as currently there are assumptions throughout the
> > kernel that num_possible_nodes() is statically determined early in
> > boot. I am not seeing strong evidence that complexity needs to be
> > tackled in the near term, and "NUMA-node per CFMWS" should (famous
> > last words) serve CXL needs for the foreseeable future.
>
> I'm less optimistic we won't end up revisiting this in the medium
> term but can tackle that when we have better visibility of what
> people are actually building.

Agreed. When we were game-planning this patch internally, the two
options were: build full support for defining new NUMA nodes after
boot, or minimally extend the boot-time NUMA node possibilities by the
declared degrees of freedom in the CFMWS. The latter path was taken
because it gets us "80%" of what CXL needs without precluding the
former path later, if the remaining "20%" proves critical enough to
warrant finer-grained dynamic support.
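
For readers following along, here is a rough user-space sketch of the
"one NUMA node per CFMWS" idea discussed above. It is not the patch
itself and does not use kernel APIs: struct node_map, struct
cfmws_window, lookup_node(), add_blk() and assign_cfmws_nodes() are all
hypothetical names made up for illustration, with lookup_node() playing
the role that phys_to_target_node() plays for consumers. As a
simplification, the sketch only checks whether the base of each window
is already covered by an existing range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_NODE  (-1)
#define MAX_BLKS 16

/* Hypothetical stand-in for a NUMA memblk: [start, end) belongs to nid. */
struct range_node {
	uint64_t start;
	uint64_t end;
	int nid;
};

/* Hypothetical stand-in for one CFMWS entry: a contiguous HPA window. */
struct cfmws_window {
	uint64_t base;
	uint64_t size;
};

struct node_map {
	struct range_node blks[MAX_BLKS];
	size_t nr_blks;
	int next_nid;	/* first node id not yet handed out */
};

/* Toy analogue of phys_to_target_node(): owning node, or NO_NODE. */
static int lookup_node(const struct node_map *map, uint64_t addr)
{
	for (size_t i = 0; i < map->nr_blks; i++)
		if (addr >= map->blks[i].start && addr < map->blks[i].end)
			return map->blks[i].nid;
	return NO_NODE;
}

/* Record that [start, end) belongs to node nid. */
static void add_blk(struct node_map *map, uint64_t start, uint64_t end,
		    int nid)
{
	assert(map->nr_blks < MAX_BLKS);
	map->blks[map->nr_blks++] = (struct range_node){ start, end, nid };
}

/*
 * Walk the windows; any window whose base is not already covered
 * (e.g. by SRAT-derived ranges) gets a fresh node and a range entry.
 * Returns the number of nodes created.
 */
static int assign_cfmws_nodes(struct node_map *map,
			      const struct cfmws_window *w, size_t n)
{
	int created = 0;

	for (size_t i = 0; i < n; i++) {
		if (lookup_node(map, w[i].base) != NO_NODE)
			continue;	/* already assigned, leave it alone */
		add_blk(map, w[i].base, w[i].base + w[i].size,
			map->next_nid++);
		created++;
	}
	return created;
}
```

In this model the set of possible nodes is fixed once
assign_cfmws_nodes() has run, which mirrors the boot-time-only choice
described above: everything later mapped into a window lands in that
window's node, and any finer split is left to policy elsewhere.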