Hi Gregory, thanks for the kind explanation.

On Tue, 11 Mar 2025 00:42:49 -0400 Gregory Price <gourry@xxxxxxxxxx> wrote:

> On Tue, Mar 11, 2025 at 01:02:07PM +0900, Yunjeong Mun wrote:
>
> forenote - Hi Andrew, please hold off on the auto-configuration patch
> for now, the sk group has identified a hotplug issue we need to work
> out, and we'll likely need to merge these two patch sets together. I
> really appreciate your patience with this feature.
>
> > Hi Gregory,
> >
> > In my understanding, the reason we are seeing 12 NUMA nodes is that
> > the code [1] below loops through node_states[N_POSSIBLE], whose value
> > is 4095 (twelve ones):
> >
... snip ...
>
> Appreciated - so yes, this confirms what I thought was going on. There
> are 4 host bridges, 2 devices on each host bridge, and an extra CFMWS
> per socket that is intended to interleave across the host bridges.
>
> As you mention below, the code in acpi/numa/srat.c will create 1 NUMA
> node per SRAT Memory Affinity Entry - and then also 1 NUMA node per
> CFMWS that doesn't have a matching SRAT entry (with a known corner case
> for a missing SRAT which doesn't apply here).
>
> So essentially what the system is doing is marking that it's absolutely
> possible to create 1 region per device, and also 1 region that
> interleaves across each pair of host bridges (I presume this is a
> dual-socket system?).

Correct, it is a dual-socket system. Thank you for the detailed
explanation. It has been helpful for analyzing the code.

> So, tl;dr: All these nodes are valid and this configuration is correct.
>
> Weighted interleave presently works fine as intended, but with the
> inclusion of the auto-configuration, there will be issues for your
> system configuration. This means we probably need to consider
> merging these as a group.
>

We believe our proposed hotplug patch should be added as a hotfix to
v6.14, because it can be addressed independently of the
auto-configuration feature. Rakie will send a v2 patch soon.
> During boot, the following will occur:
>
> 1) drivers/acpi/numa/srat.c marks 12 nodes as possible
>    0-1)  Socket nodes
>    2-3)  Cross-host-bridge interleave nodes
>    4-11) Single-region nodes
>
> 2) drivers/cxl/* will probe the various devices and create
>    a root decoder for each CXL Fixed Memory Window
>    decoder0.0 - decoder11.0 (or maybe decoder0.0 - decoder0.11)
>
> 3) During probe, auto-configuration of weighted interleave occurs as a
>    result of this code being called with HMAT or CDAT data:
>
> void node_set_perf_attrs() {
>     ...
>     /* When setting CPU access coordinates, update mempolicy */
>     if (access == ACCESS_COORDINATE_CPU) {
>         if (mempolicy_set_node_perf(nid, coord)) {
>             pr_info("failed to set mempolicy attrs for node %d\n",
>                     nid);
>         }
>     }
>     ...
> }
>
> Under the current system, since we calculate with N_POSSIBLE, all nodes
> will be assigned weights (assuming HMAT or CDAT data is available for
> all of them).
>
> We actually have a few issues here:
>
> 1) If all nodes are included in the weighting reduction, we're actually
>    over-representing a particular set of hardware. The interleave node
>    and the individual device nodes would actually over-represent the
>    bandwidth available (compared to the CPU nodes).
>
> 2) As stated on this patch line, just switching to N_MEMORY causes
>    issues with hotplug - the bandwidth can be reported, but if memory
>    hasn't been added yet then we'll end up with wrong weights because
>    it wasn't included in the calculation.
>
> 3) However, not exposing the nodes because N_MEMORY isn't set yet
>    a) prevents pre-configuration before memory is onlined, and
>    b) hides the implications of hotplugging memory into a node from
>       the user (adding memory causes a re-weight and may affect an
>       interleave-all configuration).
>
> But - I think it's reasonable that anyone using weighted interleave is
> *probably* not going to have nodes come and go. It just seems like a
> corner case that isn't reasonable to spend time supporting.
> So coming back around to the hotplug patch line, I do think it's
> reasonable to hide nodes marked !N_MEMORY, but consider two issues:
>
> 1) In auto mode, we need to re-weight on hotplug to only include
>    onlined nodes. This is because the reduction may be sensitive
>    to the available bandwidth changes.
>
>    This behavior needs to be clearly documented.
>
> 2) We need to clearly define what the weight of a node will be when,
>    in manual mode, a node goes (memory -> no memory -> memory):
>    a) Does it retain its old, manually set weight?
>    b) Does it revert to 1?
>
> Sorry for the long email, just working through all the implications.
>
> I think the proposed hotplug patch is a requirement for the
> auto-configuration patch set.
>
> ~Gregory

Best regards,
Yunjeong