On Fri, Mar 14, 2025 at 03:55:44PM +0800, Yuquan Wang wrote: > On Thu, Mar 13, 2025 at 09:28:46AM -0700, Alison Schofield wrote: > > On Thu, Mar 13, 2025 at 02:09:07PM +0800, Yuquan Wang wrote: > > > The absence of SRAT would cause the fake_pxm to be -1 and increment > > > to 0, then send to acpi_parse_cfmws(). If there exists CXL memory > > > ranges that are defined in the CFMWS and not already defined in the > > > SRAT, the new node (node0) for the CXL memory would be invalid, as > > > node0 is already in "used". > > > > > > This utilizes disable_srat() & srat_disabled() to fail CXL init. > > > > Seems like this fixup has drifted from adjusting the fake_pxm to > > shutting down CXL parsing. More below - > > > > > > > > Signed-off-by: Yuquan Wang <wangyuquan1236@xxxxxxxxxxxxxx> > > > --- > > > > > > Changes in v2: > > > - Add disable_srat() when fake_pxm is invalid > > > - Add srat_disabled() check in cxl_acpi_probe() and acpi_parse_cfmws() > > > > > > > > > drivers/acpi/numa/srat.c | 10 ++++++++++ > > > drivers/cxl/acpi.c | 4 ++++ > > > 2 files changed, 14 insertions(+) > > > > > > diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c > > > index 00ac0d7bb8c9..2dac25c9258a 100644 > > > --- a/drivers/acpi/numa/srat.c > > > +++ b/drivers/acpi/numa/srat.c > > > @@ -441,6 +441,11 @@ static int __init acpi_parse_cfmws(union acpi_subtable_headers *header, > > > start = cfmws->base_hpa; > > > end = cfmws->base_hpa + cfmws->window_size; > > > > > > + if (srat_disabled()) { > > > + pr_err("SRAT is missing or bad while processing CFMWS.\n"); > > > + return -EINVAL; > > > + } > > > + > > > > This goes too far by shutting down cfmws parsing for lack of SRAT. > > > > Actually, I thought there need another patch to fix the follow problem > that the fake node bring when no SRAT. > > Detailed description below. > > > > /* > > > * The SRAT may have already described NUMA details for all, > > > * or a portion of, this CFMWS HPA range. Extend the memblks > > > @@ -646,6 +651,11 @@ int __init acpi_numa_init(void) > > > if (node_to_pxm_map[i] > fake_pxm) > > > fake_pxm = node_to_pxm_map[i]; > > > } > > > + > > > + /* Make sure CFMWs fake nodes start at node[1] */ > > > + if (fake_pxm < 0) > > > + disable_srat(); > > > + > > > > How does the code above make sure fake node starts at node[1]? > > Would an explicit adjustment like this work? > > Thanks for your correction :) Yes, the way I used here is too implicit. > > > > > - last_real_pxm = fake_pxm; > > - fake_pxm++; > > + fake_pxm = max(fake_pxm, 1); > > + last_real_pxm = fake_pxm--; > > I tried the adjustment below: > > fake_pxm = max(fake_pxm, 0); // 0 because it will increment to 1 > last_real_pxm = fake_pxm++; > > This works but it might only control the parameter sent to acpi_parse_cfmws(). > According to acpi_map_pxm_to_node(), altough the input fake_pxm is 1 when no > SRAT, the returned node would still be 0 and the following nodes are aslo > incorrect. > > Hence, I tried add a new line below: > > fake_pxm = max(fake_pxm, 0); > last_real_pxm = fake_pxm++; > node_set(0, nodes_found_map); > > As no matter what situation, node[0] would be found and set. With this > setting, acpi_map_pxm_to_node() could return the expected node value > even if no SRAT. :( > > Unfortunately, when we use "cxl create-region" to enable our cxl memory, > it would still be assigned to node[0], because the "numa_add_memblk()" > can only add numa_memblk to numa_meminfo list. > > If our SRAT is OK, the numa_memblks_init() would then utilize > numa_move_tail_memblk() to move the numa_memblk from numa_meminfo to > numa_reserved_meminfo in CFMWs fake node situation. If SRAT is missing > or bad, the numa_memblks_init() would fail since init_func() would fail. > And it causes that no numa_memblk in numa_reserved_meminfo list and the > following dax&memory_hotplug drivers could not online the expected fake > node. > > Based on the above problem, I have a new patch idea that introduce a new > function in mm/numa_memblks.c: numa_add_reserved_memblk(). It could add > one numa_memblk to nuam_reserved_meminfo directly. Maybe we could call > it in acpi_parse_cfmws() if srat is missing. > > In mm/numa_memblks.c: > > int __init numa_add_reserved_memblk(int nid, u64 start, u64 end) > { > return numa_add_memblk_to(nid, start, end, &numa_reserved_meminfo); > } > > In drivers/acpi/numa/srat.c: > > if (srat_disabled()) { > if (numa_add_reserved_memblk(node, start, end) < 0) { > pr_warn("ACPI NUMA: Failed to add memblk for CFMWS node %d [mem %#llx-%#llx]\n", > node, start, end); > } > } > > :( But..., the dax_kmem driver will fail because something wrong in > memory_group_register_static(). The good result is our cxl memory would > not be assigned to node[0] anymore! > > BTW, as papering these things looks like not easily, I chose to aggressively > fail the acpi_parse_cfmws() in srat.c since it mainly works for building > cxl fake nodes and also fail the CXL init in cxl_acpi_probe per Jonathan. > > Link: https://lists.nongnu.org/archive/html/qemu-devel/2025-03/msg03668.html > > Hopes more comments to guide me! I'm a really rookie in kernel community :P > No worries Yuquan. This is how we do it. Keep iterating! I see you, Jonathan, and Gregory chatting about this a bit on the LSFMM thread, and they are probably following along better than I. I'm getting a bit lost in the diffs here in this thread. At this point, I suggest sending the next revision of this patch with an updated commit log describing the failing scenario, how the changes solve 'your' problem, and why it doesn't break anything. > > > last_real_pxm = fake_pxm; > > > fake_pxm++; > > > acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws, > > > diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c > > > index cb14829bb9be..e75a8ead99f6 100644 > > > --- a/drivers/cxl/acpi.c > > > +++ b/drivers/cxl/acpi.c > > > @@ -829,6 +829,10 @@ static int cxl_acpi_probe(struct platform_device *pdev) > > > if (rc) > > > return rc; > > > > > > + /* CXL must be in a NUMA system */ > > > + if (srat_disabled()) > > > + return -EINVAL; > > > + > > > cxl_res = devm_kzalloc(host, sizeof(*cxl_res), GFP_KERNEL); > > > if (!cxl_res) > > > return -ENOMEM; > > > -- > > > 2.34.1 > > > >