alison.schofield@ wrote:
> From: Alison Schofield <alison.schofield@xxxxxxxxx>
>
> numa_fill_memblks() fills in the gaps in numa_meminfo memblks
> over an HPA address range.
>
> The ACPI driver will use numa_fill_memblks() to implement a new Linux
> policy that prescribes extending proximity domains in a portion of a
> CFMWS window to the entire window.
>
> Dan Williams offered this explanation of the policy:
> A CFWMS is an ACPI data structure that indicates *potential* locations
> where CXL memory can be placed. It is the playground where the CXL
> driver has free reign to establish regions. That space can be populated
> by BIOS created regions, or driver created regions, after hotplug or
> other reconfiguration.
>
> When BIOS creates a region in a CXL Window it additionally describes
> that subset of the Window range in the other typical ACPI tables SRAT,
> SLIT, and HMAT. The rationale for BIOS not pre-describing the entire
> CXL Window in SRAT, SLIT, and HMAT is that it can not predict the
> future. I.e. there is nothing stopping higher or lower performance
> devices being placed in the same Window. Compare that to ACPI memory
> hotplug that just onlines additional capacity in the proximity domain
> with little freedom for dynamic performance differentiation.
>
> That leaves the OS with a choice, should unpopulated window capacity
> match the proximity domain of an existing region, or should it allocate
> a new one? This patch takes the simple position of minimizing proximity
> domain proliferation by reusing any proximity domain intersection for
> the entire Window. If the Window has no intersections then allocate a
> new proximity domain. Note that SRAT, SLIT and HMAT information can be
> enumerated dynamically in a standard way from device provided data.
> Think of CXL as the end of ACPI needing to describe memory attributes,
> CXL offers a standard discovery model for performance attributes, but
> Linux still needs to interoperate with the old regime.
>
> Reported-by: Derick Marks <derick.w.marks@xxxxxxxxx>
> Suggested-by: Dan Williams <dan.j.williams@xxxxxxxxx>
> Signed-off-by: Alison Schofield <alison.schofield@xxxxxxxxx>
> Tested-by: Derick Marks <derick.w.marks@xxxxxxxxx>
[..]
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 2aadb2019b4f..fa82141d1a04 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
[..]
> @@ -961,4 +962,90 @@ int memory_add_physaddr_to_nid(u64 start)
>          return nid;
>  }
>  EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> +
> +static int __init cmp_memblk(const void *a, const void *b)
> +{
> +        const struct numa_memblk *ma = *(const struct numa_memblk **)a;
> +        const struct numa_memblk *mb = *(const struct numa_memblk **)b;
> +
> +        if (ma->start != mb->start)
> +                return (ma->start < mb->start) ? -1 : 1;
> +
> +        /* Caller handles duplicate start addresses */
> +        return 0;

This can be simplified to:

static int __init cmp_memblk(const void *a, const void *b)
{
        const struct numa_memblk *ma = *(const struct numa_memblk **)a;
        const struct numa_memblk *mb = *(const struct numa_memblk **)b;

        return ma->start - mb->start;
}

> +}
> +
> +static struct numa_memblk *numa_memblk_list[NR_NODE_MEMBLKS] __initdata;
> +
> +/**
> + * numa_fill_memblks - Fill gaps in numa_meminfo memblks
> + * @start: address to begin fill
> + * @end: address to end fill
> + *
> + * Find and extend numa_meminfo memblks to cover the @start-@end
> + * HPA address range, such that the first memblk includes @start,
> + * the last memblk includes @end, and any gaps in between are
> + * filled.
> + *
> + * RETURNS:
> + * 0 : Success
> + * NUMA_NO_MEMBLK : No memblk exists in @start-@end range
> + */
> +
> +int __init numa_fill_memblks(u64 start, u64 end)
> +{
> +        struct numa_memblk **blk = &numa_memblk_list[0];
> +        struct numa_meminfo *mi = &numa_meminfo;
> +        int count = 0;
> +        u64 prev_end;
> +
> +        /*
> +         * Create a list of pointers to numa_meminfo memblks that
> +         * overlap start, end. Exclude (start == bi->end) since
> +         * end addresses in both a CFMWS range and a memblk range
> +         * are exclusive.
> +         *
> +         * This list of pointers is used to make in-place changes
> +         * that fill out the numa_meminfo memblks.
> +         */

Thanks for this comment, looks good.

> +        for (int i = 0; i < mi->nr_blks; i++) {
> +                struct numa_memblk *bi = &mi->blk[i];
> +
> +                if (start < bi->end && end >= bi->start) {
> +                        blk[count] = &mi->blk[i];
> +                        count++;
> +                }
> +        }
> +        if (!count)
> +                return NUMA_NO_MEMBLK;
> +
> +        /* Sort the list of pointers in memblk->start order */
> +        sort(&blk[0], count, sizeof(blk[0]), cmp_memblk, NULL);
> +
> +        /* Make sure the first/last memblks include start/end */
> +        blk[0]->start = min(blk[0]->start, start);
> +        blk[count - 1]->end = max(blk[count - 1]->end, end);
> +
> +        /*
> +         * Fill any gaps by tracking the previous memblks end address,
> +         * prev_end, and backfilling to it if needed. Avoid filling
> +         * overlapping memblks by making prev_end monotonically non-
> +         * decreasing.

I am not immediately understanding the use of the term monotonically
non-decreasing here. I think the first sentence of this comment is
enough, or am I missing a nuance?
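Trying to answer my own question with a quick standalone sketch
(userspace, simplified struct and made-up ranges, so not the patch code
verbatim), where one memblk sits entirely inside the previous one:

#include <stdio.h>

struct blk { unsigned long long start, end; };

int main(void)
{
        /* already sorted by start; b[1] is fully contained in b[0] */
        struct blk b[] = { { 0, 100 }, { 10, 20 }, { 150, 200 } };
        unsigned long long prev_end = b[0].end;

        for (int i = 1; i < 3; i++) {
                if (prev_end >= b[i].start) {
                        if (prev_end < b[i].end)
                                prev_end = b[i].end;
                } else {
                        b[i].start = prev_end;
                        prev_end = b[i].end;
                }
        }
        /* prints 100: the gap back to b[0]'s end is filled */
        printf("b[2].start = %llu\n", b[2].start);
        return 0;
}

If prev_end were simply assigned curr->end on every iteration, it would
drop from 100 to 20 at b[1], and b[2].start would then be pulled back to
20, overlapping b[0]. If that is the nuance, maybe a word about
fully-contained memblks would read more easily than "monotonically
non-decreasing".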
> +         */
> +        prev_end = blk[0]->end;
> +        for (int i = 1; i < count; i++) {
> +                struct numa_memblk *curr = blk[i];
> +
> +                if (prev_end >= curr->start) {
> +                        if (prev_end < curr->end)
> +                                prev_end = curr->end;
> +                } else {
> +                        curr->start = prev_end;
> +                        prev_end = curr->end;
> +                }
> +        }
> +        return 0;
> +}
> +EXPORT_SYMBOL_GPL(numa_fill_memblks);

This export is not needed. The only caller of this is
drivers/acpi/numa/srat.c which is only ever built-in, not a module.
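I.e. the implied change is just to drop that last line, so the hunk ends
with (sketch):

        return 0;
}

leaving numa_fill_memblks() as a plain built-in symbol.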