On Tue, Jul 30, 2024 at 09:12:55AM +0800, Huang, Ying wrote:
> Gregory Price <gourry@xxxxxxxxxx> writes:
>
> > On Mon, Jul 29, 2024 at 09:02:33AM +0800, Huang, Ying wrote:
> >> Gregory Price <gourry@xxxxxxxxxx> writes:
> >>
> >> > In the event that hmat data is not available for the DRAM tier,
> >> > or if it is invalid (bandwidth or latency is 0), we can still register
> >> > a callback to calculate the abstract distance for non-cpu nodes
> >> > and simply assign it a different tier manually.
> >> >
> >> > In the case where DRAM HMAT values are missing or not sane we
> >> > manually assign adist=(MEMTIER_ADISTANCE_DRAM + MEMTIER_CHUNK_SIZE).
> >> >
> >> > If the HMAT data for the non-cpu tier is invalid (e.g. bw = 0), we
> >> > cannot reasonable determine where to place the tier, so it will default
> >> > to MEMTIER_ADISTANCE_DRAM (which is the existing behavior).
> >>
> >> Why do we need this?  Do you have machines with broken HMAT table?  Can
> >> you ask the vendor to fix the HMAT table?
> >>
> >
> > It's a little unclear from the ACPI specification whether HMAT is
> > technically optional or not (given that the kernel handles missing HMAT
> > gracefully, it certainly seems optional). In one scenario I have seen
> > incorrect data, and in another scenario I have seen the HMAT omitted
> > entirely. In another scenario I have seen the HMAT-SLLBI omitted while
> > the CDAT is present.
>
> IIUC, HMAT is optional.  Is it possible for you to ask the system vendor
> to fix the broken HMAT table.
>

In this case we are (BW=0), but in the other cases, there is technically
nothing broken.  That's my concern.

> > In all scenarios the result is the same: all nodes in the same tier.
>
> I don't think so, in drivers/dax/kmem.c, we will put memory devices
> onlined by kmem.c in another tier by default.
>

This presumes driver configured devices, which is not always the case.
kmem.c will set MEMTIER_DEFAULT_DAX_ADISTANCE but if BIOS/EFI has set up
the node instead, you get the default of MEMTIER_ADISTANCE_DRAM if HMAT
is not present or otherwise not sane.

Not everyone is going to have the ability to get a platform vendor to
fix a BIOS bug, and I've seen this in production.

> > The HMAT is explicitly described as "A hint" in the ACPI spec.
> >
> > ACPI 5.2.28.1 HMAT Overview
> >
> > "The software is expected to use this information as a hint for
> > optimization, or when the system has heterogeneous memory"
> >
> > If something is "a hint", then it should not be used prescriptively.
> >
> > Right now HMAT appears to be used prescriptively, this despite the fact
> > that there was a clear intent to separate CPU-nodes and non-CPU-nodes in
> > the memory-tier code. So this patch simply realizes this intent when the
> > hints are not very reasonable.
>
> If HMAT isn't available, it's hard to put memory devices to
> appropriate memory tiers without other information.

Not having a CPU is "other information".  What tier a device belongs to
is really arbitrary, "appropriate" is at best a codified opinion.

> In commit
> 992bf77591cb ("mm/demotion: add support for explicit memory tiers"),
> Aneesh pointed out that it doesn't work for his system to put
> non-CPU-nodes in lower tier.
>

This seems like a bug / something else incorrect.  I will investigate.

> Even if we want to use other information to put memory devices to memory
> tiers, we can register another adist calculation callback instead of
> reusing hmat callback.
>

I suppose during init, we could register a default adist callback with
CPU/non-CPU checks if HMAT is not sane.  I can look at that.
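
To make the CPU/non-CPU fallback concrete, here is a rough userspace
model of the decision, not kernel code and not tested against the tree.
The names struct node_perf, perf_is_sane() and fallback_adist() are
hypothetical, the MEMTIER_* values are stand-ins that may not match
include/linux/memory-tiers.h, and treating "either side not sane" as the
trigger is my assumption:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in values for illustration only; the real definitions live in
 * include/linux/memory-tiers.h and may differ. */
#define MEMTIER_CHUNK_BITS	10
#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
#define MEMTIER_ADISTANCE_DRAM	((4 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE >> 1))

/* Hypothetical summary of what HMAT reported for one node. */
struct node_perf {
	unsigned int bandwidth;	/* 0 means missing or invalid */
	unsigned int latency;	/* 0 means missing or invalid */
};

/* "Sane" here only means: data present, bandwidth and latency nonzero. */
static bool perf_is_sane(const struct node_perf *perf)
{
	return perf != NULL && perf->bandwidth != 0 && perf->latency != 0;
}

/*
 * Hypothetical fallback. Returns true and fills *adist when it handled
 * the node; returns false (leaving *adist untouched) when the HMAT data
 * looks usable and the HMAT-based calculation should decide instead.
 */
static bool fallback_adist(const struct node_perf *dram_perf,
			   const struct node_perf *node_perf,
			   bool node_has_cpu, int *adist)
{
	assert(adist != NULL);

	if (perf_is_sane(dram_perf) && perf_is_sane(node_perf))
		return false;

	/* CPU nodes stay in the DRAM tier, CPU-less nodes go one chunk above. */
	*adist = node_has_cpu ? MEMTIER_ADISTANCE_DRAM
			      : MEMTIER_ADISTANCE_DRAM + MEMTIER_CHUNK_SIZE;
	return true;
}
```

In the kernel this logic would presumably sit in whatever callback gets
registered for adist calculation; the model above only shows the
decision itself.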

It might also be worth having some kind of modal mechanism, like:

echo "auto"     > /sys/.../memory_tiering/mode  # Auto select mode
echo "hmat"     > /sys/.../memory_tiering/mode  # Use HMAT Info
echo "simple"   > /sys/.../memory_tiering/mode  # CPU vs non-CPU Node
echo "topology" > /sys/.../memory_tiering/mode  # More complex

This would abstract away the hardware complexities as best as possible.

But the first step here would be creating two modes.  HMAT-is-sane and
CPU/Non-CPU seem reasonable to me, but I'm open to opinions.

~Gregory