Gregory Price <gourry@xxxxxxxxxx> writes:

> On Tue, Jul 30, 2024 at 09:12:55AM +0800, Huang, Ying wrote:
>> Gregory Price <gourry@xxxxxxxxxx> writes:
>>
>> > On Mon, Jul 29, 2024 at 09:02:33AM +0800, Huang, Ying wrote:
>> >> Gregory Price <gourry@xxxxxxxxxx> writes:
>> >>
>> >> > In the event that hmat data is not available for the DRAM tier,
>> >> > or if it is invalid (bandwidth or latency is 0), we can still register
>> >> > a callback to calculate the abstract distance for non-cpu nodes
>> >> > and simply assign it a different tier manually.
>> >> >
>> >> > In the case where DRAM HMAT values are missing or not sane we
>> >> > manually assign adist=(MEMTIER_ADISTANCE_DRAM + MEMTIER_CHUNK_SIZE).
>> >> >
>> >> > If the HMAT data for the non-cpu tier is invalid (e.g. bw = 0), we
>> >> > cannot reasonable determine where to place the tier, so it will default
>> >> > to MEMTIER_ADISTANCE_DRAM (which is the existing behavior).
>> >>
>> >> Why do we need this?  Do you have machines with broken HMAT table?  Can
>> >> you ask the vendor to fix the HMAT table?
>> >>
>> >
>> > It's a little unclear from the ACPI specification whether HMAT is
>> > technically optional or not (given that the kernel handles missing HMAT
>> > gracefully, it certainly seems optional). In one scenario I have seen
>> > incorrect data, and in another scenario I have seen the HMAT omitted
>> > entirely. In another scenario I have seen the HMAT-SLLBI omitted while
>> > the CDAT is present.
>>
>> IIUC, HMAT is optional.  Is it possible for you to ask the system vendor
>> to fix the broken HMAT table.
>>
>
> In this case we are (BW=0), but in the other cases, there is technically
> nothing broken.  That's my concern.
>
>> > In all scenarios the result is the same: all nodes in the same tier.
>>
>> I don't think so, in drivers/dax/kmem.c, we will put memory devices
>> onlined by kmem.c in another tier by default.
>>
>
> This presumes driver configured devices, which is not always the case.
>
> kmem.c will set MEMTIER_DEFAULT_DAX_ADISTANCE
>
> but if BIOS/EFI has set up the node instead, you get the default of
> MEMTIER_ADISTANCE_DRAM if HMAT is not present or otherwise not sane.

The "efi_fake_mem=" kernel parameter can be used to add the
"EFI_MEMORY_SP" flag to the memory range, so that kmem.c can manage it.

> Not everyone is going to have the ability to get a platform vendor to
> fix a BIOS bug, and I've seen this in production.

So, some vendor builds a machine with broken/missing HMAT/CDAT and wants
users to use CXL memory devices in it?  Has the vendor tested whether
CXL memory devices work?

>> > The HMAT is explicitly described as "A hint" in the ACPI spec.
>> >
>> > ACPI 5.2.28.1 HMAT Overview
>> >
>> > "The software is expected to use this information as a hint for
>> > optimization, or when the system has heterogeneous memory"
>> >
>> > If something is "a hint", then it should not be used prescriptively.
>> >
>> > Right now HMAT appears to be used prescriptively, this despite the fact
>> > that there was a clear intent to separate CPU-nodes and non-CPU-nodes in
>> > the memory-tier code. So this patch simply realizes this intent when the
>> > hints are not very reasonable.
>>
>> If HMAT isn't available, it's hard to put memory devices to
>> appropriate memory tiers without other information.
>
> Not having a CPU is "other information".  What tier a device belongs to
> is really arbitrary, "appropriate" is at best a codified opinion.
>
>> In commit
>> 992bf77591cb ("mm/demotion: add support for explicit memory tiers"),
>> Aneesh pointed out that it doesn't work for his system to put
>> non-CPU-nodes in lower tier.
>>
>
> This seems like a bug / something else incorrect.  I will investigate.
>
>> Even if we want to use other information to put memory devices to memory
>> tiers, we can register another adist calculation callback instead of
>> reusing hmat callback.
>>
>
> I suppose during init, we could register a default adist callback with
> CPU/non-CPU checks if HMAT is not sane. I can look at that.
>
> It might also be worth having some kind of modal mechanism, like:
>
> echo "auto" > /sys/.../memory_tiering/mode     # Auto select mode
> echo "hmat" > /sys/.../memory_tiering/mode     # Use HMAT Info
> echo "simple" > /sys/.../memory_tiering/mode   # CPU vs non-CPU Node
> echo "topology" > /sys/.../memory_tiering/mode # More complex
>
> To abstract away the hardware complexities as best as possible.
>
> But the first step here would be creating two modes.  HMAT-is-sane and
> CPU/Non-CPU seems reasonable to me but open to opinions.

IMHO, we should reduce user configurable knobs unless we can prove it is
really necessary.

--
Best Regards,
Huang, Ying