Hao Xiang <hao.xiang@xxxxxxxxxxxxx> writes:

> On Thu, Jan 11, 2024 at 11:02 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>>
>> Hao Xiang <hao.xiang@xxxxxxxxxxxxx> writes:
>>
>> > On Wed, Jan 10, 2024 at 6:18 AM Jonathan Cameron
>> > <Jonathan.Cameron@xxxxxxxxxx> wrote:
>> >>
>> >> On Tue, 9 Jan 2024 16:28:15 -0800
>> >> Hao Xiang <hao.xiang@xxxxxxxxxxxxx> wrote:
>> >>
>> >> > On Tue, Jan 9, 2024 at 9:59 AM Gregory Price <gregory.price@xxxxxxxxxxxx> wrote:
>> >> > >
>> >> > > On Tue, Jan 09, 2024 at 03:50:49PM +0000, Jonathan Cameron wrote:
>> >> > > > On Tue, 09 Jan 2024 11:41:11 +0800
>> >> > > > "Huang, Ying" <ying.huang@xxxxxxxxx> wrote:
>> >> > > > > Gregory Price <gregory.price@xxxxxxxxxxxx> writes:
>> >> > > > > > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:
>> >> > > > > It's possible for the performance of a NUMA node to change, if we
>> >> > > > > hot-remove a memory device, then hot-add another different memory
>> >> > > > > device. It's hoped that the CDAT changes too.
>> >> > > >
>> >> > > > Not supported, but ACPI has _HMA methods that in theory allow changing
>> >> > > > HMAT values based on firmware notifications... So we 'could' make
>> >> > > > it work for HMAT based description.
>> >> > > >
>> >> > > > Ultimately my current thinking is we'll end up emulating CXL type3
>> >> > > > devices (hiding topology complexity) and you can update CDAT but
>> >> > > > IIRC that is only meant to be for degraded situations - so if you
>> >> > > > want multiple performance regions, CDAT should describe them from the start.
>> >> > > >
>> >> > >
>> >> > > That was my thought. I don't think it's particularly *realistic* for
>> >> > > HMAT/CDAT values to change at runtime, but I can imagine a case where
>> >> > > it could be valuable.
>> >> > >
>> >> > > > > > https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@xxxxxxxxxxxxxx/
>> >> > > > > >
>> >> > > > > > This group wants to enable passing CXL memory through to KVM/QEMU
>> >> > > > > > (i.e. host CXL expander memory passed through to the guest), and
>> >> > > > > > allow the guest to apply memory tiering.
>> >> > > > > >
>> >> > > > > > There are multiple issues with this, presently:
>> >> > > > > >
>> >> > > > > > 1. The QEMU CXL virtual device is not and probably never will be
>> >> > > > > > performant enough to be a commodity class virtualization.
>> >> > > >
>> >> > > > I'd flex that a bit - we will end up with a solution for virtualization but
>> >> > > > it isn't the emulation that is there today because it's not possible to
>> >> > > > emulate some of the topology in a performant manner (interleaving with sub-page
>> >> > > > granularity / interleaving at all (to a lesser degree)). There are
>> >> > > > ways to do better than we are today, but they start to look like
>> >> > > > software disaggregated memory setups (think lots of page faults in the host).
>> >> > > >
>> >> > >
>> >> > > Agreed, the emulated device as-is can't be the virtualization device,
>> >> > > but it doesn't mean it can't be the basis for it.
>> >> > >
>> >> > > My thought is, if you want to pass host CXL *memory* through to the
>> >> > > guest, you don't actually care to pass CXL *control* through to the
>> >> > > guest. That control lies pretty squarely with the host/hypervisor.
>> >> > >
>> >> > > So, at least in theory, you can just cut the type3 device out of the
>> >> > > QEMU configuration entirely and just pass it through as a distinct numa
>> >> > > node with specific hmat qualities.
>> >> > >
>> >> > > Barring that, if we must go through the type3 device, the question is
>> >> > > how difficult would it be to just make a stripped down type3 device
>> >> > > to provide the informational components, but hack off anything
>> >> > > topology/interleave related? Then you just do direct passthrough as you
>> >> > > described below.
>> >> > >
>> >> > > qemu/kvm would report errors if you tried to touch the naughty bits.
>> >> > >
>> >> > > The second question is... is that device "compliant" or does it need
>> >> > > super special handling from the kernel driver :D? If what i described
>> >> > > is not "compliant", then it's probably a bad idea, and KVM/QEMU should
>> >> > > just hide the CXL device entirely from the guest (for this use case)
>> >> > > and just pass the memory through as a numa node.
>> >> > >
>> >> > > Which gets us back to: The memory-tiering component needs a way to
>> >> > > place nodes in different tiers based on HMAT/CDAT/User Whim. All three
>> >> > > of those seem like totally valid ways to go about it.
>> >> > >
>> >> > > > > >
>> >> > > > > > 2. When passing memory through as an explicit NUMA node, but not as
>> >> > > > > > part of a CXL memory device, the nodes are lumped together in the
>> >> > > > > > DRAM tier.
>> >> > > > > >
>> >> > > > > > None of this has to do with firmware.
>> >> > > > > >
>> >> > > > > > Memory-type is an awful way of denoting membership of a tier, but we
>> >> > > > > > have HMAT information that can be passed through via QEMU:
>> >> > > > > >
>> >> > > > > > -object memory-backend-ram,size=4G,id=ram-node0 \
>> >> > > > > > -object memory-backend-ram,size=4G,id=ram-node1 \
>> >> > > > > > -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \
>> >> > > > > > -numa node,initiator=0,nodeid=1,memdev=ram-node1 \
>> >> > > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
>> >> > > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
>> >> > > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
>> >> > > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
>> >> > > > > >
>> >> > > > > > Not only would it be nice if we could change tier membership based on
>> >> > > > > > this data, it's realistically the only way to allow guests to accomplish
>> >> > > > > > memory tiering w/ KVM/QEMU and CXL memory passed through to the guest.
>> >> > > >
>> >> > > > This I fully agree with. There will be systems with a bunch of normal DDR with different
>> >> > > > access characteristics irrespective of CXL. + likely HMAT solutions will be used
>> >> > > > before we get anything more complex in place for CXL.
>> >> > > >
>> >> > >
>> >> > > Had not even considered this, but that's completely accurate as well.
>> >> > >
>> >> > > And more discretely: What of devices that don't provide HMAT/CDAT? That
>> >> > > isn't necessarily a violation of any standard. There probably could be
>> >> > > a release valve for us to still make those devices useful.
>> >> > >
>> >> > > The concern I have with not implementing a movement mechanism *at all*
>> >> > > is that a one-size-fits-all initial-placement heuristic feels gross
>> >> > > when we're, at least ideologically, moving toward "software defined memory".
>> >> > >
>> >> > > Personally I think the movement mechanism is a good idea that gets folks
>> >> > > where they're going sooner, and it doesn't hurt anything by existing. We
>> >> > > can change the initial placement mechanism too.
>> >> >
>> >> > I think providing users a way to "FIX" the memory tiering is a backup
>> >> > option. Given that DDRs with different access characteristics provide
>> >> > the relevant CDAT/HMAT information, the kernel should be able to
>> >> > correctly establish memory tiering on boot.
>> >>
>> >> Include hotplug and I'll be happier! I know that's messy though.
>> >>
>> >> > Current memory tiering code has
>> >> > 1) memory_tier_init() to iterate through all boot onlined memory
>> >> > nodes. All nodes are assumed to be fast tier (adistance
>> >> > MEMTIER_ADISTANCE_DRAM is used).
>> >> > 2) dev_dax_kmem_probe to iterate through all devdax controlled memory
>> >> > nodes. This is the place the kernel reads the memory attributes from
>> >> > HMAT and places the memory nodes into the correct tier (devdax
>> >> > controlled CXL, pmem, etc).
>> >> > If we want DDRs with different memory characteristics to be put into
>> >> > the correct tier (as in the guest VM memory tiering case), we probably
>> >> > need a third path to iterate the boot onlined memory nodes and also be
>> >> > able to read their memory attributes. I don't think we can do that in
>> >> > 1) because the ACPI subsystem is not yet initialized.
>> >>
>> >> Can we move it later in general? Or drag HMAT parsing earlier?
>> >> ACPI table availability is pretty early, it's just that we don't bother
>> >> with HMAT because nothing early uses it.
>> >> IIRC SRAT parsing occurs way before memory_tier_init() will be called.
>> >
>> > I tested the call sequence under a debugger earlier. hmat_init() is
>> > called after memory_tier_init(). Let me poke around and see what our
>> > options are.
>>
>> This sounds reasonable.
>>
>> Please keep in mind that we need a way to identify the baseline memory
>> type (default_dram_type). A simple method is to use NUMA nodes with CPU
>> attached. But I remember that Aneesh said that some NUMA nodes without
>> CPU will need to be put in default_dram_type too on their systems. We
>> need a way to identify that.
>
> Yes, I am doing some prototyping the way you described. In
> memory_tier_init(), we will just set the memory tier for the NUMA
> nodes with CPU. In hmat_init(), I am trying to call back to mm to
> finish the memory tier initialization for the CPUless NUMA nodes. If a
> CPUless NUMA node can't get the effective adistance from
> mt_calc_adistance(), we will fall back to adding that node to
> default_dram_type.

Sounds reasonable to me.

> The other thing I want to experiment is to call mt_calc_adistance() on
> a memory node with CPU and see what kind of adistance will be
> returned. Anyway, we need a baseline to start.

The abstract distance is calculated based on the ratio of the
performance of a node to that of the default DRAM node.

--
Best Regards,
Huang, Ying
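[Editor's note: to make the last point concrete, below is a minimal, self-contained
sketch in plain userspace C of the ratio-based "abstract distance" idea described
above. It is not the actual mm/memory-tiers.c or mt_calc_adistance() code from the
thread; the struct, the helper name perf_to_adistance(), the exact formula, and the
MEMTIER_ADISTANCE_DRAM value are illustrative assumptions only.]

/*
 * Sketch: scale the DRAM baseline adistance by how much slower (or faster)
 * a node is than the default DRAM node, using latency and bandwidth ratios.
 */
#include <stdio.h>

#define MEMTIER_ADISTANCE_DRAM 576ULL	/* assumed baseline for the default DRAM tier */

struct node_perf {
	unsigned long long read_latency;	/* ns */
	unsigned long long write_latency;	/* ns */
	unsigned long long read_bandwidth;	/* KiB/s */
	unsigned long long write_bandwidth;	/* KiB/s */
};

static unsigned long long perf_to_adistance(const struct node_perf *dram,
					    const struct node_perf *node)
{
	unsigned long long adist = MEMTIER_ADISTANCE_DRAM;

	/* Higher latency than the DRAM baseline => larger adistance (slower tier). */
	adist = adist * (node->read_latency + node->write_latency) /
			(dram->read_latency + dram->write_latency);
	/* Lower bandwidth than the DRAM baseline => larger adistance as well. */
	adist = adist * (dram->read_bandwidth + dram->write_bandwidth) /
			(node->read_bandwidth + node->write_bandwidth);

	return adist;
}

int main(void)
{
	/* Made-up numbers in the spirit of the hmat-lb QEMU example above. */
	struct node_perf dram = { 10, 10, 10485760, 10485760 };
	struct node_perf cxl  = { 20, 20,  5242880,  5242880 };

	printf("DRAM adistance: %llu\n", perf_to_adistance(&dram, &dram)); /* 576 */
	printf("CXL  adistance: %llu\n", perf_to_adistance(&dram, &cxl));  /* 2304, slower tier */
	return 0;
}

With a scheme like this, a node whose adistance comes out near the DRAM baseline would
land in the default DRAM tier, while nodes with larger values would be placed in slower
tiers; that is the baseline question being discussed for CPUless nodes above.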