Gregory Price <gregory.price@xxxxxxxxxxxx> writes:

> On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:
>> >
>> > From https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf
>> > abstract_distance_offset: override by users to deal with firmware issues.
>> >
>> > Say firmware can configure the cxl node into the wrong tier; similarly,
>> > it may also configure all cxl nodes into a single memtype, hence
>> > all these nodes can fall into a single wrong tier.
>> > In this case, a per-node adistance_offset would be good to have?
>>
>> I think that it's better to fix the erroneous firmware if possible. And
>> these are only theoretical, not practical, issues. Do you have some
>> practical issues?
>>
>> I understand that users may want to move nodes between memory tiers for
>> different policy choices. For that, a memory_type based adistance_offset
>> should be good.
>>
>
> There's actually an affirmative case for changing memory tiering to allow
> either movement of nodes between tiers, or at least basing placement on
> HMAT information. Preferably, membership would be changeable to allow
> hotplug/DCD to be managed (there's no guarantee that the memory passed
> through will always be what HMAT says on initial boot).

IIUC, from Jonathan Cameron as below, the performance of memory
shouldn't change even for DCD devices.

https://lore.kernel.org/linux-mm/20231103141636.000007e4@xxxxxxxxxx/

It's possible for the performance of a NUMA node to change if we
hot-remove a memory device and then hot-add another, different memory
device. It's hoped that the CDAT changes accordingly.

So, all in all, HMAT + CDAT can help us put the memory device in the
appropriate memory tier. Now, we have HMAT support upstream. We will
be working on CDAT support. (A rough sketch of how such performance
data could map to an abstract distance is appended at the end of this
message.)

--
Best Regards,
Huang, Ying

> https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@xxxxxxxxxxxxxx/
>
> This group wants to enable passing CXL memory through to KVM/QEMU
> (i.e. host CXL expander memory passed through to the guest), and to
> allow the guest to apply memory tiering.
>
> There are multiple issues with this, presently:
>
> 1. The QEMU CXL virtual device is not, and probably never will be,
> performant enough for commodity-class virtualization. The
> reason is that the virtual CXL device is built off the I/O
> virtualization stack, which treats memory accesses as I/O accesses.
>
> KVM also seems incompatible with the design of the CXL memory device
> in general, but this problem may or may not be a blocker.
>
> As a result, access to a virtual CXL memory device leads to QEMU
> crawling to a halt - and this is unlikely to change.
>
> There is presently no good way forward to create a performant virtual
> CXL device in QEMU. This means the memory tiering component in the
> kernel is functionally useless for virtual CXL memory, because...
>
> 2. When passing memory through as an explicit NUMA node, but not as
> part of a CXL memory device, the nodes are lumped together in the
> DRAM tier.
>
> None of this has to do with firmware.
>
> Memory-type is an awful way of denoting membership of a tier, but we
> have HMAT information that can be passed through via QEMU:
>
> -object memory-backend-ram,size=4G,id=ram-node0 \
> -object memory-backend-ram,size=4G,id=ram-node1 \
> -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \
> -numa node,initiator=0,nodeid=1,memdev=ram-node1 \
> -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
> -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
> -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
> -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
>
> Not only would it be nice if we could change tier membership based on
> this data, it's realistically the only way to allow guests to accomplish
> memory tiering w/ KVM/QEMU and CXL memory passed through to the guest.
>
> ~Gregory
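
For illustration, below is a minimal sketch of the mapping mentioned
above: folding HMAT-style latency/bandwidth numbers for a node into a
single abstract distance relative to a DRAM baseline, so that slower
memory lands in a lower tier. The baseline constant, the struct, and
the sample numbers are invented for this example; this is not the
kernel's actual memory-tiers implementation.

/*
 * Sketch only: derive an abstract distance for a node from its
 * HMAT-style performance numbers, relative to a DRAM baseline.
 * Higher latency and lower bandwidth both push the distance up,
 * i.e. toward a slower tier.
 */
#include <stdio.h>

#define ADIST_DRAM 512  /* assumed baseline distance for plain DRAM */

struct node_perf {
        unsigned int read_latency;    /* ns */
        unsigned int write_latency;   /* ns */
        unsigned int read_bandwidth;  /* MB/s */
        unsigned int write_bandwidth; /* MB/s */
};

/* Baseline taken from whatever node(s) are treated as default DRAM. */
static const struct node_perf dram_perf = {
        .read_latency = 10, .write_latency = 10,
        .read_bandwidth = 10240, .write_bandwidth = 10240,
};

static int perf_to_adistance(const struct node_perf *perf)
{
        unsigned long lat = perf->read_latency + perf->write_latency;
        unsigned long bw = perf->read_bandwidth + perf->write_bandwidth;
        unsigned long dram_lat = dram_perf.read_latency + dram_perf.write_latency;
        unsigned long dram_bw = dram_perf.read_bandwidth + dram_perf.write_bandwidth;

        if (!lat || !bw)
                return -1;  /* no usable performance data for this node */

        /* Scale the DRAM baseline by the latency and bandwidth ratios. */
        return ADIST_DRAM * lat / dram_lat * dram_bw / bw;
}

int main(void)
{
        /* Mirrors the 2x latency / half bandwidth split of the QEMU example. */
        struct node_perf fast = { 10, 10, 10240, 10240 };
        struct node_perf slow = { 20, 20, 5120, 5120 };

        printf("fast node adistance: %d\n", perf_to_adistance(&fast));
        printf("slow node adistance: %d\n", perf_to_adistance(&slow));
        return 0;
}

With those numbers the slower node comes out at roughly four times the
baseline distance, which is the kind of separation that would place it
in a slower tier than plain DRAM.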
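
To check where nodes actually end up inside the guest, a small helper
(again, just a sketch) can read the memory tiering sysfs interface;
the path below is the one exposed by recent kernels with memory
tiering enabled and may be absent on older kernels. If all
passed-through nodes show up under the same tier's nodelist, they have
been lumped together as described in point 2 above.

/*
 * Print the NUMA node membership of each memory tier by reading the
 * nodelist files under /sys/devices/virtual/memory_tiering/.
 */
#include <glob.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
        const char *pat =
                "/sys/devices/virtual/memory_tiering/memory_tier*/nodelist";
        glob_t g;

        if (glob(pat, 0, NULL, &g) != 0) {
                fprintf(stderr, "no memory tiers found (is memory tiering enabled?)\n");
                return 1;
        }

        for (size_t i = 0; i < g.gl_pathc; i++) {
                char buf[256] = "";
                FILE *f = fopen(g.gl_pathv[i], "r");

                if (!f)
                        continue;
                if (fgets(buf, sizeof(buf), f))
                        buf[strcspn(buf, "\n")] = '\0';  /* strip newline */
                fclose(f);
                printf("%s: %s\n", g.gl_pathv[i], buf);
        }
        globfree(&g);
        return 0;
}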