Thanks, Dolpher. Replies inline.

On 2016-12-21 11:56, Du, Dolpher wrote:
Shaohe was dropped from the loop, adding him back.

-----Original Message-----
From: He Chen [mailto:he.chen@xxxxxxxxxxxxxxx]
Sent: Friday, December 9, 2016 3:46 PM
To: Daniel P. Berrange <berrange@xxxxxxxxxx>
Cc: libvir-list@xxxxxxxxxx; Du, Dolpher <dolpher.du@xxxxxxxxx>; Zyskowski, Robert <robert.zyskowski@xxxxxxxxx>; Daniluk, Lukasz <lukasz.daniluk@xxxxxxxxx>; Zang, Rui <rui.zang@xxxxxxxxx>; jdenemar@xxxxxxxxxx
Subject: Re: [RFC] phi support in libvirt

On Mon, Dec 05, 2016 at 04:12:22PM +0000, Feng, Shaohe wrote:

Hi all:

As we know, Intel® Xeon Phi targets high-performance computing and other parallel workloads. Now that QEMU supports Phi virtualization, it is time for libvirt to support Phi.

Can you provide a pointer to the relevant QEMU changes?

Xeon Phi Knights Landing (KNL) has 2 primary hardware features. One is up to 288 CPUs, which needs patches to support, and we are pushing those. The other is Multi-Channel DRAM (MCDRAM), which does not need any changes currently.

Let me introduce more about MCDRAM. MCDRAM is on-package high-bandwidth memory (~500GB/s). On the KNL platform, the hardware exposes MCDRAM to the OS as a separate, CPU-less, remote NUMA node, so MCDRAM is not allocated from by default (since the MCDRAM node has no CPU, every CPU regards it as a remote node). In this way, MCDRAM can be reserved for specific applications.

Unlike a traditional x86 server, Phi has a special NUMA node that contains Multi-Channel DRAM (MCDRAM) but no CPUs. Currently libvirt requires a non-empty cpus argument for each NUMA node, such as:

  <numa>
    <cell id='0' cpus='0-239' memory='80' unit='GiB'/>
    <cell id='1' cpus='240-243' memory='16' unit='GiB'/>
  </numa>

In order to support Phi virtualization, libvirt needs to allow a NUMA cell definition without the 'cpus' attribute, such as:

  <numa>
    <cell id='0' cpus='0-239' memory='80' unit='GiB'/>
    <cell id='1' memory='16' unit='GiB'/>
  </numa>

When a cell has no 'cpus', QEMU will by default allocate its memory from MCDRAM instead of DDR.

There are separate concepts at play which your description here is mixing up.

First is the question of whether the guest NUMA node can be created with only RAM or CPUs, or a mix of both.

Second is the question of what kind of host RAM (MCDRAM vs DDR) is used as the backing store for the guest.

The guest NUMA node should be created with memory only (the same as the host's), and more importantly, its memory should be bound to (come from) the host MCDRAM node.
So I suggest libvirt distinguish MCDRAM explicitly. The MCDRAM NUMA config would be as follows, adding a "mcdram" attribute to the "cell" element:

  <numa>
    <cell id='0' cpus='0-239' memory='80' unit='GiB'/>
    <cell id='1' mcdram='16' unit='GiB'/>
  </numa>

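To make the proposed cell syntax concrete, here is a minimal Python sketch that reads a `<numa>` fragment and reports which cells are CPU-less. It is only an illustration: the `mcdram` attribute is the proposal from this thread, the function name `describe_cells` is hypothetical, and the `KiB` fallback for a missing `unit` is an assumption.

```python
import xml.etree.ElementTree as ET

# Example guest NUMA definition. The 'mcdram' attribute is only the
# proposal made in this thread (assumption, not an established schema).
NUMA_XML = """
<numa>
  <cell id='0' cpus='0-239' memory='80' unit='GiB'/>
  <cell id='1' mcdram='16' unit='GiB'/>
</numa>
"""


def describe_cells(xml_text):
    """Return one dict per <cell>, noting CPU-less and MCDRAM cells."""
    cells = []
    for cell in ET.fromstring(xml_text).findall("cell"):
        # Use the proposed 'mcdram' attribute when 'memory' is absent.
        size = cell.get("memory") or cell.get("mcdram")
        cells.append({
            "id": int(cell.get("id")),
            "cpus": cell.get("cpus"),              # None for a CPU-less cell
            "size": int(size),
            "unit": cell.get("unit", "KiB"),       # assumed default unit
            "mcdram": cell.get("mcdram") is not None,
        })
    return cells


if __name__ == "__main__":
    for info in describe_cells(NUMA_XML):
        print(info)
```

Keeping "has no cpus" and "is MCDRAM" as two separate fields mirrors the point raised earlier in the thread that these are independent properties.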
These are separate configuration items which don't need to be conflated in libvirt. i.e. we should be able to create a guest with a node containing only memory, and back that by DDR on the host. Conversely we should be able to create a guest with a node containing memory + cpus and back that by MCDRAM on the host (even if that means the vCPUs will end up on a different host node from its RAM).

On the first point, there still appears to be some brokenness in either QEMU or Linux wrt configuration of virtual NUMA where either cpus or memory are absent from nodes.

e.g. if I launch QEMU with

  -numa node,nodeid=0,cpus=0-3,mem=512
  -numa node,nodeid=1,mem=512
  -numa node,nodeid=2,cpus=4-7
  -numa node,nodeid=3,mem=512
  -numa node,nodeid=4,mem=512
  -numa node,nodeid=5,cpus=8-11
  -numa node,nodeid=6,mem=1024
  -numa node,nodeid=7,cpus=12-15,mem=1024

then the guest reports

  # numactl --hardware
  available: 6 nodes (0,3-7)
  node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
  node 0 size: 487 MB
  node 0 free: 230 MB
  node 3 cpus: 12 13 14 15
  node 3 size: 1006 MB
  node 3 free: 764 MB
  node 4 cpus:
  node 4 size: 503 MB
  node 4 free: 498 MB
  node 5 cpus:
  node 5 size: 503 MB
  node 5 free: 499 MB
  node 6 cpus:
  node 6 size: 503 MB
  node 6 free: 498 MB
  node 7 cpus:
  node 7 size: 943 MB
  node 7 free: 939 MB

so it's pushed all the CPUs from nodes without RAM into the first node, and moved CPUs from the 7th node into the 3rd node.
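As a way to check a guest for this kind of mismatch, here is a minimal Python sketch that parses `numactl --hardware` output into a per-node table so it can be compared with what was requested on the QEMU command line. The sample text is the output quoted above; the helper name `parse_numactl` is hypothetical.

```python
import re

# Guest output quoted in this thread.
NUMACTL_OUTPUT = """\
available: 6 nodes (0,3-7)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
node 0 size: 487 MB
node 0 free: 230 MB
node 3 cpus: 12 13 14 15
node 3 size: 1006 MB
node 3 free: 764 MB
node 4 cpus:
node 4 size: 503 MB
node 4 free: 498 MB
node 5 cpus:
node 5 size: 503 MB
node 5 free: 499 MB
node 6 cpus:
node 6 size: 503 MB
node 6 free: 498 MB
node 7 cpus:
node 7 size: 943 MB
node 7 free: 939 MB
"""


def parse_numactl(text):
    """Map node id -> {'cpus': [...], 'size_mb': int} from numactl output."""
    nodes = {}
    for line in text.splitlines():
        m = re.match(r"node (\d+) (cpus|size):\s*(.*)", line)
        if not m:
            continue  # skip the 'available' and 'free' lines
        node = nodes.setdefault(int(m.group(1)), {"cpus": [], "size_mb": 0})
        if m.group(2) == "cpus":
            node["cpus"] = [int(c) for c in m.group(3).split()]
        else:
            node["size_mb"] = int(m.group(3).split()[0])
    return nodes


if __name__ == "__main__":
    for node_id, info in sorted(parse_numactl(NUMACTL_OUTPUT).items()):
        print(node_id, info)
```

On the output above this shows node 0 holding CPUs 0-11 and nodes 4-7 holding none, which is the discrepancy being described.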
This seems to be a bug. He Chen, do you know how QEMU generates the NUMA nodes for the guest? Can QEMU do a sanity check of the host physical NUMA topology and generate a smart guest NUMA topology?
I am not sure why this happens, but basically, I launch QEMU like:

  -object memory-backend-ram,size=20G,prealloc=yes,host-nodes=0,policy=bind,id=node0 \
  -numa node,nodeid=0,cpus=0-14,cpus=60-74,cpus=120-134,cpus=180-194,memdev=node0 \
  -object memory-backend-ram,size=20G,prealloc=yes,host-nodes=1,policy=bind,id=node1 \
  -numa node,nodeid=1,cpus=15-29,cpus=75-89,cpus=135-149,cpus=195-209,memdev=node1 \
  -object memory-backend-ram,size=20G,prealloc=yes,host-nodes=2,policy=bind,id=node2 \
  -numa node,nodeid=2,cpus=30-44,cpus=90-104,cpus=150-164,cpus=210-224,memdev=node2 \
  -object memory-backend-ram,size=20G,prealloc=yes,host-nodes=3,policy=bind,id=node3 \
  -numa node,nodeid=3,cpus=45-59,cpus=105-119,cpus=165-179,cpus=225-239,memdev=node3 \
  -object memory-backend-ram,size=3G,prealloc=yes,host-nodes=4,policy=bind,id=node4 \
  -numa node,nodeid=4,memdev=node4 \
  -object memory-backend-ram,size=3G,prealloc=yes,host-nodes=5,policy=bind,id=node5 \
  -numa node,nodeid=5,memdev=node5 \
  -object memory-backend-ram,size=3G,prealloc=yes,host-nodes=6,policy=bind,id=node6 \
  -numa node,nodeid=6,memdev=node6 \
  -object memory-backend-ram,size=3G,prealloc=yes,host-nodes=7,policy=bind,id=node7 \
  -numa node,nodeid=7,memdev=node7 \

(Please ignore the complex cpus parameters...)

As you can see, each pair of `-object memory-backend-ram` and `-numa` specifies which host node the memory of a guest NUMA node is allocated from. It works well for me :-)
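The pairing of `-object memory-backend-ram` with `-numa node` can be sketched as a small Python generator. This is only an illustration of the pattern in the command line above: the function name `numa_args` and the input dictionary layout are hypothetical, and only options that already appear in that command line are emitted.

```python
def numa_args(nodes):
    """Build QEMU arguments binding each guest NUMA node to a host node.

    `nodes` is a list of dicts (hypothetical layout) with keys:
      'size'      - backend size string, e.g. '20G'
      'host_node' - host NUMA node to bind the memory to
      'cpus'      - optional list of guest CPU ranges, e.g. ['0-14', '60-74']
    The list index is used as the guest node id.
    """
    args = []
    for guest_id, node in enumerate(nodes):
        backend_id = f"node{guest_id}"
        # Memory backend bound to one host node with policy=bind.
        args += [
            "-object",
            f"memory-backend-ram,size={node['size']},prealloc=yes,"
            f"host-nodes={node['host_node']},policy=bind,id={backend_id}",
        ]
        # Guest node that uses the backend; CPU-less nodes get no cpus=.
        numa = f"node,nodeid={guest_id}"
        for cpu_range in node.get("cpus", []):
            numa += f",cpus={cpu_range}"
        numa += f",memdev={backend_id}"
        args += ["-numa", numa]
    return args


if __name__ == "__main__":
    example = [
        {"size": "20G", "host_node": 0, "cpus": ["0-14", "60-74"]},
        {"size": "3G", "host_node": 4},  # memory-only guest node
    ]
    print(" \\\n  ".join(numa_args(example)))
```

Leaving out `cpus` for an entry yields a memory-only guest node backed by the chosen host node, which is how nodes 4-7 are set up in the command line above.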
When "mcdram" is present in a "cell", we bind it to the physical NUMA node by specifying the "object":

  <numa>
    <cell id='1' mcdram='16' unit='GiB'/>
  </numa>

So before considering MCDRAM / Phi, we need to fix this more basic NUMA topology setup.

Now here I'd like to discuss these questions:

1. This feature is only for Phi at present, but we will check the Phi platform for a CPU-less NUMA node. A NUMA node without CPUs indicates an MCDRAM node.

We should not assume such semantics - it is a concept that is specific to particular Intel x86_64 CPUs. We need to consider that other architectures may have nodes without CPUs that are backed by normal DDR. IOW, we should be explicit about presence of MCDRAM in the host.

Agreed, but for KNL, that is how we detect MCDRAM on the host:

  1. Detect that the CPU family is Xeon Phi X200 (meaning KNL).
  2. Enumerate all NUMA nodes and regard the nodes that contain memory only as MCDRAM nodes.
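The two detection steps above can be sketched in Python against the Linux sysfs NUMA node directories. This is a sketch under stated assumptions: the function names are hypothetical, the KNL check is left to the caller as the `is_knl` flag because the thread does not say how the CPU family is identified, and a CPU-less node is treated as MCDRAM only when that flag is set.

```python
import os
import re


def memory_only_nodes(sysfs_root="/sys/devices/system/node"):
    """Return ids of NUMA nodes that have memory but no CPUs."""
    result = []
    for entry in os.listdir(sysfs_root):
        m = re.fullmatch(r"node(\d+)", entry)
        if not m:
            continue
        node_dir = os.path.join(sysfs_root, entry)
        # 'cpulist' is empty for a CPU-less node.
        with open(os.path.join(node_dir, "cpulist")) as f:
            cpulist = f.read().strip()
        # 'meminfo' has a line like 'Node 1 MemTotal:  16777216 kB'.
        with open(os.path.join(node_dir, "meminfo")) as f:
            mem = re.search(r"MemTotal:\s+(\d+) kB", f.read())
        mem_kb = int(mem.group(1)) if mem else 0
        if not cpulist and mem_kb > 0:
            result.append(int(m.group(1)))
    return sorted(result)


def mcdram_candidates(is_knl, sysfs_root="/sys/devices/system/node"):
    """Step 1: require KNL (caller-supplied). Step 2: memory-only nodes."""
    return memory_only_nodes(sysfs_root) if is_knl else []


if __name__ == "__main__":
    # is_knl=False here because this sketch does not detect the CPU family.
    print(mcdram_candidates(is_knl=False))
```

Gating on `is_knl` keeps the heuristic from labelling CPU-less DDR nodes on other platforms as MCDRAM, which is the concern raised in the reply above.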
When "mcdram" is present in a "cell", we detect the MCDRAM, do some checks, and bind it to the physical NUMA node:

  <numa>
    <cell id='1' mcdram='16' unit='GiB'/>
  </numa>

...

Thanks,
-He
--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list