Shaohe was dropped from the loop, adding him back.

> -----Original Message-----
> From: He Chen [mailto:he.chen@xxxxxxxxxxxxxxx]
> Sent: Friday, December 9, 2016 3:46 PM
> To: Daniel P. Berrange <berrange@xxxxxxxxxx>
> Cc: libvir-list@xxxxxxxxxx; Du, Dolpher <dolpher.du@xxxxxxxxx>; Zyskowski,
> Robert <robert.zyskowski@xxxxxxxxx>; Daniluk, Lukasz
> <lukasz.daniluk@xxxxxxxxx>; Zang, Rui <rui.zang@xxxxxxxxx>;
> jdenemar@xxxxxxxxxx
> Subject: Re: [RFC] phi support in libvirt
>
> > On Mon, Dec 05, 2016 at 04:12:22PM +0000, Feng, Shaohe wrote:
> > > Hi all:
> > >
> > > As we know, Intel® Xeon Phi targets high-performance computing and
> > > other parallel workloads.
> > > Now that QEMU supports Phi virtualization, it is time for libvirt
> > > to support Phi as well.
> >
> > Can you provide a pointer to the relevant QEMU changes?
> >
> Xeon Phi Knights Landing (KNL) has 2 primary hardware features: one is
> up to 288 CPUs, which needs patches to support (we are pushing them);
> the other is Multi-Channel DRAM (MCDRAM), which does not need any
> changes currently.
>
> Let me say more about MCDRAM: MCDRAM is on-package high-bandwidth
> memory (~500GB/s).
>
> On the KNL platform, the hardware exposes MCDRAM to the OS as a
> separate, CPU-less and remote NUMA node, so that MCDRAM is not
> allocated by default (since the MCDRAM node has no CPUs, every CPU
> regards the MCDRAM node as a remote node). In this way, MCDRAM can be
> reserved for certain specific applications.
>
> > > Different from a traditional x86 server, there is a special NUMA
> > > node with Multi-Channel DRAM (MCDRAM) on Phi, but without any CPU.
> > >
> > > Currently libvirt requires a nonempty cpus argument for a NUMA
> > > node, such as:
> > > <numa>
> > >   <cell id='0' cpus='0-239' memory='80' unit='GiB'/>
> > >   <cell id='1' cpus='240-243' memory='16' unit='GiB'/>
> > > </numa>
> > >
> > > In order to support Phi virtualization, libvirt needs to allow a
> > > NUMA cell definition without the 'cpus' attribute, such as:
> > > <numa>
> > >   <cell id='0' cpus='0-239' memory='80' unit='GiB'/>
> > >   <cell id='1' memory='16' unit='GiB'/>
> > > </numa>
> > >
> > > For a cell without 'cpus', QEMU will by default allocate the memory
> > > from MCDRAM instead of DDR.
> >
> > There are separate concepts at play which your description here is
> > mixing up.
> >
> > First is the question of whether the guest NUMA node can be created
> > with only RAM or CPUs, or a mix of both.
> >
> > Second is the question of what kind of host RAM (MCDRAM vs DDR) is
> > used as the backing store for the guest.
> >
> The guest NUMA node should be created with memory only (keeping the
> same layout as the host's), and more importantly, that memory should
> be bound to (i.e. come from) the host MCDRAM node.
>
> > These are separate configuration items which don't need to be
> > conflated in libvirt. ie we should be able to create a guest with a
> > node containing only memory, and back that by DDR on the host.
> > Conversely, we should be able to create a guest with a node
> > containing memory + cpus and back that by MCDRAM on the host (even
> > if that means the vCPUs will end up on a different host node from
> > its RAM).
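
(As an aside, here is a rough sketch of how those two items could be kept
separate in the domain XML, combining the CPU-less <cell> proposed above
with the existing <numatune>/<memnode> host binding. The sizes and nodeset
values are made up for illustration, with host node 4 standing in for an
MCDRAM node; this is not a tested configuration.)

  <cpu>
    <numa>
      <cell id='0' cpus='0-239' memory='80' unit='GiB'/>
      <cell id='1' memory='16' unit='GiB'/>   <!-- memory-only guest node -->
    </numa>
  </cpu>

  <numatune>
    <!-- guest cell 0 is backed by ordinary DDR host nodes 0-3 -->
    <memory mode='strict' nodeset='0-3'/>
    <!-- guest cell 1 is backed by host node 4, e.g. an MCDRAM node -->
    <memnode cellid='1' mode='strict' nodeset='4'/>
  </numatune>

Whether vCPUs are placed in that cell or not would then remain an
independent choice.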

> > On the first point, there still appears to be some brokenness in
> > either QEMU or Linux wrt configuration of virtual NUMA where either
> > cpus or memory are absent from nodes.
> >
> > eg if I launch QEMU with
> >
> >   -numa node,nodeid=0,cpus=0-3,mem=512
> >   -numa node,nodeid=1,mem=512
> >   -numa node,nodeid=2,cpus=4-7
> >   -numa node,nodeid=3,mem=512
> >   -numa node,nodeid=4,mem=512
> >   -numa node,nodeid=5,cpus=8-11
> >   -numa node,nodeid=6,mem=1024
> >   -numa node,nodeid=7,cpus=12-15,mem=1024
> >
> > then the guest reports
> >
> >   # numactl --hardware
> >   available: 6 nodes (0,3-7)
> >   node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
> >   node 0 size: 487 MB
> >   node 0 free: 230 MB
> >   node 3 cpus: 12 13 14 15
> >   node 3 size: 1006 MB
> >   node 3 free: 764 MB
> >   node 4 cpus:
> >   node 4 size: 503 MB
> >   node 4 free: 498 MB
> >   node 5 cpus:
> >   node 5 size: 503 MB
> >   node 5 free: 499 MB
> >   node 6 cpus:
> >   node 6 size: 503 MB
> >   node 6 free: 498 MB
> >   node 7 cpus:
> >   node 7 size: 943 MB
> >   node 7 free: 939 MB
> >
> > so it has pushed all the CPUs from nodes without RAM into the first
> > node, and moved the CPUs from the 7th node into the 3rd node.
> >
> I am not sure why this happens, but basically, I launch QEMU like:
>
>   -object memory-backend-ram,size=20G,prealloc=yes,host-nodes=0,policy=bind,id=node0 \
>   -numa node,nodeid=0,cpus=0-14,cpus=60-74,cpus=120-134,cpus=180-194,memdev=node0 \
>
>   -object memory-backend-ram,size=20G,prealloc=yes,host-nodes=1,policy=bind,id=node1 \
>   -numa node,nodeid=1,cpus=15-29,cpus=75-89,cpus=135-149,cpus=195-209,memdev=node1 \
>
>   -object memory-backend-ram,size=20G,prealloc=yes,host-nodes=2,policy=bind,id=node2 \
>   -numa node,nodeid=2,cpus=30-44,cpus=90-104,cpus=150-164,cpus=210-224,memdev=node2 \
>
>   -object memory-backend-ram,size=20G,prealloc=yes,host-nodes=3,policy=bind,id=node3 \
>   -numa node,nodeid=3,cpus=45-59,cpus=105-119,cpus=165-179,cpus=225-239,memdev=node3 \
>
>   -object memory-backend-ram,size=3G,prealloc=yes,host-nodes=4,policy=bind,id=node4 \
>   -numa node,nodeid=4,memdev=node4 \
>
>   -object memory-backend-ram,size=3G,prealloc=yes,host-nodes=5,policy=bind,id=node5 \
>   -numa node,nodeid=5,memdev=node5 \
>
>   -object memory-backend-ram,size=3G,prealloc=yes,host-nodes=6,policy=bind,id=node6 \
>   -numa node,nodeid=6,memdev=node6 \
>
>   -object memory-backend-ram,size=3G,prealloc=yes,host-nodes=7,policy=bind,id=node7 \
>   -numa node,nodeid=7,memdev=node7 \
>
> (Please ignore the complex cpus parameters...)
> As you can see, each pair of `-object memory-backend-ram` and `-numa`
> is used to specify which host node the memory of a guest NUMA node is
> allocated from. It works well for me :-)
>
> > So before considering MCDRAM / Phi, we need to fix this more basic
> > NUMA topology setup.
> >
> > > Now here I'd like to discuss these questions:
> > > 1. This feature is only for Phi at present, but we will check the
> > >    Phi platform for CPU-less NUMA nodes. A NUMA node without CPUs
> > >    indicates an MCDRAM node.
> >
> > We should not assume such semantics - it is a concept that is
> > specific to particular Intel x86_64 CPUs. We need to consider that
> > other architectures may have nodes without CPUs that are backed by
> > normal DDR.
> >
> > IOW, we should be explicit about the presence of MCDRAM in the host.
> >
> Agreed, but for KNL, this is how we detect MCDRAM on the host:
> 1. detect that the CPU family is Xeon Phi x200 (which means KNL)
> 2. enumerate all NUMA nodes and regard the nodes that contain only
>    memory as MCDRAM nodes.
>
> ...
>
> Thanks,
> -He
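
For illustration, a minimal shell sketch of that two-step detection
heuristic, assuming the standard Linux sysfs NUMA layout (an empty
/sys/devices/system/node/nodeN/cpulist marks a CPU-less node) and a simple
model-name match in /proc/cpuinfo; the 'Xeon Phi' match string is only a
stand-in for a proper family/model check, and the script is a sketch
rather than the actual detection code:

  #!/bin/sh
  # Step 1: only treat CPU-less nodes as MCDRAM on a Xeon Phi (KNL) host.
  # The model-name match below is a placeholder for a real family/model check.
  if ! grep -q 'Xeon Phi' /proc/cpuinfo; then
      echo "not a Xeon Phi host, not assuming MCDRAM" >&2
      exit 1
  fi

  # Step 2: a node whose cpulist lists no CPU ids is CPU-less; on KNL such
  # nodes are the MCDRAM candidates (a fuller check would also confirm the
  # node actually has memory via $node/meminfo).
  for node in /sys/devices/system/node/node[0-9]*; do
      if ! grep -q '[0-9]' "$node/cpulist"; then
          echo "MCDRAM node candidate: ${node##*/}"
      fi
  done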