Shaohe was dropped from the loop, adding him back.

> -----Original Message-----
> From: He Chen [mailto:he.chen@xxxxxxxxxxxxxxx]
> Sent: Friday, December 9, 2016 3:46 PM
> To: Daniel P. Berrange <berrange@xxxxxxxxxx>
> Cc: libvir-list@xxxxxxxxxx; Du, Dolpher <dolpher.du@xxxxxxxxx>; Zyskowski,
> Robert <robert.zyskowski@xxxxxxxxx>; Daniluk, Lukasz
> <lukasz.daniluk@xxxxxxxxx>; Zang, Rui <rui.zang@xxxxxxxxx>;
> jdenemar@xxxxxxxxxx
> Subject: Re: [RFC] phi support in libvirt
>
> > On Mon, Dec 05, 2016 at 04:12:22PM +0000, Feng, Shaohe wrote:
> > > Hi all:
> > >
> > > As we know, Intel® Xeon Phi targets high-performance computing and
> > > other parallel workloads.
> > > Now that QEMU supports Phi virtualization, it is time for libvirt
> > > to support Phi as well.
> >
> > Can you provide a pointer to the relevant QEMU changes?
> >
> Xeon Phi Knights Landing (KNL) has 2 primary hardware features: one is
> up to 288 CPUs, which needs patches to support (we are pushing them);
> the other is Multi-Channel DRAM (MCDRAM), which does not need any
> changes currently.
>
> Let me say more about MCDRAM: MCDRAM is on-package high-bandwidth
> memory (~500GB/s).
>
> On the KNL platform, the hardware exposes MCDRAM to the OS as a
> separate, CPU-less and remote NUMA node, so that MCDRAM is not
> allocated by default (since the MCDRAM node has no CPUs, every CPU
> regards the MCDRAM node as a remote node). In this way, MCDRAM can be
> reserved for certain specific applications.
>
> > > Different from a traditional x86 server, there is a special NUMA
> > > node with Multi-Channel DRAM (MCDRAM) on Phi, but without any CPU.
> > >
> > > Currently libvirt requires a nonempty cpus argument for a NUMA
> > > node, such as:
> > > <numa>
> > >   <cell id='0' cpus='0-239' memory='80' unit='GiB'/>
> > >   <cell id='1' cpus='240-243' memory='16' unit='GiB'/>
> > > </numa>
> > >
> > > In order to support Phi virtualization, libvirt needs to allow a
> > > NUMA cell definition without the 'cpus' attribute, such as:
> > > <numa>
> > >   <cell id='0' cpus='0-239' memory='80' unit='GiB'/>
> > >   <cell id='1' memory='16' unit='GiB'/>
> > > </numa>
> > >
> > > For a cell without 'cpus', QEMU will by default allocate the memory
> > > from MCDRAM instead of DDR.
> >
> > There are separate concepts at play which your description here is
> > mixing up.
> >
> > First is the question of whether the guest NUMA node can be created
> > with only RAM or CPUs, or a mix of both.
> >
> > Second is the question of what kind of host RAM (MCDRAM vs DDR) is
> > used as the backing store for the guest.
> >
> The guest NUMA node should be created with memory only (keeping the
> same layout as the host's), and more importantly, that memory should
> be bound to (i.e. come from) the host MCDRAM node.
>
> > These are separate configuration items which don't need to be
> > conflated in libvirt. ie we should be able to create a guest with a
> > node containing only memory, and back that by DDR on the host.
> > Conversely, we should be able to create a guest with a node
> > containing memory + cpus and back that by MCDRAM on the host (even
> > if that means the vCPUs will end up on a different host node from
> > its RAM).
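
(As an aside, here is a rough sketch of how those two items could be kept
separate in the domain XML, combining the CPU-less <cell> proposed above
with the existing <numatune>/<memnode> host binding. The sizes and nodeset
values are made up for illustration, with host node 4 standing in for an
MCDRAM node; this is not a tested configuration.)

  <cpu>
    <numa>
      <cell id='0' cpus='0-239' memory='80' unit='GiB'/>
      <cell id='1' memory='16' unit='GiB'/>   <!-- memory-only guest node -->
    </numa>
  </cpu>

  <numatune>
    <!-- guest cell 0 is backed by ordinary DDR host nodes 0-3 -->
    <memory mode='strict' nodeset='0-3'/>
    <!-- guest cell 1 is backed by host node 4, e.g. an MCDRAM node -->
    <memnode cellid='1' mode='strict' nodeset='4'/>
  </numatune>

Whether vCPUs are placed in that cell or not would then remain an
independent choice.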

> > On the first point, there still appears to be some brokenness in
> > either QEMU or Linux wrt configuration of virtual NUMA where either
> > cpus or memory are absent from nodes.
> >
> > eg if I launch QEMU with
> >
> >   -numa node,nodeid=0,cpus=0-3,mem=512
> >   -numa node,nodeid=1,mem=512
> >   -numa node,nodeid=2,cpus=4-7
> >   -numa node,nodeid=3,mem=512
> >   -numa node,nodeid=4,mem=512
> >   -numa node,nodeid=5,cpus=8-11
> >   -numa node,nodeid=6,mem=1024
> >   -numa node,nodeid=7,cpus=12-15,mem=1024
> >
> > then the guest reports
> >
> >   # numactl --hardware
> >   available: 6 nodes (0,3-7)
> >   node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
> >   node 0 size: 487 MB
> >   node 0 free: 230 MB
> >   node 3 cpus: 12 13 14 15
> >   node 3 size: 1006 MB
> >   node 3 free: 764 MB
> >   node 4 cpus:
> >   node 4 size: 503 MB
> >   node 4 free: 498 MB
> >   node 5 cpus:
> >   node 5 size: 503 MB
> >   node 5 free: 499 MB
> >   node 6 cpus:
> >   node 6 size: 503 MB
> >   node 6 free: 498 MB
> >   node 7 cpus:
> >   node 7 size: 943 MB
> >   node 7 free: 939 MB
> >
> > so it has pushed all the CPUs from nodes without RAM into the first
> > node, and moved the CPUs from the 7th node into the 3rd node.
> >
> I am not sure why this happens, but basically, I launch QEMU like:
>
>   -object memory-backend-ram,size=20G,prealloc=yes,host-nodes=0,policy=bind,id=node0 \
>   -numa node,nodeid=0,cpus=0-14,cpus=60-74,cpus=120-134,cpus=180-194,memdev=node0 \
>
>   -object memory-backend-ram,size=20G,prealloc=yes,host-nodes=1,policy=bind,id=node1 \
>   -numa node,nodeid=1,cpus=15-29,cpus=75-89,cpus=135-149,cpus=195-209,memdev=node1 \
>
>   -object memory-backend-ram,size=20G,prealloc=yes,host-nodes=2,policy=bind,id=node2 \
>   -numa node,nodeid=2,cpus=30-44,cpus=90-104,cpus=150-164,cpus=210-224,memdev=node2 \
>
>   -object memory-backend-ram,size=20G,prealloc=yes,host-nodes=3,policy=bind,id=node3 \
>   -numa node,nodeid=3,cpus=45-59,cpus=105-119,cpus=165-179,cpus=225-239,memdev=node3 \
>
>   -object memory-backend-ram,size=3G,prealloc=yes,host-nodes=4,policy=bind,id=node4 \
>   -numa node,nodeid=4,memdev=node4 \
>
>   -object memory-backend-ram,size=3G,prealloc=yes,host-nodes=5,policy=bind,id=node5 \
>   -numa node,nodeid=5,memdev=node5 \
>
>   -object memory-backend-ram,size=3G,prealloc=yes,host-nodes=6,policy=bind,id=node6 \
>   -numa node,nodeid=6,memdev=node6 \
>
>   -object memory-backend-ram,size=3G,prealloc=yes,host-nodes=7,policy=bind,id=node7 \
>   -numa node,nodeid=7,memdev=node7 \
>
> (Please ignore the complex cpus parameters...)
> As you can see, each pair of `-object memory-backend-ram` and `-numa`
> is used to specify which host node the memory of a guest NUMA node is
> allocated from. It works well for me :-)
>
> > So before considering MCDRAM / Phi, we need to fix this more basic
> > NUMA topology setup.
> >
> > > Now here I'd like to discuss these questions:
> > > 1. This feature is only for Phi at present, but we will check the
> > >    Phi platform for CPU-less NUMA nodes. A NUMA node without CPUs
> > >    indicates an MCDRAM node.
> >
> > We should not assume such semantics - it is a concept that is
> > specific to particular Intel x86_64 CPUs. We need to consider that
> > other architectures may have nodes without CPUs that are backed by
> > normal DDR.
> >
> > IOW, we should be explicit about the presence of MCDRAM in the host.
> >
> Agreed, but for KNL, this is how we detect MCDRAM on the host:
> 1. detect that the CPU family is Xeon Phi x200 (which means KNL)
> 2. enumerate all NUMA nodes and regard the nodes that contain only
>    memory as MCDRAM nodes.
>
> ...
>
> Thanks,
> -He
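
For illustration, a minimal shell sketch of that two-step detection
heuristic, assuming the standard Linux sysfs NUMA layout (an empty
/sys/devices/system/node/nodeN/cpulist marks a CPU-less node) and a simple
model-name match in /proc/cpuinfo; the 'Xeon Phi' match string is only a
stand-in for a proper family/model check, and the script is a sketch
rather than the actual detection code:

  #!/bin/sh
  # Step 1: only treat CPU-less nodes as MCDRAM on a Xeon Phi (KNL) host.
  # The model-name match below is a placeholder for a real family/model check.
  if ! grep -q 'Xeon Phi' /proc/cpuinfo; then
      echo "not a Xeon Phi host, not assuming MCDRAM" >&2
      exit 1
  fi

  # Step 2: a node whose cpulist lists no CPU ids is CPU-less; on KNL such
  # nodes are the MCDRAM candidates (a fuller check would also confirm the
  # node actually has memory via $node/meminfo).
  for node in /sys/devices/system/node/node[0-9]*; do
      if ! grep -q '[0-9]' "$node/cpulist"; then
          echo "MCDRAM node candidate: ${node##*/}"
      fi
  done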