Re: [RFC] phi support in libvirt

He Chen <he.chen@xxxxxxxxxxxxxxx> · Fri, 9 Dec 2016 15:45:32 +0800

> On Mon, Dec 05, 2016 at 04:12:22PM +0000, Feng, Shaohe wrote:
> > Hi all:
> > 
> > As we know, Intel® Xeon Phi targets high-performance computing and 
> > other parallel workloads.
> > Now that QEMU supports Phi virtualization, it is time for libvirt to 
> > support Phi.
> 
> Can you provide a pointer to the relevant QEMU changes?
> 
Xeon Phi Knights Landing (KNL) has two primary hardware features. One is
support for up to 288 CPUs, which needs patches that we are currently
pushing. The other is Multi-Channel DRAM (MCDRAM), which does not need
any changes at present.

Let me say more about MCDRAM: it is on-package high-bandwidth memory
(~500 GB/s).

On the KNL platform, the hardware exposes MCDRAM to the OS as a separate,
CPU-less NUMA node, so memory is not allocated from MCDRAM by default
(since the MCDRAM node has no CPUs, every CPU regards it as a remote
node). In this way, MCDRAM can be reserved for specific applications.

> > Unlike a traditional x86 server, Phi has a special NUMA 
> > node with Multi-Channel DRAM (MCDRAM) but without any CPUs.
> > 
> > Currently libvirt requires a nonempty cpus attribute for each NUMA node, such as:
> > <numa>
> >   <cell id='0' cpus='0-239' memory='80' unit='GiB'/>
> >   <cell id='1' cpus='240-243' memory='16' unit='GiB'/>
> > </numa>
> > 
> > In order to support Phi virtualization, libvirt needs to allow a NUMA 
> > cell definition without the 'cpus' attribute.
> >
> > Such as:
> > <numa>
> >   <cell id='0' cpus='0-239' memory='80' unit='GiB'/>
> >   <cell id='1' memory='16' unit='GiB'/>
> > </numa>
> >
> > When a cell has no 'cpus' attribute, QEMU will by default allocate its memory from MCDRAM instead of DDR.
> 
> There's separate concepts at play which your description here is mixing up.
> 
> First is the question of whether the guest NUMA node can be created with only RAM or CPUs, or a mix of both.
> 
> Second is the question of what kind of host RAM (MCDRAM vs DDR) is used as the backing store for the guest
> 

The guest NUMA node should be created with memory only (the same as on
the host), and more importantly, its memory should be bound to (come
from) the host MCDRAM node.
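
To make that concrete, here is a rough Python sketch that builds the guest
XML I have in mind: a `<numa>` block where the MCDRAM cell has no `cpus`
attribute, plus a `<numatune>` block binding that cell to a host node. The
cell without `cpus` is the proposal under discussion, not something libvirt
accepts today. Using `<memnode>` with `mode='strict'` for the binding is my
assumption about how it would be expressed, and the helper name
`guest_numa_xml` and its input format are made up for illustration.

```python
import xml.etree.ElementTree as ET


def guest_numa_xml(cells, bindings):
    """Build <numa> and <numatune> fragments as strings.

    cells:    list of dicts with 'id', 'memory' (GiB) and optional 'cpus'.
    bindings: dict mapping guest cell id -> host nodeset string.
    Both input shapes are hypothetical, chosen for this sketch only.
    """
    numa = ET.Element("numa")
    for cell in cells:
        attrs = {"id": str(cell["id"])}
        # A memory-only cell simply omits the cpus attribute (the proposal).
        if cell.get("cpus"):
            attrs["cpus"] = cell["cpus"]
        attrs["memory"] = str(cell["memory"])
        attrs["unit"] = "GiB"
        ET.SubElement(numa, "cell", attrs)

    numatune = ET.Element("numatune")
    for cellid, nodeset in sorted(bindings.items()):
        # Assumption: strict per-cell binding to the host MCDRAM node.
        ET.SubElement(
            numatune,
            "memnode",
            {"cellid": str(cellid), "mode": "strict", "nodeset": nodeset},
        )

    return (
        ET.tostring(numa, encoding="unicode"),
        ET.tostring(numatune, encoding="unicode"),
    )


if __name__ == "__main__":
    numa_xml, numatune_xml = guest_numa_xml(
        [
            {"id": 0, "cpus": "0-239", "memory": 80},
            {"id": 1, "memory": 16},  # memory-only cell backed by MCDRAM
        ],
        {1: "1"},  # guest cell 1 -> host node 1 (assumed to be MCDRAM)
    )
    print(numa_xml)
    print(numatune_xml)
```

Keeping the cell definition and the binding in separate elements also keeps
the two questions (guest topology vs. host backing memory) apart.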

> These are separate configuration items which don't need to be conflated in libvirt.  ie we should be able to create a guest with a node containing only memory, and back that by DDR on the host. Conversely we should be able to create a guest with a node containing memory + cpus and back that by MCDRAM on the host (even if that means the vCPUs will end up on a different host node from its RAM)
> 
> On the first point, there still appears to be some brokenness in either QEMU or Linux wrt configuration of virtual NUMA where either cpus or memory are absent from nodes.
> 
> eg if I launch QEMU with
> 
>     -numa node,nodeid=0,cpus=0-3,mem=512
>     -numa node,nodeid=1,mem=512
>     -numa node,nodeid=2,cpus=4-7
>     -numa node,nodeid=3,mem=512
>     -numa node,nodeid=4,mem=512
>     -numa node,nodeid=5,cpus=8-11
>     -numa node,nodeid=6,mem=1024
>     -numa node,nodeid=7,cpus=12-15,mem=1024
> 
> then the guest reports
> 
>   # numactl --hardware
>   available: 6 nodes (0,3-7)
>   node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
>   node 0 size: 487 MB
>   node 0 free: 230 MB
>   node 3 cpus: 12 13 14 15
>   node 3 size: 1006 MB
>   node 3 free: 764 MB
>   node 4 cpus:
>   node 4 size: 503 MB
>   node 4 free: 498 MB
>   node 5 cpus:
>   node 5 size: 503 MB
>   node 5 free: 499 MB
>   node 6 cpus:
>   node 6 size: 503 MB
>   node 6 free: 498 MB
>   node 7 cpus:
>   node 7 size: 943 MB
>   node 7 free: 939 MB
> 
> so it's pushed all the CPUs from nodes without RAM into the first node, and moved CPUs from the 7th node into the 3rd node.
> 

I am not sure why this happens, but basically, I launch QEMU like this:

-object memory-backend-ram,size=20G,prealloc=yes,host-nodes=0,policy=bind,id=node0 \
-numa node,nodeid=0,cpus=0-14,cpus=60-74,cpus=120-134,cpus=180-194,memdev=node0 \

-object memory-backend-ram,size=20G,prealloc=yes,host-nodes=1,policy=bind,id=node1 \
-numa node,nodeid=1,cpus=15-29,cpus=75-89,cpus=135-149,cpus=195-209,memdev=node1 \

-object memory-backend-ram,size=20G,prealloc=yes,host-nodes=2,policy=bind,id=node2 \
-numa node,nodeid=2,cpus=30-44,cpus=90-104,cpus=150-164,cpus=210-224,memdev=node2 \

-object memory-backend-ram,size=20G,prealloc=yes,host-nodes=3,policy=bind,id=node3 \
-numa node,nodeid=3,cpus=45-59,cpus=105-119,cpus=165-179,cpus=225-239,memdev=node3 \

-object memory-backend-ram,size=3G,prealloc=yes,host-nodes=4,policy=bind,id=node4 \
-numa node,nodeid=4,memdev=node4 \

-object memory-backend-ram,size=3G,prealloc=yes,host-nodes=5,policy=bind,id=node5 \
-numa node,nodeid=5,memdev=node5 \

-object memory-backend-ram,size=3G,prealloc=yes,host-nodes=6,policy=bind,id=node6 \
-numa node,nodeid=6,memdev=node6 \

-object memory-backend-ram,size=3G,prealloc=yes,host-nodes=7,policy=bind,id=node7 \
-numa node,nodeid=7,memdev=node7 \

(Please ignore the complex cpus parameters...)
As you can see, the pair of `-object memory-backend-ram` and `-numa` is
used to specify where the memory of the guest NUMA node is allocated
from. It works well for me :-)
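
For reference, the pairing can be sketched as a small Python generator that
emits one `-object memory-backend-ram` / `-numa node` pair per guest node.
The option strings mirror the command line above. The `GuestNode` class and
`qemu_numa_args` function are names I made up for this illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GuestNode:
    """One guest NUMA node and the host node that backs its memory."""
    nodeid: int
    size: str        # passed through to QEMU unchanged, e.g. "20G"
    host_node: int   # host NUMA node the memory is bound to
    # CPU ranges such as ["0-14", "60-74"]; empty means a memory-only node.
    cpus: List[str] = field(default_factory=list)


def qemu_numa_args(nodes: List[GuestNode]) -> List[str]:
    """Return the -object/-numa argument pairs for the given guest nodes."""
    args: List[str] = []
    for node in nodes:
        backend_id = f"node{node.nodeid}"
        # The memory backend decides which host node the RAM comes from.
        args += [
            "-object",
            f"memory-backend-ram,size={node.size},prealloc=yes,"
            f"host-nodes={node.host_node},policy=bind,id={backend_id}",
        ]
        # The guest node refers to that backend through memdev=.
        numa = [f"node,nodeid={node.nodeid}"]
        numa += [f"cpus={r}" for r in node.cpus]
        numa.append(f"memdev={backend_id}")
        args += ["-numa", ",".join(numa)]
    return args


if __name__ == "__main__":
    example = [
        GuestNode(0, "20G", 0, ["0-14", "60-74", "120-134", "180-194"]),
        GuestNode(4, "3G", 4),  # memory-only node bound to host node 4
    ]
    for arg in qemu_numa_args(example):
        print(arg)
```

A node with an empty `cpus` list produces `-numa node,nodeid=4,memdev=node4`,
which is the memory-only form used for nodes 4-7 above.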

> So before considering MCDRAM / Phi, we need to fix this more basic NUMA topology setup.
> 
> > Now here I'd like to discuss these questions:
> > 1. This feature is only for Phi at present, but we
> >    will check Phi platform for CPU-less NUMA node.
> >    The NUMA node without CPU indicates MCDRAM node.
> 
> We should not assume such semantics - it is a concept that is specific to particular Intel x86_64 CPUs. We need to consider that other architectures may have nodes without CPUs that are backed by normal DDR.
> IOW, we should be explicit about presence of MCDRAM in the host.
> 
Agreed, but for KNL, this is how we detect MCDRAM on the host:
1. Detect that the CPU family is Xeon Phi X200 (which means KNL).
2. Enumerate all NUMA nodes and regard the nodes that contain only
memory (no CPUs) as MCDRAM nodes.
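
Step 2 can be sketched in Python against the Linux sysfs NUMA layout
(`/sys/devices/system/node/nodeN/cpulist` and `meminfo`). This is only an
illustration, not how libvirt does or would do it. The function name
`memory_only_nodes` is made up, and the CPU family check from step 1 is left
to the caller.

```python
import os
import re

SYSFS_NODE_DIR = "/sys/devices/system/node"


def memory_only_nodes(node_dir=SYSFS_NODE_DIR):
    """Return the sorted ids of NUMA nodes that have memory but no CPUs."""
    result = []
    for entry in os.listdir(node_dir):
        match = re.fullmatch(r"node(\d+)", entry)
        if not match:
            continue  # skip files such as "online" or "possible"
        path = os.path.join(node_dir, entry)

        # An empty cpulist means no CPUs belong to this node.
        with open(os.path.join(path, "cpulist")) as f:
            has_cpus = f.read().strip() != ""

        # The relevant line looks like: "Node 1 MemTotal:  16777216 kB"
        mem_kb = 0
        with open(os.path.join(path, "meminfo")) as f:
            for line in f:
                fields = line.split()
                if len(fields) >= 4 and fields[2] == "MemTotal:":
                    mem_kb = int(fields[3])
                    break

        if mem_kb > 0 and not has_cpus:
            result.append(int(match.group(1)))
    return sorted(result)


if __name__ == "__main__":
    # On a KNL host (after step 1 has passed) these would be treated as
    # MCDRAM nodes; on other hosts they are just CPU-less nodes.
    print(memory_only_nodes())
```

On a non-KNL host the same result only says "CPU-less node with memory",
which is why step 1 has to come first.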

...

Thanks,
-He
