Re: [RFC] phi support in libvirt

"Feng, Shaohe" <shaohe.feng@xxxxxxxxx> · Wed, 21 Dec 2016 12:51:29 +0800

Thanks, Dolpher.

Reply inline.

On 2016-12-21 11:56, Du, Dolpher wrote:
Shaohe was dropped from the loop, adding him back.

-----Original Message-----
From: He Chen [mailto:he.chen@xxxxxxxxxxxxxxx]
Sent: Friday, December 9, 2016 3:46 PM
To: Daniel P. Berrange <berrange@xxxxxxxxxx>
Cc: libvir-list@xxxxxxxxxx; Du, Dolpher <dolpher.du@xxxxxxxxx>; Zyskowski,
Robert <robert.zyskowski@xxxxxxxxx>; Daniluk, Lukasz
<lukasz.daniluk@xxxxxxxxx>; Zang, Rui <rui.zang@xxxxxxxxx>;
jdenemar@xxxxxxxxxx
Subject: Re:  [RFC] phi support in libvirt

On Mon, Dec 05, 2016 at 04:12:22PM +0000, Feng, Shaohe wrote:
Hi all:

As we all know, Intel® Xeon Phi targets high-performance computing and
other parallel workloads.
Now that QEMU supports Phi virtualization, it is time for libvirt to
support Phi.
Can you provide a pointer to the relevant QEMU changes?

Xeon Phi Knights Landing (KNL) has 2 primary hardware features. One
is up to 288 CPUs, which needs patches to support, and we are pushing
those. The other is Multi-Channel DRAM (MCDRAM), which does not need any
changes currently.

Let me introduce MCDRAM in more detail. MCDRAM is on-package
high-bandwidth memory (~500GB/s).

On the KNL platform, the hardware exposes MCDRAM to the OS as a separate,
CPU-less, remote NUMA node, so that memory is not allocated from MCDRAM
by default (since the MCDRAM node has no CPU, every CPU regards it as a
remote node). In this way, MCDRAM can be reserved for specific
applications.

Unlike a traditional x86 server, Phi has a special NUMA node that
contains Multi-Channel DRAM (MCDRAM) but no CPUs.

Currently libvirt requires a non-empty cpus attribute for each NUMA cell,
for example:
<numa>
   <cell id='0' cpus='0-239' memory='80' unit='GiB'/>
   <cell id='1' cpus='240-243' memory='16' unit='GiB'/>
</numa>

In order to support Phi virtualization, libvirt needs to allow a NUMA
cell definition without the 'cpus' attribute.

Such as:
<numa>
   <cell id='0' cpus='0-239' memory='80' unit='GiB'/>
   <cell id='1' memory='16' unit='GiB'/>
</numa>
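
To illustrate what accepting such a definition involves, here is a minimal
Python sketch (not libvirt code) that parses a <numa> element and treats a
missing 'cpus' attribute as an empty CPU set. The helper names parse_cpulist
and numa_cells are hypothetical and exist only for this example.

```python
import xml.etree.ElementTree as ET


def parse_cpulist(spec):
    """Expand a cpulist such as '0-3,8' into a set of ints.

    An empty string or None (attribute absent) gives an empty set.
    """
    cpus = set()
    if not spec:
        return cpus
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def numa_cells(xml_text):
    """Return {cell id: (set of cpus, memory, unit)} for a <numa> element."""
    root = ET.fromstring(xml_text.strip())
    cells = {}
    for cell in root.findall("cell"):
        cells[int(cell.get("id"))] = (
            parse_cpulist(cell.get("cpus")),  # missing attribute -> empty set
            int(cell.get("memory")),
            cell.get("unit"),
        )
    return cells


EXAMPLE = """
<numa>
   <cell id='0' cpus='0-239' memory='80' unit='GiB'/>
   <cell id='1' memory='16' unit='GiB'/>
</numa>
"""

cells = numa_cells(EXAMPLE)
# Cells whose CPU set is empty are the memory-only cells.
memory_only = [cid for cid, (cpus, _, _) in cells.items() if not cpus]
print(memory_only)
```

With the example above, only cell 1 ends up in memory_only.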

When a cell has no 'cpus' attribute, QEMU will by default allocate its
memory from MCDRAM instead of DDR.
There are two separate concepts at play which your description here is mixing up.

First is the question of whether the guest NUMA node can be created with
only RAM or CPUs, or a mix of both.
Second is the question of what kind of host RAM (MCDRAM vs DDR) is used
as the backing store for the guest.
The guest NUMA node should be created with memory only (the same as on
the host), and more importantly, that memory should be bound to (come
from) the host MCDRAM node.
So I suggest libvirt distinguish MCDRAM explicitly.

The MCDRAM NUMA config would be as follows, adding a "mcdram" attribute
to the "cell" element:
<numa>
  <cell id='1' mcdram='16' unit='GiB'/>
  <cell id='0' cpus='0-239' memory='80' unit='GiB'/>
</numa>

These are separate configuration items which don't need to be conflated in
libvirt, i.e. we should be able to create a guest with a node containing only
memory, and back that by DDR on the host. Conversely we should be able to
create a guest with a node containing memory + cpus and back that by
MCDRAM on the host (even if that means the vCPUs will end up on a different
host node from its RAM).
On the first point, there still appears to be some brokenness in either QEMU or
Linux wrt configuration of virtual NUMA where either cpus or memory are
absent from nodes.
E.g. if I launch QEMU with

     -numa node,nodeid=0,cpus=0-3,mem=512
     -numa node,nodeid=1,mem=512
     -numa node,nodeid=2,cpus=4-7
     -numa node,nodeid=3,mem=512
     -numa node,nodeid=4,mem=512
     -numa node,nodeid=5,cpus=8-11
     -numa node,nodeid=6,mem=1024
     -numa node,nodeid=7,cpus=12-15,mem=1024

then the guest reports

   # numactl --hardware
   available: 6 nodes (0,3-7)
   node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
   node 0 size: 487 MB
   node 0 free: 230 MB
   node 3 cpus: 12 13 14 15
   node 3 size: 1006 MB
   node 3 free: 764 MB
   node 4 cpus:
   node 4 size: 503 MB
   node 4 free: 498 MB
   node 5 cpus:
   node 5 size: 503 MB
   node 5 free: 499 MB
   node 6 cpus:
   node 6 size: 503 MB
   node 6 free: 498 MB
   node 7 cpus:
   node 7 size: 943 MB
   node 7 free: 939 MB

so it's pushed all the CPUs from nodes without RAM into the first node, and
moved CPUs from the 7th node into the 3rd node.
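
The mismatch can be checked mechanically. The following Python sketch
compares the CPU placement requested on the QEMU command line above with
the "node N cpus:" lines of the guest's numactl output. It compares node
ids literally, so it assumes the guest is expected to keep the ids given
to QEMU; the function names are hypothetical.

```python
import re


def parse_numactl_cpus(text):
    """Map node id -> list of CPU ids from `numactl --hardware` output."""
    nodes = {}
    for line in text.splitlines():
        m = re.match(r"\s*node (\d+) cpus:(.*)$", line)
        if m:
            nodes[int(m.group(1))] = [int(c) for c in m.group(2).split()]
    return nodes


def misplaced_cpus(requested, reported):
    """Return {cpu: (requested node, reported node)} for CPUs that moved."""
    where = {cpu: node for node, cpus in reported.items() for cpu in cpus}
    return {
        cpu: (node, where.get(cpu))
        for node, cpus in requested.items()
        for cpu in cpus
        if where.get(cpu) != node
    }


# CPU placement taken from the -numa options in the example above.
REQUESTED = {0: range(0, 4), 2: range(4, 8), 5: range(8, 12), 7: range(12, 16)}

# The "cpus" lines from the guest output in the example above.
REPORTED_TEXT = """
   node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
   node 3 cpus: 12 13 14 15
   node 4 cpus:
   node 5 cpus:
   node 6 cpus:
   node 7 cpus:
"""

moved = misplaced_cpus(REQUESTED, parse_numactl_cpus(REPORTED_TEXT))
for cpu in sorted(moved):
    print(f"cpu {cpu}: requested node {moved[cpu][0]}, guest reports node {moved[cpu][1]}")
```

For this data, CPUs 0-3 stay where they were requested and CPUs 4-15 are
reported on a different node.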
This seems to be a bug.

He Chen, do you know how QEMU generates the NUMA nodes for the guest?
Can QEMU do a sanity check of the host physical NUMA topology and
generate a smart guest NUMA topology?

I am not sure why this happens, but basically, I launch QEMU like:

-object memory-backend-ram,size=20G,prealloc=yes,host-nodes=0,policy=bind,id=node0 \
-numa node,nodeid=0,cpus=0-14,cpus=60-74,cpus=120-134,cpus=180-194,memdev=node0 \
-object memory-backend-ram,size=20G,prealloc=yes,host-nodes=1,policy=bind,id=node1 \
-numa node,nodeid=1,cpus=15-29,cpus=75-89,cpus=135-149,cpus=195-209,memdev=node1 \
-object memory-backend-ram,size=20G,prealloc=yes,host-nodes=2,policy=bind,id=node2 \
-numa node,nodeid=2,cpus=30-44,cpus=90-104,cpus=150-164,cpus=210-224,memdev=node2 \
-object memory-backend-ram,size=20G,prealloc=yes,host-nodes=3,policy=bind,id=node3 \
-numa node,nodeid=3,cpus=45-59,cpus=105-119,cpus=165-179,cpus=225-239,memdev=node3 \
-object memory-backend-ram,size=3G,prealloc=yes,host-nodes=4,policy=bind,id=node4 \
-numa node,nodeid=4,memdev=node4 \
-object memory-backend-ram,size=3G,prealloc=yes,host-nodes=5,policy=bind,id=node5 \
-numa node,nodeid=5,memdev=node5 \
-object memory-backend-ram,size=3G,prealloc=yes,host-nodes=6,policy=bind,id=node6 \
-numa node,nodeid=6,memdev=node6 \
-object memory-backend-ram,size=3G,prealloc=yes,host-nodes=7,policy=bind,id=node7 \
-numa node,nodeid=7,memdev=node7 \

(Please ignore the complex cpus parameters...)
As you can see, the pair of `-object memory-backend-ram` and `-numa` is
used to specify where the memory of the guest NUMA node is allocated
from. It works well for me :-)
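
As a rough illustration of that pairing, the Python sketch below builds
the two arguments for one guest node. It only emits the options that
appear in the command line above; the function name numa_node_args and
its parameters are hypothetical, and this is not how libvirt generates
its command line.

```python
def numa_node_args(nodeid, size, host_node, cpus=()):
    """Build the -object/-numa argument pair for one guest NUMA node.

    nodeid    -- guest NUMA node id
    size      -- memory size string as QEMU accepts it, e.g. "3G"
    host_node -- host NUMA node the memory is bound to
    cpus      -- iterable of cpu range strings, e.g. ["0-14", "60-74"];
                 leave empty for a memory-only node
    """
    backend_id = f"node{nodeid}"
    backend = (
        f"memory-backend-ram,size={size},prealloc=yes,"
        f"host-nodes={host_node},policy=bind,id={backend_id}"
    )
    numa = [f"node,nodeid={nodeid}"]
    numa += [f"cpus={c}" for c in cpus]
    # memdev ties the guest node to the backend object defined above.
    numa.append(f"memdev={backend_id}")
    return ["-object", backend, "-numa", ",".join(numa)]


# A DDR-backed node with CPUs and a memory-only node bound to host node 4.
args = numa_node_args(0, "20G", 0, ["0-14", "60-74", "120-134", "180-194"])
args += numa_node_args(4, "3G", 4)
print(" ".join(args))
```

Passing an empty cpus list yields the memory-only form used for nodes 4-7.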

When "mcdram" is present in a "cell", we bind it to the physical NUMA
node by specifying the "object":

<numa>
  <cell id='1' mcdram='16' unit='GiB'/>
</numa>

So before considering MCDRAM / Phi, we need to fix this more basic NUMA
topology setup.
Now here I'd like to discuss these questions:
1. This feature is only for Phi at present, and we
    will check the Phi platform for CPU-less NUMA nodes.
    A NUMA node without CPUs indicates an MCDRAM node.
We should not assume such semantics - it is a concept that is specific to
particular Intel x86_64 CPUs. We need to consider that other architectures
may have nodes without CPUs that are backed by normal DDR.
IOW, we should be explicit about the presence of MCDRAM in the host.

Agreed, but for KNL, this is how we detect MCDRAM on the host:
1. Detect that the CPU family is Xeon Phi X200 (meaning KNL).
2. Enumerate all NUMA nodes and regard the nodes that contain memory
only as MCDRAM nodes.
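
The two steps above can be sketched in Python as follows. This is an
illustration only, with several assumptions: KNL is identified as CPU
family 6, model 87 in /proc/cpuinfo (not stated in this thread), a node
counts as "memory only" when its sysfs cpulist is empty (the sketch does
not check that the node actually has memory), and all function names are
hypothetical.

```python
import os

# Assumption: Knights Landing reports family 6, model 87 in /proc/cpuinfo.
KNL_FAMILY, KNL_MODEL = "6", "87"


def is_knl(cpuinfo_text):
    """Step 1: check the family/model of the first CPU listed."""
    fields = {}
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields.setdefault(key.strip(), value.strip())
    return fields.get("cpu family") == KNL_FAMILY and fields.get("model") == KNL_MODEL


def memory_only_nodes(cpulists):
    """Step 2: node ids whose cpulist is empty, from {node id: cpulist text}."""
    return sorted(node for node, cpulist in cpulists.items() if not cpulist.strip())


def read_node_cpulists(root="/sys/devices/system/node"):
    """Read {node id: cpulist text} from Linux sysfs; empty dict if unavailable."""
    cpulists = {}
    if not os.path.isdir(root):
        return cpulists
    for entry in os.listdir(root):
        if entry.startswith("node") and entry[4:].isdigit():
            with open(os.path.join(root, entry, "cpulist")) as f:
                cpulists[int(entry[4:])] = f.read()
    return cpulists


def mcdram_nodes(cpuinfo_text, cpulists):
    """Combine both steps: CPU-less nodes count as MCDRAM only on KNL."""
    return memory_only_nodes(cpulists) if is_knl(cpuinfo_text) else []


# Hypothetical host data, standing in for /proc/cpuinfo and sysfs contents.
SAMPLE_CPUINFO = "processor\t: 0\ncpu family\t: 6\nmodel\t\t: 87\nmodel name\t: example\n"
SAMPLE_CPULISTS = {0: "0-63\n", 1: "\n"}
print(mcdram_nodes(SAMPLE_CPUINFO, SAMPLE_CPULISTS))
```

On a real host the inputs would come from open("/proc/cpuinfo").read() and
read_node_cpulists().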

When "mcdram" is present in a "cell", we detect the MCDRAM, do some
checks, and bind it to the physical NUMA node:

<numa>
  <cell id='1' mcdram='16' unit='GiB'/>
</numa>

...

Thanks,
-He

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list