Re: [RFC] phi support in libvirt

"Daniel P. Berrange" <berrange@xxxxxxxxxx> · Wed, 7 Dec 2016 15:11:49 +0000

On Mon, Dec 05, 2016 at 04:12:22PM +0000, Feng, Shaohe wrote:
> Hi all:
> 
> As we know, Intel® Xeon Phi targets high-performance computing and other
> parallel workloads.
> Now that QEMU supports Phi virtualization, it is time for libvirt to support
> Phi.

Can you provide a pointer to the relevant QEMU changes?

> Unlike a traditional x86 server, Phi has a special NUMA node with
> Multi-Channel DRAM (MCDRAM) but without any CPUs.
> 
> Currently libvirt requires a non-empty 'cpus' attribute for each NUMA node, such as:
> <numa>
>   <cell id='0' cpus='0-239' memory='80' unit='GiB'/>
>   <cell id='1' cpus='240-243' memory='16' unit='GiB'/>
> </numa>
> 
> In order to support Phi virtualization, libvirt needs to allow a NUMA cell
> definition without the 'cpus' attribute.
>
> Such as:
> <numa>
>   <cell id='0' cpus='0-239' memory='80' unit='GiB'/>
>   <cell id='1' memory='16' unit='GiB'/>
> </numa>
>
> When a cell has no 'cpus' attribute, QEMU will by default allocate its memory from MCDRAM instead of DDR.

There are separate concepts at play which your description here is
mixing up.

First is the question of whether the guest NUMA node can be created
with only RAM or CPUs, or a mix of both.

Second is the question of what kind of host RAM (MCDRAM vs DDR) is
used as the backing store for the guest.

These are separate configuration items which don't need to be
conflated in libvirt.  i.e. we should be able to create a guest
with a node containing only memory, and back that by DDR on
the host. Conversely we should be able to create a guest
with a node containing memory + cpus and back that by MCDRAM
on the host (even if that means the vCPUs will end up on a
different host node from their RAM).
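
To illustrate keeping the two settings apart, here is a rough Python
sketch that emits the guest topology (<cpu><numa>) and the host backing
(<numatune><memnode>) independently. The host node numbers (0 = DDR,
1 = MCDRAM) and the helper name are assumptions made up for the example,
and the cell without 'cpus' is the syntax proposed in this thread, not
something libvirt accepts today.

```python
import xml.etree.ElementTree as ET


def build_domain_fragment(cells, host_bindings):
    """Build <cpu><numa> (guest topology) and <numatune> (host backing).

    cells: list of dicts with 'id', 'memory_gib' and an optional 'cpus' range.
    host_bindings: maps guest cell id -> host nodeset string.
    """
    domain = ET.Element("domain")
    numa = ET.SubElement(ET.SubElement(domain, "cpu"), "numa")
    for cell in cells:
        attrs = {"id": str(cell["id"])}
        if cell.get("cpus"):  # omitted for a memory-only guest node
            attrs["cpus"] = cell["cpus"]
        attrs["memory"] = str(cell["memory_gib"])
        attrs["unit"] = "GiB"
        ET.SubElement(numa, "cell", attrs)
    numatune = ET.SubElement(domain, "numatune")
    for cell_id, nodeset in sorted(host_bindings.items()):
        ET.SubElement(numatune, "memnode",
                      {"cellid": str(cell_id), "mode": "strict",
                       "nodeset": nodeset})
    return domain


HOST_DDR, HOST_MCDRAM = "0", "1"  # hypothetical host node numbers

# Guest node 1 has memory only, yet is backed by ordinary DDR on the host.
memory_only_on_ddr = build_domain_fragment(
    [{"id": 0, "cpus": "0-239", "memory_gib": 80},
     {"id": 1, "memory_gib": 16}],
    {0: HOST_DDR, 1: HOST_DDR})

# Guest node 1 has CPUs and memory, yet is backed by MCDRAM on the host.
cpus_on_mcdram = build_domain_fragment(
    [{"id": 0, "cpus": "0-239", "memory_gib": 80},
     {"id": 1, "cpus": "240-243", "memory_gib": 16}],
    {0: HOST_DDR, 1: HOST_MCDRAM})

for fragment in (memory_only_on_ddr, cpus_on_mcdram):
    print(ET.tostring(fragment, encoding="unicode"))
```

Neither combination needs the guest topology to imply anything about
the type of host memory behind it.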

On the first point, there still appears to be some brokenness
in either QEMU or Linux wrt configuration of virtual NUMA
where either cpus or memory are absent from nodes.

e.g. if I launch QEMU with

    -numa node,nodeid=0,cpus=0-3,mem=512
    -numa node,nodeid=1,mem=512
    -numa node,nodeid=2,cpus=4-7
    -numa node,nodeid=3,mem=512
    -numa node,nodeid=4,mem=512
    -numa node,nodeid=5,cpus=8-11
    -numa node,nodeid=6,mem=1024
    -numa node,nodeid=7,cpus=12-15,mem=1024

then the guest reports

  # numactl --hardware
  available: 6 nodes (0,3-7)
  node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
  node 0 size: 487 MB
  node 0 free: 230 MB
  node 3 cpus: 12 13 14 15
  node 3 size: 1006 MB
  node 3 free: 764 MB
  node 4 cpus:
  node 4 size: 503 MB
  node 4 free: 498 MB
  node 5 cpus:
  node 5 size: 503 MB
  node 5 free: 499 MB
  node 6 cpus:
  node 6 size: 503 MB
  node 6 free: 498 MB
  node 7 cpus:
  node 7 size: 943 MB
  node 7 free: 939 MB

so it's pushed all the CPUs from nodes without RAM (nodes 2 and 5)
into node 0, and moved the CPUs from node 7 into node 3.
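
For anyone wanting to reproduce the comparison, here is a rough Python
sketch that parses the -numa arguments and the "cpus:" lines of the
numactl output quoted above and lists every CPU whose node differs. The
helper names are mine and the parsing handles only the simple forms
shown in this mail.

```python
import re

QEMU_ARGS = """
-numa node,nodeid=0,cpus=0-3,mem=512
-numa node,nodeid=1,mem=512
-numa node,nodeid=2,cpus=4-7
-numa node,nodeid=3,mem=512
-numa node,nodeid=4,mem=512
-numa node,nodeid=5,cpus=8-11
-numa node,nodeid=6,mem=1024
-numa node,nodeid=7,cpus=12-15,mem=1024
"""

NUMACTL_CPUS = """
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
node 3 cpus: 12 13 14 15
node 4 cpus:
node 5 cpus:
node 6 cpus:
node 7 cpus:
"""


def expand_range(spec):
    """Turn '4-7' into [4, 5, 6, 7] and '5' into [5]."""
    low, _, high = spec.partition("-")
    return list(range(int(low), int(high or low) + 1))


def configured_cpu_nodes(text):
    """Map each vCPU to the node id given on the QEMU command line."""
    mapping = {}
    for line in text.split("\n"):
        line = line.strip()
        if not line.startswith("-numa node,"):
            continue
        opts = dict(kv.split("=", 1)
                    for kv in line[len("-numa node,"):].split(","))
        if "cpus" in opts:
            for cpu in expand_range(opts["cpus"]):
                mapping[cpu] = int(opts["nodeid"])
    return mapping


def reported_cpu_nodes(text):
    """Map each CPU to the node id the guest's numactl reports."""
    mapping = {}
    for match in re.finditer(r"node (\d+) cpus:([ \d]*)$", text, re.M):
        for cpu in match.group(2).split():
            mapping[int(cpu)] = int(match.group(1))
    return mapping


configured = configured_cpu_nodes(QEMU_ARGS)
reported = reported_cpu_nodes(NUMACTL_CPUS)
# CPUs whose reported node differs from the configured one.
moved = {cpu: (configured[cpu], reported.get(cpu))
         for cpu in sorted(configured)
         if configured[cpu] != reported.get(cpu)}

for cpu, (want, got) in moved.items():
    print(f"cpu {cpu}: configured node {want}, guest reports node {got}")
```

Only CPUs 0-3 end up where they were configured.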

So before considering MCDRAM / Phi, we need to fix this more
basic NUMA topology setup.

> Now here I'd like to discuss these questions:
> 1. This feature is only for Phi at present, but we
>    will check Phi platform for CPU-less NUMA node.
>    The NUMA node without CPU indicates MCDRAM node.

We should not assume such semantics - it is a concept
that is specific to particular Intel x86_64 CPUs. We
need to consider that other architectures may have
nodes without CPUs that are backed by normal DDR.
IOW, we should be explicit about the presence of MCDRAM
in the host.

>    And for now MCDRAM is available only for PHI.
> However, there is no reason that any other platform
> couldn’t define CPU-less NUMA node using libvirt, so
> there is no reason to check if PHI is used or not.

> 2. Type of memory of CPU-less NUMA node will not be
> checked during machine creation/configuration step.
> There is no reliable way to distinguish if the node
> is MCDRAM or regular DDR. This step is not concerned
> with type of the memory, only with NUMA assignment.

If we can't distinguish MCDRAM from DDR that's a problem
for apps, given your next point about MCDRAM not supporting
over commit.

> 3. Unlike traditional memory assigned to a VM, MCDRAM does not
> support over commit.
>  If the memory of a virtual NUMA node is going to be
> explicitly bound to physical NUMA node then it shouldn’t
> exceed the size of its corresponding NUMA node, doesn’t
> matter if it is MCDRAM or DDR.

It is valid to bind guests to NUMA nodes and still have
memory over commit, so we do need to know if a host node
is using MCDRAM or DDR, so apps can determine whether
that node supports over commit or not.
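
As a purely hypothetical sketch of what an app would do with that
information: the HostNode structure and its mem_type field below are
invented for illustration and do not correspond to anything libvirt
reports today, and the rule that MCDRAM cannot be over committed is
taken from your point 3.

```python
from dataclasses import dataclass


@dataclass
class HostNode:
    node_id: int
    size_mib: int
    mem_type: str  # "DDR" or "MCDRAM"; hypothetical, not a libvirt field


def placement_allowed(node, already_bound_mib, request_mib):
    """Decide whether a guest node may be bound to this host node.

    Assumes (per this thread) that MCDRAM cannot be over committed,
    while DDR-backed nodes may be.
    """
    if node.mem_type == "MCDRAM":
        return already_bound_mib + request_mib <= node.size_mib
    return True


ddr = HostNode(node_id=0, size_mib=80 * 1024, mem_type="DDR")
mcdram = HostNode(node_id=1, size_mib=16 * 1024, mem_type="MCDRAM")

# 12 GiB already bound, 8 GiB requested: over commit on either node.
print(placement_allowed(ddr, 12 * 1024, 8 * 1024))
print(placement_allowed(mcdram, 12 * 1024, 8 * 1024))
```

Without the memory type exposed somewhere, the app has no way to make
this distinction.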

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://entangle-photo.org       -o-    http://search.cpan.org/~danberr/ :|
