On Thu, Jan 09, 2020 at 05:18:02PM +0100, Michal Privoznik wrote:
> Dear list,
>
> QEMU gained support for configuring HMAT recently (see v4.2.0-415-g9b12dfa03a
> and friends). HMAT stands for Heterogeneous Memory Attribute Table and defines
> various attributes to NUMA. Guest OS/app can read these information and fine
> tune optimization. See [1] for more info (esp. links in the transcript).
>
> QEMU defines so called initiator, which is an attribute to a NUMA node and if
> specified points to another node that has the best performance to this node.
>
> For instance:
>
>   -machine hmat=on \
>   -m 2G,slots=2,maxmem=4G \
>   -object memory-backend-ram,size=1G,id=m0 \
>   -object memory-backend-ram,size=1G,id=m1 \
>   -numa node,nodeid=0,memdev=m0 \
>   -numa node,nodeid=1,memdev=m1,initiator=0 \
>   -smp 2,sockets=2,maxcpus=2 \
>   -numa cpu,node-id=0,socket-id=0 \
>   -numa cpu,node-id=0,socket-id=1
>
> creates a machine with 2 NUMA nodes, node 0 has CPUs and node 1 has memory only
> and it's initiator is node 0 (yes, HMAT allows you to create CPU-less "NUMA"
> nodes). The initiator of node 0 is not specified, but since the node has at
> least one CPU it is initiator to itself (and has to be per specs).
>
> This could be represented by an attribute to our /domain/cpu/numa/cell element.
> For instance like this:
>
>   <domain>
>     <vcpu>2</vcpu>
>     <cpu>
>       <numa>
>         <cell id='0' cpus='0,1' memory='1' unit='GiB'/>
>         <cell id='1' memory='1' unit='GiB' initiator='0'/>
>       </numa>
>     </cpu>
>   </domain>

We've gained an 'initiator' attribute on the cell, and 'cpus' is optional
if 'initiator' is present.

Can we have the opposite - nodes with CPUs, but without local memory ?

eg

  <cell id='0' cpus='0,1' unit='GiB'/>

> Then, QEMU allows us to control two other important memory attributes:
>
> 1) hmat-lb for Latency and Bandwidth
>
> 2) hmat-cache for cache attributes
>
> For example:
>
>   -machine hmat=on \
>   -m 2G,slots=2,maxmem=4G \
>   -object memory-backend-ram,size=1G,id=m0 \
>   -object memory-backend-ram,size=1G,id=m1 \
>   -smp 2,sockets=2,maxcpus=2 \
>   -numa node,nodeid=0,memdev=m0 \
>   -numa node,nodeid=1,memdev=m1,initiator=0 \
>   -numa cpu,node-id=0,socket-id=0 \
>   -numa cpu,node-id=0,socket-id=1 \
>   -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=5 \
>   -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=200M \
>   -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=10 \
>   -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=100M \
>   -numa hmat-cache,node-id=0,size=10K,level=1,associativity=direct,policy=write-back,line=8 \
>   -numa hmat-cache,node-id=1,size=10K,level=1,associativity=direct,policy=write-back,line=8
>
> This extends previous example by defining some latencies and cache attributes.
> The node 0 has access latency of 5 ns and bandwidth of 200MB/s and node 1 has
> access latency of 10ns and bandwidth of only 100MB/s. The memory cache level 1
> on both nodes is 10KB, cache line is 8B long with write-back policy and direct
> associativity (whatever that means).

This description doesn't match my understanding of the semantics for these
latency options. Your description here is talking about latency of a single
node at a time. I believe these configs are talking about latency of the
*link* between two nodes.

So

  -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=5

is a local node access latency as src+dst nodes are the same, but

  -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=10

is a cross-node access latency for the link between node 0 and node 1.

> For better future extensibility I'd express these as separate elements, rather
> than attributes to <cell/> element. For instance like this:
>
>   <domain>
>     <vcpu>2</vcpu>
>     <cpu>
>       <numa>
>         <cell id='0' cpus='0,1' memory='1' unit='GiB'>
>           <latencies>
>             <latency type='access' value='5'/>
>             <bandwidth type='access' unit='MiB' value='200'/>
>           </latencies>
>           <caches>
>             <cache level='1' associativity='direct' policy='write-back'>
>               <size unit='KiB' value='10'/>
>               <line unit='B' value='8'/>
>             </cache>
>           </caches>
>         </cell>
>         <cell id='1' memory='1' unit='GiB' initiator='0'>
>           <latencies>
>             <latency type='access' value='10'/>
>             <bandwidth type='access' unit='MiB' value='100'/>
>           </latencies>
>           <caches>
>             <cache level='1' associativity='direct' policy='write-back'>
>               <size unit='KiB' value='10'/>
>               <line unit='B' value='8'/>
>             </cache>
>           </caches>
>         </cell>
>       </numa>

We shouldn't have <latencies> as a child of the <cell>, because we need to
describe the latencies for the cross-product of all cells. Putting latency
as a child of a cell means we would have 2 possible places to put the same
information - either the source or target node.

The <caches> info is ok as a child of <cell>, though I'd prefer to cull the
extra <caches> wrapper and make <cache> a direct child - we can still allow
<cache> to be listed multiple times under <cell> without the extra element.
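Putting those two points together, the rough shape I have in mind for the
example above would be something like this - the element and attribute names
here are purely illustrative, not a worked out schema proposal:

  <numa>
    <cell id='0' cpus='0,1' memory='1' unit='GiB'>
      <!-- per-node cache info stays with its cell, no wrapper element -->
      <cache level='1' associativity='direct' policy='write-back'>
        <size unit='KiB' value='10'/>
        <line unit='B' value='8'/>
      </cache>
    </cell>
    <cell id='1' memory='1' unit='GiB' initiator='0'>
      <cache level='1' associativity='direct' policy='write-back'>
        <size unit='KiB' value='10'/>
        <line unit='B' value='8'/>
      </cache>
    </cell>
    <!-- link data lives alongside the cells, keyed by initiator/target,
         so each latency/bandwidth figure has exactly one home -->
    <interconnects>
      <latency initiator='0' target='0' type='access' value='5'/>
      <bandwidth initiator='0' target='0' type='access' unit='MiB' value='200'/>
      <latency initiator='0' target='1' type='access' value='10'/>
      <bandwidth initiator='0' target='1' type='access' unit='MiB' value='100'/>
    </interconnects>
  </numa>

That way the cross-product is expressed once, in one place, and scaling to
more nodes just means adding further <latency>/<bandwidth> entries rather
than duplicating the same data under both the source and target cell.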
>     </cpu>
>   </domain>
>
> Thing is, the @hierarchy argument accepts: memory (referring to whole memory),
> or first-level|second-level|third-level (referring to side caches for each
> domain). I haven't figured out yet, how to express the levels in XML yet.
>
> The @data-type argument accepts access|read|write (this is expressed by @type
> attribute to <latency/> and <bandwidth/> elements). Latency and bandwidth can
> be combined with each type: access-latency, read-latency, write-latency,
> access-bandwidth, read-bandwidth, write-bandwidth. And these 6 can then be
> combined with aforementioned @hierarchy, producing 24 combinations (if I read
> qemu cmd line specs correctly [2]).

Regards,
Daniel

-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|