On Thu, Jan 09, 2020 at 05:18:02PM +0100, Michal Privoznik wrote:
> Dear list,
>
> QEMU gained support for configuring HMAT recently (see v4.2.0-415-g9b12dfa03a
> and friends). HMAT stands for Heterogeneous Memory Attribute Table and defines
> various attributes to NUMA. Guest OS/app can read these information and fine
> tune optimization. See [1] for more info (esp. links in the transcript).
>
> QEMU defines so called initiator, which is an attribute to a NUMA node and if
> specified points to another node that has the best performance to this node.
>
> For instance:
>
>   -machine hmat=on \
>   -m 2G,slots=2,maxmem=4G \
>   -object memory-backend-ram,size=1G,id=m0 \
>   -object memory-backend-ram,size=1G,id=m1 \
>   -numa node,nodeid=0,memdev=m0 \
>   -numa node,nodeid=1,memdev=m1,initiator=0 \
>   -smp 2,sockets=2,maxcpus=2 \
>   -numa cpu,node-id=0,socket-id=0 \
>   -numa cpu,node-id=0,socket-id=1
>
> creates a machine with 2 NUMA nodes, node 0 has CPUs and node 1 has memory only
> and it's initiator is node 0 (yes, HMAT allows you to create CPU-less "NUMA"
> nodes). The initiator of node 0 is not specified, but since the node has at
> least one CPU it is initiator to itself (and has to be per specs).
>
> This could be represented by an attribute to our /domain/cpu/numa/cell element.
> For instance like this:
>
>   <domain>
>     <vcpu>2</vcpu>
>     <cpu>
>       <numa>
>         <cell id='0' cpus='0,1' memory='1' unit='GiB'/>
>         <cell id='1' memory='1' unit='GiB' initiator='0'/>
>       </numa>
>     </cpu>
>   </domain>

We've gained an 'initiator' attribute on the cell, and 'cpus' is optional
if 'initiator' is present.

Can we have the opposite - nodes with CPUs, but without local memory ?

eg

  <cell id='0' cpus='0,1' unit='GiB'/>

> Then, QEMU allows us to control two other important memory attributes:
>
> 1) hmat-lb for Latency and Bandwidth
>
> 2) hmat-cache for cache attributes
>
> For example:
>
>   -machine hmat=on \
>   -m 2G,slots=2,maxmem=4G \
>   -object memory-backend-ram,size=1G,id=m0 \
>   -object memory-backend-ram,size=1G,id=m1 \
>   -smp 2,sockets=2,maxcpus=2 \
>   -numa node,nodeid=0,memdev=m0 \
>   -numa node,nodeid=1,memdev=m1,initiator=0 \
>   -numa cpu,node-id=0,socket-id=0 \
>   -numa cpu,node-id=0,socket-id=1 \
>   -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=5 \
>   -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=200M \
>   -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=10 \
>   -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=100M \
>   -numa hmat-cache,node-id=0,size=10K,level=1,associativity=direct,policy=write-back,line=8 \
>   -numa hmat-cache,node-id=1,size=10K,level=1,associativity=direct,policy=write-back,line=8
>
> This extends previous example by defining some latencies and cache attributes.
> The node 0 has access latency of 5 ns and bandwidth of 200MB/s and node 1 has
> access latency of 10ns and bandwidth of only 100MB/s. The memory cache level 1
> on both nodes is 10KB, cache line is 8B long with write-back policy and direct
> associativity (whatever that means).

This description doesn't match my understanding of the semantics for these
latency options. Your description here is talking about latency of a single
node at a time. I believe these configs are talking about latency of the
*link* between two nodes.

So

  -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=5

is a local node access latency as src+dst nodes are the same, but

  -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=10

is a cross-node access latency for the link between node 0 and node 1.

> For better future extensibility I'd express these as separate elements, rather
> than attributes to <cell/> element. For instance like this:
>
>   <domain>
>     <vcpu>2</vcpu>
>     <cpu>
>       <numa>
>         <cell id='0' cpus='0,1' memory='1' unit='GiB'>
>           <latencies>
>             <latency type='access' value='5'/>
>             <bandwidth type='access' unit='MiB' value='200'/>
>           </latencies>
>           <caches>
>             <cache level='1' associativity='direct' policy='write-back'>
>               <size unit='KiB' value='10'/>
>               <line unit='B' value='8'/>
>             </cache>
>           </caches>
>         </cell>
>         <cell id='1' memory='1' unit='GiB' initiator='0'>
>           <latencies>
>             <latency type='access' value='10'/>
>             <bandwidth type='access' unit='MiB' value='100'/>
>           </latencies>
>           <caches>
>             <cache level='1' associativity='direct' policy='write-back'>
>               <size unit='KiB' value='10'/>
>               <line unit='B' value='8'/>
>             </cache>
>           </caches>
>         </cell>
>       </numa>

We shouldn't have <latencies> as a child of the <cell>, because we need to
describe the latencies for the cross-product of all cells. Putting latency
as a child of a cell means we would have 2 possible places to put the same
information - either the source or target node.

The <caches> info is ok as a child of <cell>, though I'd prefer to cull the
extra <caches> wrapper and make <cache> a direct child - we can still allow
<cache> to be listed multiple times under <cell> without the extra element.
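Putting those two points together, the rough shape I have in mind for the
example above would be something like this - the element and attribute names
here are purely illustrative, not a worked out schema proposal:

  <numa>
    <cell id='0' cpus='0,1' memory='1' unit='GiB'>
      <!-- per-node cache info stays with its cell, no wrapper element -->
      <cache level='1' associativity='direct' policy='write-back'>
        <size unit='KiB' value='10'/>
        <line unit='B' value='8'/>
      </cache>
    </cell>
    <cell id='1' memory='1' unit='GiB' initiator='0'>
      <cache level='1' associativity='direct' policy='write-back'>
        <size unit='KiB' value='10'/>
        <line unit='B' value='8'/>
      </cache>
    </cell>
    <!-- link data lives alongside the cells, keyed by initiator/target,
         so each latency/bandwidth figure has exactly one home -->
    <interconnects>
      <latency initiator='0' target='0' type='access' value='5'/>
      <bandwidth initiator='0' target='0' type='access' unit='MiB' value='200'/>
      <latency initiator='0' target='1' type='access' value='10'/>
      <bandwidth initiator='0' target='1' type='access' unit='MiB' value='100'/>
    </interconnects>
  </numa>

That way the cross-product is expressed once, in one place, and scaling to
more nodes just means adding further <latency>/<bandwidth> entries rather
than duplicating the same data under both the source and target cell.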
>     </cpu>
>   </domain>
>
> Thing is, the @hierarchy argument accepts: memory (referring to whole memory),
> or first-level|second-level|third-level (referring to side caches for each
> domain). I haven't figured out yet, how to express the levels in XML yet.
>
> The @data-type argument accepts access|read|write (this is expressed by @type
> attribute to <latency/> and <bandwidth/> elements). Latency and bandwidth can
> be combined with each type: access-latency, read-latency, write-latency,
> access-bandwidth, read-bandwidth, write-bandwidth. And these 6 can then be
> combined with aforementioned @hierarchy, producing 24 combinations (if I read
> qemu cmd line specs correctly [2]).

Regards,
Daniel

-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|