From: Wim ten Have <wim.ten.have@xxxxxxxxxx>

This patch series extends guest domain administration, adding support
to advertise node sibling distances when configuring HVM NUMA guests.

NUMA (non-uniform memory access) is a method of configuring a cluster
of nodes within a single multiprocessing system such that each node
shares its processor-local memory with the others, improving
performance and the ability of the system to be expanded.

A NUMA system could be illustrated as shown below. Within this 4-node
system, every socket is equipped with its own distinct memory. The
whole typically resembles an SMP (symmetric multiprocessing) system:
a "tightly-coupled," "share everything" system in which multiple
processors work under a single operating system and can access each
others' memory over multiple "Bus Interconnect" paths.

    +-----+-----+-----+         +-----+-----+-----+
    |  M  | CPU | CPU |         | CPU | CPU |  M  |
    |  E  |     |     |         |     |     |  E  |
    |  M  +- Socket0 -+         +- Socket3 -+  M  |
    |  O  |     |     |         |     |     |  O  |
    |  R  | CPU | CPU <---------> CPU | CPU |  R  |
    |  Y  |     |     |         |     |     |  Y  |
    +-----+--^--+-----+         +-----+--^--+-----+
             |                           |
             |      Bus Interconnect     |
             |                           |
    +-----+--v--+-----+         +-----+--v--+-----+
    |  M  |     |     |         |     |     |  M  |
    |  E  | CPU | CPU <---------> CPU | CPU |  E  |
    |  M  |     |     |         |     |     |  M  |
    |  O  +- Socket1 -+         +- Socket2 -+  O  |
    |  R  |     |     |         |     |     |  R  |
    |  Y  | CPU | CPU |         | CPU | CPU |  Y  |
    +-----+-----+-----+         +-----+-----+-----+

Contrast this with a flat SMP system (not illustrated), where, as
sockets are added, the bus (data and address path) gets overloaded
under high activity and easily becomes a performance bottleneck. NUMA
adds an intermediate level of memory, shared amongst a few cores per
socket as illustrated above, so that data accesses do not have to
travel over a single bus.

Unfortunately the way NUMA does this adds its own limitations. This,
as visualized in the illustration above, happens when data is stored
in memory associated with Socket2 and is accessed by a CPU (core) in
Socket0. The processors use the "Bus Interconnect" to create gateways
between the sockets (nodes), enabling inter-socket access to memory.
These "Bus Interconnect" hops add data access delays when a CPU (core)
accesses memory associated with a remote socket (node).

For terminology we refer to sockets as "nodes", where access to each
other's distinct resources, such as memory, makes them "siblings" with
a designated "distance" between them. A specific design is described
under the ACPI (Advanced Configuration and Power Interface)
specification, within the chapter explaining the system's SLIT (System
Locality Distance Information Table).

These patches extend core libvirt's XML description of a virtual
machine's hardware to include NUMA distance information for sibling
nodes, which is then passed to Xen guests via libxl. QEMU recently
landed support for constructing the SLIT with commit 0f203430dd
("numa: Allow setting NUMA distance for different NUMA nodes"), so
these core libvirt extensions can also help other drivers support this
feature.

The XML changes allow describing, per <cell> (node/socket), the
<distances> to its <sibling> node identifiers, and propagate these
through the domain NUMA code, finally adding support to libxl.

[below is an example illustrating a 4 node/socket <cell> setup]

  <cpu>
    <numa>
      <cell id='0' cpus='0,4-7' memory='2097152' unit='KiB'>
        <distances>
          <sibling id='0' value='10'/>
          <sibling id='1' value='21'/>
          <sibling id='2' value='31'/>
          <sibling id='3' value='41'/>
        </distances>
      </cell>
      <cell id='1' cpus='1,8-10,12-15' memory='2097152' unit='KiB'>
        <distances>
          <sibling id='0' value='21'/>
          <sibling id='1' value='10'/>
          <sibling id='2' value='21'/>
          <sibling id='3' value='31'/>
        </distances>
      </cell>
      <cell id='2' cpus='2,11' memory='2097152' unit='KiB'>
        <distances>
          <sibling id='0' value='31'/>
          <sibling id='1' value='21'/>
          <sibling id='2' value='10'/>
          <sibling id='3' value='21'/>
        </distances>
      </cell>
      <cell id='3' cpus='3' memory='2097152' unit='KiB'>
        <distances>
          <sibling id='0' value='41'/>
          <sibling id='1' value='31'/>
          <sibling id='2' value='21'/>
          <sibling id='3' value='10'/>
        </distances>
      </cell>
    </numa>
  </cpu>
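For reference, patch 3 adds domxml conversions to and from the xen-xl
native guest-config format for this description. As a rough sketch,
using the vnuma syntax documented in xl.cfg(5), with pnode assignments
chosen for illustration only and not lifted from the series' test
data, the 4-node example above would translate along these lines:

    # hedged sketch of the xl(1) equivalent; pnode values are illustrative
    vnuma = [
        [ "pnode=0", "size=2048", "vcpus=0,4-7",        "vdistances=10,21,31,41" ],
        [ "pnode=1", "size=2048", "vcpus=1,8-10,12-15", "vdistances=21,10,21,31" ],
        [ "pnode=2", "size=2048", "vcpus=2,11",         "vdistances=31,21,10,21" ],
        [ "pnode=3", "size=2048", "vcpus=3",            "vdistances=41,31,21,10" ]
    ]

Here "size" is given in MB (2097152 KiB = 2048 MB), each "vdistances"
row lists that virtual node's distance to nodes 0-3, and "pnode" pins
a virtual node to a physical host node, a placement detail the XML
above does not encode.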
By default on libxl, if no <distances> are given to describe the SLIT
data between different <cell>s, this patch defaults to a scheme using
10 for local and 21 for any remote node/socket, which is what a guest
OS assumes when no SLIT is specified. While the SLIT is optional,
libxl requires that distances are set nonetheless.

On Linux systems the SLIT detail can be listed with the help of the
'numactl -H' command. An HVM guest as described above would report
the following:

  [root@f25 ~]# numactl -H
  available: 4 nodes (0-3)
  node 0 cpus: 0 4 5 6 7
  node 0 size: 1988 MB
  node 0 free: 1743 MB
  node 1 cpus: 1 8 9 10 12 13 14 15
  node 1 size: 1946 MB
  node 1 free: 1885 MB
  node 2 cpus: 2 11
  node 2 size: 2011 MB
  node 2 free: 1912 MB
  node 3 cpus: 3
  node 3 size: 2010 MB
  node 3 free: 1980 MB
  node distances:
  node   0   1   2   3
    0:  10  21  31  41
    1:  21  10  21  31
    2:  31  21  10  21
    3:  41  31  21  10

Wim ten Have (4):
  numa: describe siblings distances within cells
  libxl: vnuma support
  xenconfig: add domxml conversions for xen-xl
  xlconfigtest: add tests for numa cell sibling distances

 docs/formatdomain.html.in                  |  70 ++++-
 docs/schemas/basictypes.rng                |   9 +
 docs/schemas/cputypes.rng                  |  18 ++
 src/conf/cpu_conf.c                        |   2 +-
 src/conf/numa_conf.c                       | 323 +++++++++++++++++++-
 src/conf/numa_conf.h                       |  25 +-
 src/libvirt_private.syms                   |   6 +
 src/libxl/libxl_conf.c                     | 120 ++++++++
 src/libxl/libxl_driver.c                   |   3 +-
 src/xenconfig/xen_xl.c                     | 333 +++++++++++++++++++++
 .../test-fullvirt-vnuma-nodistances.cfg    |  26 ++
 .../test-fullvirt-vnuma-nodistances.xml    |  53 ++++
 tests/xlconfigdata/test-fullvirt-vnuma.cfg |  26 ++
 tests/xlconfigdata/test-fullvirt-vnuma.xml |  81 +++++
 tests/xlconfigtest.c                       |   4 +
 15 files changed, 1089 insertions(+), 10 deletions(-)
 create mode 100644 tests/xlconfigdata/test-fullvirt-vnuma-nodistances.cfg
 create mode 100644 tests/xlconfigdata/test-fullvirt-vnuma-nodistances.xml
 create mode 100644 tests/xlconfigdata/test-fullvirt-vnuma.cfg
 create mode 100644 tests/xlconfigdata/test-fullvirt-vnuma.xml

-- 
2.9.5

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list