There are many different settings required to configure a KVM guest for real
time, low latency workloads. The documentation included here is based on
guidance developed & tested by the Red Hat KVM real time team.

Signed-off-by: Daniel P. Berrangé <berrange@xxxxxxxxxx>
---
 docs/kbase.html.in          |   3 +
 docs/kbase/kvm-realtime.rst | 213 ++++++++++++++++++++++++++++++++++++
 2 files changed, 216 insertions(+)
 create mode 100644 docs/kbase/kvm-realtime.rst

diff --git a/docs/kbase.html.in b/docs/kbase.html.in
index c586e0f676..e663ca525f 100644
--- a/docs/kbase.html.in
+++ b/docs/kbase.html.in
@@ -36,6 +36,9 @@
       <dt><a href="kbase/virtiofs.html">Virtio-FS</a></dt>
       <dd>Share a filesystem between the guest and the host</dd>
+
+      <dt><a href="kbase/kvm-realtime.html">KVM real time</a></dt>
+      <dd>Run real time workloads in guests on a KVM hypervisor</dd>
     </dl>
   </div>

diff --git a/docs/kbase/kvm-realtime.rst b/docs/kbase/kvm-realtime.rst
new file mode 100644
index 0000000000..ac6102879b
--- /dev/null
+++ b/docs/kbase/kvm-realtime.rst
@@ -0,0 +1,213 @@
==========================
KVM Real Time Guest Config
==========================

.. contents::

The KVM hypervisor is capable of running real time guest workloads. This page
describes the key pieces of configuration required in the domain XML to achieve
the low latency needs of real time workloads.

For the most part, configuration of the host OS is out of scope of this
documentation. Refer to the operating system vendor's guidance on configuring
the host OS and hardware for real time. Note in particular that the default
kernel used by most Linux distros is not suitable for low latency real time and
must be replaced by a special kernel build.


Host partitioning plan
======================

Running real time workloads requires carefully partitioning the host OS
resources, such that the KVM / QEMU processes are strictly separated from any
other workload running on the host, both userspace processes and kernel
threads.

As such, a subset of host CPUs needs to be reserved exclusively for running
KVM guests. This requires that the host kernel be booted using the ``isolcpus``
kernel command line parameter. This parameter removes a set of CPUs from the
scheduler, such that no kernel threads or userspace processes will ever be
placed on those CPUs automatically. KVM guests are then manually placed onto
these CPUs.

Deciding which host CPUs to reserve for real time requires an understanding of
the guest workload needs, balanced against the host OS needs. The trade off
will also vary based on the physical hardware available.

For the sake of illustration, this guide will assume a physical machine with
two NUMA nodes, each with 2 sockets and 4 cores, giving a total of 16 CPUs on
the host. Furthermore, it is assumed that hyperthreading is either not
supported or has been disabled in the BIOS, since it is incompatible with real
time. Each NUMA node is assumed to have 32 GB of RAM, giving 64 GB total for
the host.

It is assumed that 2 CPUs in each NUMA node are reserved for the host OS, with
the remaining 6 CPUs in each node available for KVM real time. With this in
mind, the host kernel should be booted with ``isolcpus=2-7,10-15`` to reserve
those CPUs.

To maximise efficiency of page table lookups for the guest, the host needs to
be configured with most RAM exposed as huge pages, ideally 1 GB sized. 6 GB of
RAM in each NUMA node will be reserved for general host OS usage as normal
sized pages, leaving 26 GB in each node for KVM usage as huge pages.
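For illustration only, assuming the hypothetical layout above, the CPU
isolation and huge page reservation might be requested on the kernel command
line, or the pages reserved per node after boot via sysfs. Boot-time allocation
is typically spread evenly across NUMA nodes, while runtime allocation of 1 GB
pages can fail once memory has become fragmented:

::

  # Kernel command line fragment: isolate CPUs 2-7 and 10-15 for guests and
  # pre-allocate 52 x 1 GB huge pages (26 per NUMA node) at boot
  isolcpus=2-7,10-15 default_hugepagesz=1G hugepagesz=1G hugepages=52

  # Alternatively, attempt the per-node reservation after boot via sysfs
  echo 26 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
  echo 26 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages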
Once huge pages are reserved on the hypothetical machine, the ``virsh
capabilities`` command output is expected to look approximately like:

::

  <topology>
    <cells num='2'>
      <cell id='0'>
        <memory unit='KiB'>33554432</memory>
        <pages unit='KiB' size='4'>1572864</pages>
        <pages unit='KiB' size='2048'>0</pages>
        <pages unit='KiB' size='1048576'>26</pages>
        <distances>
          <sibling id='0' value='10'/>
          <sibling id='1' value='21'/>
        </distances>
        <cpus num='8'>
          <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
          <cpu id='1' socket_id='0' core_id='1' siblings='1'/>
          <cpu id='2' socket_id='0' core_id='2' siblings='2'/>
          <cpu id='3' socket_id='0' core_id='3' siblings='3'/>
          <cpu id='4' socket_id='1' core_id='0' siblings='4'/>
          <cpu id='5' socket_id='1' core_id='1' siblings='5'/>
          <cpu id='6' socket_id='1' core_id='2' siblings='6'/>
          <cpu id='7' socket_id='1' core_id='3' siblings='7'/>
        </cpus>
      </cell>
      <cell id='1'>
        <memory unit='KiB'>33554432</memory>
        <pages unit='KiB' size='4'>1572864</pages>
        <pages unit='KiB' size='2048'>0</pages>
        <pages unit='KiB' size='1048576'>26</pages>
        <distances>
          <sibling id='0' value='21'/>
          <sibling id='1' value='10'/>
        </distances>
        <cpus num='8'>
          <cpu id='8' socket_id='0' core_id='0' siblings='8'/>
          <cpu id='9' socket_id='0' core_id='1' siblings='9'/>
          <cpu id='10' socket_id='0' core_id='2' siblings='10'/>
          <cpu id='11' socket_id='0' core_id='3' siblings='11'/>
          <cpu id='12' socket_id='1' core_id='0' siblings='12'/>
          <cpu id='13' socket_id='1' core_id='1' siblings='13'/>
          <cpu id='14' socket_id='1' core_id='2' siblings='14'/>
          <cpu id='15' socket_id='1' core_id='3' siblings='15'/>
        </cpus>
      </cell>
    </cells>
  </topology>

Be aware that CPU ID numbers are not always allocated sequentially as shown
here. It is not unusual to see IDs interleaved between sockets on the two NUMA
nodes, such that ``0-3,8-11`` are on the first node and ``4-7,12-15`` are on
the second node. Carefully check the ``virsh capabilities`` output to determine
the CPU ID numbers when configuring both ``isolcpus`` and the guest ``cpuset``
values.

Guest configuration
===================

What follows is an overview of the key parts of the domain XML that need to be
configured to achieve low latency for real time workloads. The following
example assumes a 4 CPU guest requiring 16 GB of RAM, intended to be placed on
the second host NUMA node.

CPU configuration
-----------------

Real time KVM guests intended to run Linux should have a minimum of 2 CPUs.
One vCPU is for running non-real time processes and performing I/O. The other
vCPUs will run real time applications. Some non-Linux OSes may not require a
special non-real time CPU to be available, in which case the 2 CPU minimum
would not apply.

Each guest CPU, even the non-real time one, needs to be pinned to a dedicated
host core that is in the ``isolcpus`` reserved set. The QEMU emulator threads
also need to be pinned to host CPUs that are not in the ``isolcpus`` reserved
set. The vCPUs need to be given a real time CPU scheduler policy.

When configuring the
`guest CPU count <../formatdomain.html#elementsCPUAllocation>`_, do not include
any CPU affinity at this stage:

::

  <vcpu placement='static'>4</vcpu>

The guest CPUs now need to be placed individually. In this case, they will all
be put within the same host socket, such that they can be exposed as core
siblings. This is achieved using the
`CPU tuning config <../formatdomain.html#elementsCPUTuning>`_:

::

  <cputune>
    <emulatorpin cpuset="8-9"/>
    <vcpupin vcpu="0" cpuset="12"/>
    <vcpupin vcpu="1" cpuset="13"/>
    <vcpupin vcpu="2" cpuset="14"/>
    <vcpupin vcpu="3" cpuset="15"/>
    <vcpusched vcpus='0-3' scheduler='fifo' priority='1'/>
  </cputune>
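As a quick cross-check, once the guest is defined and running, the effective
pinning can be queried from the host with virsh; ``rt-guest`` below is just a
placeholder for the actual domain name:

::

  # Show the vCPU and emulator thread pinning for the (hypothetical) domain
  virsh vcpupin rt-guest
  virsh emulatorpin rt-guest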
The `guest CPU model <../formatdomain.html#elementsCPU>`_ now needs to be
configured to pass through the host model unchanged, with topology matching the
placement:

::

  <cpu mode='host-passthrough'>
    <topology sockets='1' dies='1' cores='4' threads='1'/>
    <feature policy='require' name='tsc-deadline'/>
  </cpu>

The performance monitoring unit virtualization needs to be disabled
via the `hypervisor features <../formatdomain.html#elementsFeatures>`_:

::

  <features>
    ...
    <pmu state='off'/>
  </features>


Memory configuration
--------------------

The host memory used for guest RAM needs to be allocated from huge pages on the
second NUMA node, and all other memory allocations need to be locked into RAM,
with memory page sharing disabled. This is achieved by using the
`memory backing config <../formatdomain.html#elementsMemoryBacking>`_:

::

  <memoryBacking>
    <hugepages>
      <page size="1" unit="G" nodeset="1"/>
    </hugepages>
    <locked/>
    <nosharepages/>
  </memoryBacking>


Device configuration
--------------------

Libvirt adds a few devices by default to maintain historical QEMU configuration
behaviour. It is unlikely these devices are required by real time guests, so it
is wise to disable them. Remove all USB controllers that may exist in the XML
config and replace them with:

::

  <controller type="usb" model="none"/>

Similarly the memory balloon config should be changed to:

::

  <memballoon model="none"/>

If the guest had a graphical console configured at installation time, this can
also be disabled, with remote access instead being provided over SSH and a
minimal serial console kept for emergencies.
-- 
2.26.2