On Mon, Jun 01, 2020 at 12:44:17 +0100, Daniel P. Berrangé wrote:
> There are many different settings that required to config a KVM guest
> for real time, low latency workoads. The documentation included here is
> based on guidance developed & tested by the Red Hat KVM real time team.
>
> Signed-off-by: Daniel P. Berrangé <berrange@xxxxxxxxxx>
> ---
>  docs/kbase.html.in          |   3 +
>  docs/kbase/kvm-realtime.rst | 213 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 216 insertions(+)
>  create mode 100644 docs/kbase/kvm-realtime.rst
>
> diff --git a/docs/kbase.html.in b/docs/kbase.html.in
> index c586e0f676..e663ca525f 100644
> --- a/docs/kbase.html.in
> +++ b/docs/kbase.html.in
> @@ -36,6 +36,9 @@
>
>          <dt><a href="kbase/virtiofs.html">Virtio-FS</a></dt>
>          <dd>Share a filesystem between the guest and the host</dd>
> +
> +        <dt><a href="kbase/kvm-realtime.html">KVM real time</a></dt>
> +        <dd>Run real time workloads in guests on a KVM hypervisor</dd>
>        </dl>
>      </div>
>
> diff --git a/docs/kbase/kvm-realtime.rst b/docs/kbase/kvm-realtime.rst
> new file mode 100644
> index 0000000000..ac6102879b
> --- /dev/null
> +++ b/docs/kbase/kvm-realtime.rst
> @@ -0,0 +1,213 @@
> +==========================
> +KVM Real Time Guest Config
> +==========================
> +
> +.. contents::
> +
> +The KVM hypervisor is capable of running real time guest workloads. This page
> +describes the key pieces of configuration required in the domain XML to achieve
> +the low latency needs of real time workloads.
> +
> +For the most part, configuration of the host OS is out of scope of this
> +documentation. Refer to the operating system vendor's guidance on configuring
> +the host OS and hardware for real time. Note in particular that the default
> +kernel used by most Linux distros is not suitable for low latency real time and
> +must be replaced by an special kernel build.

s/an/a/

> +
> +
> +Host partitioning plan
> +======================
> +
> +Running real time workloads requires carefully partitioning up the host OS
> +resources, such that the KVM / QEMU processes are strictly separated from any
> +other workload running on the host, both userspace processes and kernel threads.
> +
> +As such, some subset of host CPUs need to be reserved exclusively for running
> +KVM guests. This requires that the host kernel be booted using the ``isolcpus``
> +kernel command line parameter. This parameter removes a set of CPUs from the
> +schedular, such that that no kernel threads or userspace processes will ever get

s/schedular/scheduler/ (run the replacement for the whole file as there
are more instances of this type that I won't explicitly point to)

> +placed on those CPUs automatically. KVM guests are then manually placed onto
> +these CPUs.
> +
> +Deciding which host CPUs to reserve for real time requires understanding of the
> +guest workload needs and balancing with the host OS needs. The trade off will
> +also vary based on the physical hardware available.
> +
> +For the sake of illustration, this guide will assume a physical machine with two
> +NUMA nodes, each with 2 sockets and 4 cores, giving a total of 16 CPUs on the

"... and 4 cores in each socket" perhaps? The way it is written now
suggests there are 4 cores in total in each NUMA node split into two
sockets, that is 2 cores per socket. It's certainly possible my way of
thinking is incorrect, but being explicit cannot hurt :-)

> +host. Furthermore, it is assumed that hyperthreading is either not supported or
> +has been disabled in the BIOS, since it is incompatible with real time. Each
> +NUMA node is assumed to have 32 GB of RAM, giving 64 GB total for the host.
> +
> +It is assumed that 2 CPUs in each NUMA node are reserved for the host OS, with
> +the remaining 6 CPUs available for KVM real time. With this in mind, the host
> +kernel should have booted with ``isolcpus=2-7,10,15`` to reserve CPUs.

s/10,15/10-15/

> +
> +To maximise efficiency of page table lookups for the guest, the host needs to be
> +configured with most RAM exposed as huge pages, ideally 1 GB sized. 6 GB of RAM
> +in each NUMA node will be reserved for general host OS usage as normal sized
> +pages, leaving 26 GB for KVM usage as huge pages.
> +
> +Once huge pages are reserved on the hypothetical machine, the ``virsh
> +capabilities`` command output is expected to look approximately like:
> +
> +::
> +
> +   <topology>
> +     <cells num='2'>
> +       <cell id='0'>
> +         <memory unit='KiB'>33554432</memory>
> +         <pages unit='KiB' size='4'>1572864</pages>
> +         <pages unit='KiB' size='2048'>0</pages>
> +         <pages unit='KiB' size='1048576'>26</pages>
> +         <distances>
> +           <sibling id='0' value='10'/>
> +           <sibling id='1' value='21'/>
> +         </distances>
> +         <cpus num='8'>
> +           <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
> +           <cpu id='1' socket_id='0' core_id='1' siblings='1'/>
> +           <cpu id='2' socket_id='0' core_id='2' siblings='2'/>
> +           <cpu id='3' socket_id='0' core_id='3' siblings='3'/>
> +           <cpu id='4' socket_id='1' core_id='0' siblings='4'/>
> +           <cpu id='5' socket_id='1' core_id='1' siblings='5'/>
> +           <cpu id='6' socket_id='1' core_id='2' siblings='6'/>
> +           <cpu id='7' socket_id='1' core_id='3' siblings='7'/>
> +         </cpus>
> +       </cell>
> +       <cell id='1'>
> +         <memory unit='KiB'>33554432</memory>
> +         <pages unit='KiB' size='4'>1572864</pages>
> +         <pages unit='KiB' size='2048'>0</pages>
> +         <pages unit='KiB' size='1048576'>26</pages>
> +         <distances>
> +           <sibling id='0' value='21'/>
> +           <sibling id='1' value='10'/>
> +         </distances>
> +         <cpus num='8'>
> +           <cpu id='8' socket_id='0' core_id='0' siblings='8'/>
> +           <cpu id='9' socket_id='0' core_id='1' siblings='9'/>
> +           <cpu id='10' socket_id='0' core_id='2' siblings='10'/>
> +           <cpu id='11' socket_id='0' core_id='3' siblings='11'/>
> +           <cpu id='12' socket_id='1' core_id='0' siblings='12'/>
> +           <cpu id='13' socket_id='1' core_id='1' siblings='13'/>
> +           <cpu id='14' socket_id='1' core_id='2' siblings='14'/>
> +           <cpu id='15' socket_id='1' core_id='3' siblings='15'/>
> +         </cpus>
> +       </cell>
> +     </cells>
> +   </topology>
> +
> +Be aware that CPU ID numbers are not always allocated sequentially as shown
> +here. It is not unusual to see IDs interleaved between sockets on the two NUMA
> +nodes, such that ``0-3,8-11`` are be on the first node and ``4-7,12-15`` are on

s/are be /are /

> +the second node. Carefully check the ``virsh capabilities`` output to determine
> +the CPU ID numbers when configiring both ``isolcpus`` and the guest ``cpuset``
> +values.
> +
> +Guest configuration
> +===================
> +
> +What follows is an overview of the key parts of the domain XML that need to be
> +configured to achieve low latency for real time workflows. The following example
> +will assume a 4 CPU guest, requiring 16 GB of RAM. It is intended to be placed
> +on the second host NUMA node.
> +
> +CPU configuration
> +-----------------
> +
> +Real time KVM guests intended to run Linux should have a minimum of 2 CPUs.
> +One vCPU is for running non-real time processes and performing I/O. The other
> +vCPUs will run real time applications. Some non-Linux OS may not require a
> +special non-real time CPU to be available, in which case the 2 CPU minimum would
> +not apply.
> +
> +Each guest CPU, even the non-real time one, needs to be pinned to a dedicated
> +host core that is in the `isolcpus` reserved set. The QEMU emulator threads
> +also need to be pinned to host CPUs that are not in the `isolcpus` reserved set.

Seeing "also" I'd expect even emulator threads need to be pinned to CPUs
in the isolcpus set. If this is not the case and the "that are not in the
isolcpus reserved set" part is correct, I suggest removing "also" or
emphasizing the different sets of host CPUs. For example: "also need to
be pinned to host CPUs, but ..."

> +The vCPUs need to be given a real time CPU schedular policy.
> +
> +When configuring the `guest CPU count <../formatdomain.html#elementsCPUAllocation>`_,
> +do not include any CPU affinity are this stage:

s/are/at/

> +
> +::
> +
> +   <vcpu placement='static'>4</vcpu>
> +
> +The guest CPUs now need to be placed individually. In this case, they will all
> +be put within the same host socket, such that they can be exposed as core
> +siblings. This is achieved using the `CPU tunning config <../formatdomain.html#elementsCPUTuning>`_:
> +
> +::
> +
> +   <cputune>
> +     <emulatorpin cpuset="8-9"/>
> +     <vcpupin vcpu="0" cpuset="12"/>
> +     <vcpupin vcpu="1" cpuset="13"/>
> +     <vcpupin vcpu="2" cpuset="14"/>
> +     <vcpupin vcpu="3" cpuset="15"/>
> +     <vcpusched vcpus='0-4' scheduler='fifo' priority='1'/>
> +   </cputune>
> +
> +The `guest CPU model <formatdomain.html#elementsCPU>`_ now needs to be
> +configured to pass through the host model unchanged, with topology matching the
> +placement:
> +
> +::
> +
> +   <cpu mode='host-passthrough'>
> +     <topology sockets='1' dies='1' cores='4' threads='1'/>
> +     <feature policy='require' name='tsc-deadline'/>
> +   </cpu>
> +
> +The performance monitoring unit virtualization needs to be disabled
> +via the `hypervisor features <../formatdomain.html#elementsFeatures>`_:
> +
> +::
> +
> +   <features>
> +     ...
> +     <pmu state='off'/>
> +   </features>
> +
> +
> +Memory configuration
> +--------------------
> +
> +The host memory used for guest RAM needs to be allocated from huge pages on the
> +second NUMA node, and all other memory allocation needs to be locked into RAM
> +with memory page sharing disabled.
> +This is achieved by using the `memory backing config <formatdomain.html#elementsMemoryBacking>`_:
> +
> +::
> +
> +   <memoryBacking>
> +     <hugepages>
> +       <page size="1" unit="G" nodeset="1"/>
> +     </hugepages>
> +     <locked/>
> +     <nosharepages/>
> +   </memoryBacking>
> +
> +
> +Device configuration
> +--------------------
> +
> +Libvirt adds a few devices by default to maintain historical QEMU configuration
> +behaviour. It is unlikely these devices are required by real time guests, so it
> +is wise to disable them. Remove all USB controllers that may exist in the XML
> +config and replace them with:
> +
> +::
> +
> +   <controller type="usb" model="none"/>
> +
> +Similarly the memory balloon config should be changed to
> +
> +::
> +
> +   <memballoon model="none"/>
> +
> +If the guest had a graphical console at installation time this can also be
> +disabled, with remote access being over SSH, with a minimal serial console
> +for emergencies.

With the above minor issues addressed:

Reviewed-by: Jiri Denemark <jdenemar@xxxxxxxxxx>