On Mon, Jun 27, 2022 at 12:44:39PM +0200, Michal Privoznik wrote:
> Ideally, we would just pick the best default and users wouldn't
> have to intervene at all. But in some cases it may be handy to
> not bother with SCHED_CORE at all or place helper processes into
> the same group as QEMU. Introduce a knob in qemu.conf to allow
> users control this behaviour.
>
> Signed-off-by: Michal Privoznik <mprivozn@xxxxxxxxxx>
> ---
>  src/qemu/libvirtd_qemu.aug         |  1 +
>  src/qemu/qemu.conf.in              | 14 ++++++++++
>  src/qemu/qemu_conf.c               | 42 ++++++++++++++++++++++++++++++
>  src/qemu/qemu_conf.h               | 11 ++++++++
>  src/qemu/test_libvirtd_qemu.aug.in |  1 +
>  5 files changed, 69 insertions(+)
>
> diff --git a/src/qemu/libvirtd_qemu.aug b/src/qemu/libvirtd_qemu.aug
> index 0f18775121..ed097ea3d9 100644
> --- a/src/qemu/libvirtd_qemu.aug
> +++ b/src/qemu/libvirtd_qemu.aug
> @@ -110,6 +110,7 @@ module Libvirtd_qemu =
>                    | bool_entry "dump_guest_core"
>                    | str_entry "stdio_handler"
>                    | int_entry "max_threads_per_process"
> +                  | str_entry "sched_core"
>
>  let device_entry = bool_entry "mac_filter"
>                    | bool_entry "relaxed_acs_check"
> diff --git a/src/qemu/qemu.conf.in b/src/qemu/qemu.conf.in
> index 04b7740136..01c7ab5868 100644
> --- a/src/qemu/qemu.conf.in
> +++ b/src/qemu/qemu.conf.in
> @@ -952,3 +952,17 @@
>  # DO NOT use in production.
>  #
>  #deprecation_behavior = "none"
> +
> +# If this is set then QEMU and its threads will run in a separate scheduling
> +# group meaning no other process will share Hyper Threads of a single core with
> +# QEMU. Each QEMU has its own group.
> +#
> +# Possible options are:
> +# "none" - nor QEMU nor any of its helper processes are placed into separate
> +#          scheduling group
> +# "emulator" - (default) only QEMU and its threads (emulator + vCPUs) are
> +#              placed into separate scheduling group, helper proccesses remain
> +#              outside of the group.
> +# "full" - both QEMU and its helper processes are placed into separate
> +#          scheduling group.
> +#sched_core = "emulator"

Talking to the OpenStack Nova maintainers I'm remembering that life is
somewhat more complicated than we have taken into account. Nova has a
variety of tunables along semi-independent axes which can be combined:

* CPU policy: shared vs dedicated

  This is the big one, with overcommit being the out of the box
  default. In both cases they apply CPU pinning to the QEMU VM.

  In the case of shared, they pin to allow the VM to float freely over
  all host CPUs, except for a small subset reserved for the host OS.
  This explicitly overcommits host resources.

  In the case of dedicated, they pin to give each vCPU a corresponding
  unique pCPU. There is broadly no overcommit of vCPUs; non-vCPU
  threads may still overcommit/compete.

* SMT policy: prefer vs isolate vs require

  For 'prefer', it'll preferentially pick a host with SMT and give the
  VM SMT siblings, but will fall back to non-SMT hosts if not possible.

  For 'isolate', it'll keep all-but-1 SMT sibling empty of vCPUs at all
  times.

  For 'require', it'll mandate a host with SMT and give the VM SMT
  siblings.

* Emulator policy: float vs isolate vs shared

  For 'float', the emulator threads will float across the pCPUs
  assigned to the same guest's vCPUs.

  For 'isolate', the emulator threads will be pinned to pCPU(s)
  separate from the vCPUs. These pCPUs can be chosen in two different
  ways though:

    - Each VM is strictly given its own pCPU just for its own emulator
      threads. Typically used with RealTime.

    - Each VM is given pCPU(s) for its emulator threads that can be
      shared with other VMs. Typically used with non-RealTime.
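(For reference, those axes surface to users as Nova flavour extra
specs. I'm quoting the names from memory, so the exact spellings may be
slightly off:

    hw:cpu_policy              = shared | dedicated
    hw:cpu_thread_policy       = prefer | isolate | require
    hw:emulator_threads_policy = share  | isolate

with the 'float' emulator behaviour being, IIRC, what you get when that
last extra spec is left unset.)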
In terms of core scheduling usage:

- For the default shared model, where all VM CPUs float and overcommit,
  enabling core scheduling decreases the capacity of a host. The
  biggest impact is if there are many guests with odd CPU counts, OR
  many guests with even CPU counts but with only 1 runnable CPU at a
  time.

- When the emulator threads policy is 'isolate', our core scheduling
  setup could massively conflict with Nova's emulator placement, e.g.
  Nova could have given SMT siblings to two different guests for their
  respective emulator threads. This is not as serious a security risk
  as sharing SMT siblings with vCPUs, as emulator thread code is
  trustworthy unless QEMU has been exploited.

  The net result is that even if 2 VMs have their vCPUs runnable and
  the host has pCPUs available to run them, one VM can be stalled by
  its emulator thread pinning having an SMT core scheduling conflict
  with the other VM's emulator thread pinning.

Nova can also mix-and-match the above policies between VMs on the same
host.

Finally, the above is all largely focused on VM placement. One thing to
bear in mind is that even if VMs are isolated from SMT siblings, Nova
deployments can still allow host OS processes to co-exist with vCPU
threads on SMT siblings. Our core scheduling impl would prevent that.

The goal for core scheduling is to make the free-floating overcommit
scenario as safe as dedicated CPU pinning, while retaining the
flexibility of dynamic placement. Core scheduling is redundant if the
mgmt app has given dedicated CPU pinning to all vCPUs and all other
threads.

At the libvirt side though, we don't know whether Nova is doing
overcommit or dedicated CPUs. CPU pinning masks will be given in both
cases; the only difference is that in the dedicated case the mask is
highly likely to list only 1 pCPU bit. I don't think we want to try to
infer intent by looking at CPU masks though.

What this means is that if we apply core scheduling by default, it
needs to be compatible with the combination of all the above options.
On reflection, I don't think we can achieve that with high enough
confidence.

There's a decent case to be made that libvirt's core scheduling would
be good for the overcommit case out of the box, even though it has a
capacity impact, but the ability of host OS threads to co-exist with
vCPU threads would be broken. It is clear that applying core scheduling
on top of Nova's dedicated CPU pinning policies can be massively
harmful wrt the emulator threads policies too.

So what I'm thinking is that our set of three options here is not
sufficient; we need more:

  "none" - neither QEMU nor any of its helper processes are placed
           into a separate scheduling group

  "vcpus" - only QEMU vCPU threads are placed into a separate
            scheduling group; emulator threads and helper processes
            remain outside of the group

  "emulator" - only QEMU and its threads (emulator + vCPUs) are placed
               into a separate scheduling group; helper processes
               remain outside of the group

  "full" - both QEMU and its helper processes are placed into a
           separate scheduling group

I don't think any of the three core scheduling options is safe enough
to use by default though. They all have a decent chance of causing
regressions for Nova, even though they'll improve security. So
reluctantly I think we need to default to "none" and require opt-in.

Given Nova's mix/match of VM placement settings, I also think that a
per-VM XML knob is more likely to be necessary than we originally
believed. At the very least being able to switch between 'vcpus' and
'emulator' modes feels reasonably important.
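To make the 'vcpus' vs 'emulator' distinction concrete, here's a rough
sketch of how a vcpus-only group could be assembled with the
PR_SCHED_CORE prctl() interface (Linux >= 5.14). This is purely
illustrative, not what the patches do; the helper name and the
thread-ID array are made up for the example:

  #include <stddef.h>
  #include <sys/prctl.h>
  #include <sys/types.h>

  #ifndef PR_SCHED_CORE
  # define PR_SCHED_CORE              62
  # define PR_SCHED_CORE_CREATE       1
  # define PR_SCHED_CORE_SHARE_TO     2
  # define PR_SCHED_CORE_SHARE_FROM   3
  # define PR_SCHED_CORE_SCOPE_THREAD 0
  #endif

  /* Hypothetical helper: place only the given vCPU threads into one
   * core scheduling group, leaving emulator/helper threads alone. */
  static int
  schedCoreGroupVcpus(const pid_t *vcpuTids, size_t nVcpuTids)
  {
      size_t i;

      if (nVcpuTids == 0)
          return 0;

      /* Give the first vCPU thread a fresh core scheduling cookie */
      if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE,
                vcpuTids[0], PR_SCHED_CORE_SCOPE_THREAD, 0) < 0)
          return -1;

      /* Copy that cookie into the calling thread so we can push it on */
      if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM,
                vcpuTids[0], PR_SCHED_CORE_SCOPE_THREAD, 0) < 0)
          return -1;

      /* Push the cookie to the remaining vCPU threads only; emulator
       * and helper threads keep the default (zero) cookie and can
       * still share SMT siblings with anything else on the host. */
      for (i = 1; i < nVcpuTids; i++) {
          if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO,
                    vcpuTids[i], PR_SCHED_CORE_SCOPE_THREAD, 0) < 0)
              return -1;
      }

      return 0;
  }

One wrinkle with doing it this way is that the manager thread ends up
holding the cookie itself after SHARE_FROM, so presumably it would need
to be done from a short-lived helper process rather than a long-lived
daemon thread.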
Of course Nova is just one mgmt app, but we can assume that there exist
other apps that will credibly have the same requirements and thus risks
of regressions.

With regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|