On Mon, Jun 27, 2022 at 12:44:39PM +0200, Michal Privoznik wrote:
> Ideally, we would just pick the best default and users wouldn't
> have to intervene at all. But in some cases it may be handy to
> not bother with SCHED_CORE at all or place helper processes into
> the same group as QEMU. Introduce a knob in qemu.conf to allow
> users control this behaviour.
>
> Signed-off-by: Michal Privoznik <mprivozn@xxxxxxxxxx>
> ---
>  src/qemu/libvirtd_qemu.aug         |  1 +
>  src/qemu/qemu.conf.in              | 14 ++++++++++
>  src/qemu/qemu_conf.c               | 42 ++++++++++++++++++++++++++++++
>  src/qemu/qemu_conf.h               | 11 ++++++++
>  src/qemu/test_libvirtd_qemu.aug.in |  1 +
>  5 files changed, 69 insertions(+)
>
> diff --git a/src/qemu/libvirtd_qemu.aug b/src/qemu/libvirtd_qemu.aug
> index 0f18775121..ed097ea3d9 100644
> --- a/src/qemu/libvirtd_qemu.aug
> +++ b/src/qemu/libvirtd_qemu.aug
> @@ -110,6 +110,7 @@ module Libvirtd_qemu =
>                    | bool_entry "dump_guest_core"
>                    | str_entry "stdio_handler"
>                    | int_entry "max_threads_per_process"
> +                  | str_entry "sched_core"
>
>  let device_entry = bool_entry "mac_filter"
>                    | bool_entry "relaxed_acs_check"
> diff --git a/src/qemu/qemu.conf.in b/src/qemu/qemu.conf.in
> index 04b7740136..01c7ab5868 100644
> --- a/src/qemu/qemu.conf.in
> +++ b/src/qemu/qemu.conf.in
> @@ -952,3 +952,17 @@
>  # DO NOT use in production.
>  #
>  #deprecation_behavior = "none"
> +
> +# If this is set then QEMU and its threads will run in a separate scheduling
> +# group meaning no other process will share Hyper Threads of a single core with
> +# QEMU. Each QEMU has its own group.
> +#
> +# Possible options are:
> +# "none" - nor QEMU nor any of its helper processes are placed into separate
> +#          scheduling group
> +# "emulator" - (default) only QEMU and its threads (emulator + vCPUs) are
> +#              placed into separate scheduling group, helper proccesses remain
> +#              outside of the group.
> +# "full" - both QEMU and its helper processes are placed into separate
> +#          scheduling group.
> +#sched_core = "emulator"

Talking to the OpenStack Nova maintainers I'm remembering that life is
somewhat more complicated than we have taken into account. Nova has a
variety of tunables along semi-independent axes which can be combined:

* CPU policy: shared vs dedicated

  This is the big one, with overcommit being the out of the box
  default. In both cases they apply CPU pinning to the QEMU VM.

  In the case of shared, they pin to allow the VM to float freely over
  all host CPUs, except for a small subset reserved for the host OS.
  This explicitly overcommits host resources.

  In the case of dedicated, they pin to give each vCPU a corresponding
  unique pCPU. There is broadly no overcommit of vCPUs; non-vCPU
  threads may still overcommit/compete.

* SMT policy: prefer vs isolate vs require

  For 'prefer', it'll preferentially pick a host with SMT and give the
  VM SMT siblings, but will fall back to non-SMT hosts if not possible.

  For 'isolate', it'll keep all-but-1 SMT sibling empty of vCPUs at all
  times.

  For 'require', it'll mandate a host with SMT and give the VM SMT
  siblings.

* Emulator policy: float vs isolate vs shared

  For 'float', the emulator threads will float across the pCPUs
  assigned to the same guest's vCPUs.

  For 'isolate', the emulator threads will be pinned to pCPU(s)
  separate from the vCPUs. These pCPUs can be chosen in two different
  ways though:

    - Each VM is strictly given its own pCPU just for its own emulator
      threads. Typically used with RealTime.

    - Each VM is given pCPU(s) for its emulator threads that can be
      shared with other VMs. Typically used with non-RealTime.
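(For reference, those axes surface to users as Nova flavour extra
specs. I'm quoting the names from memory, so the exact spellings may be
slightly off:

    hw:cpu_policy              = shared | dedicated
    hw:cpu_thread_policy       = prefer | isolate | require
    hw:emulator_threads_policy = share  | isolate

with the 'float' emulator behaviour being, IIRC, what you get when that
last extra spec is left unset.)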
In terms of core scheduling usage:

- For the default shared model, where all VM CPUs float and overcommit,
  enabling core scheduling decreases the capacity of a host. The
  biggest impact is if there are many guests with odd CPU counts, OR
  many guests with even CPU counts but with only 1 runnable CPU at a
  time.

- When the emulator threads policy is 'isolate', our core scheduling
  setup could massively conflict with Nova's emulator placement, e.g.
  Nova could have given SMT siblings to two different guests for their
  respective emulator threads. This is not as serious a security risk
  as sharing SMT siblings with vCPUs, as emulator thread code is
  trustworthy unless QEMU has been exploited.

  The net result is that even if 2 VMs have their vCPUs runnable and
  the host has pCPUs available to run them, one VM can be stalled by
  its emulator thread pinning having an SMT core scheduling conflict
  with the other VM's emulator thread pinning.

Nova can also mix-and-match the above policies between VMs on the same
host.

Finally, the above is all largely focused on VM placement. One thing to
bear in mind is that even if VMs are isolated from SMT siblings, Nova
deployments can still allow host OS processes to co-exist with vCPU
threads on SMT siblings. Our core scheduling impl would prevent that.

The goal for core scheduling is to make the free-floating overcommit
scenario as safe as dedicated CPU pinning, while retaining the
flexibility of dynamic placement. Core scheduling is redundant if the
mgmt app has given dedicated CPU pinning to all vCPUs and all other
threads.

At the libvirt side though, we don't know whether Nova is doing
overcommit or dedicated CPUs. CPU pinning masks will be given in both
cases; the only difference is that in the dedicated case the mask is
highly likely to list only 1 pCPU bit. I don't think we want to try to
infer intent by looking at CPU masks though.

What this means is that if we apply core scheduling by default, it
needs to be compatible with the combination of all the above options.
On reflection, I don't think we can achieve that with high enough
confidence.

There's a decent case to be made that libvirt's core scheduling would
be good for the overcommit case out of the box, even though it has a
capacity impact, but the ability of host OS threads to co-exist with
vCPU threads would be broken. It is clear that applying core scheduling
on top of Nova's dedicated CPU pinning policies can be massively
harmful wrt the emulator threads policies too.

So what I'm thinking is that our set of three options here is not
sufficient; we need more:

  "none" - neither QEMU nor any of its helper processes are placed
           into a separate scheduling group

  "vcpus" - only QEMU vCPU threads are placed into a separate
            scheduling group; emulator threads and helper processes
            remain outside of the group

  "emulator" - only QEMU and its threads (emulator + vCPUs) are placed
               into a separate scheduling group; helper processes
               remain outside of the group

  "full" - both QEMU and its helper processes are placed into a
           separate scheduling group

I don't think any of the three core scheduling options is safe enough
to use by default though. They all have a decent chance of causing
regressions for Nova, even though they'll improve security. So
reluctantly I think we need to default to "none" and require opt-in.

Given Nova's mix/match of VM placement settings, I also think that a
per-VM XML knob is more likely to be necessary than we originally
believed. At the very least being able to switch between 'vcpus' and
'emulator' modes feels reasonably important.
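To make the 'vcpus' vs 'emulator' distinction concrete, here's a rough
sketch of how a vcpus-only group could be assembled with the
PR_SCHED_CORE prctl() interface (Linux >= 5.14). This is purely
illustrative, not what the patches do; the helper name and the
thread-ID array are made up for the example:

  #include <stddef.h>
  #include <sys/prctl.h>
  #include <sys/types.h>

  #ifndef PR_SCHED_CORE
  # define PR_SCHED_CORE              62
  # define PR_SCHED_CORE_CREATE       1
  # define PR_SCHED_CORE_SHARE_TO     2
  # define PR_SCHED_CORE_SHARE_FROM   3
  # define PR_SCHED_CORE_SCOPE_THREAD 0
  #endif

  /* Hypothetical helper: place only the given vCPU threads into one
   * core scheduling group, leaving emulator/helper threads alone. */
  static int
  schedCoreGroupVcpus(const pid_t *vcpuTids, size_t nVcpuTids)
  {
      size_t i;

      if (nVcpuTids == 0)
          return 0;

      /* Give the first vCPU thread a fresh core scheduling cookie */
      if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE,
                vcpuTids[0], PR_SCHED_CORE_SCOPE_THREAD, 0) < 0)
          return -1;

      /* Copy that cookie into the calling thread so we can push it on */
      if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM,
                vcpuTids[0], PR_SCHED_CORE_SCOPE_THREAD, 0) < 0)
          return -1;

      /* Push the cookie to the remaining vCPU threads only; emulator
       * and helper threads keep the default (zero) cookie and can
       * still share SMT siblings with anything else on the host. */
      for (i = 1; i < nVcpuTids; i++) {
          if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO,
                    vcpuTids[i], PR_SCHED_CORE_SCOPE_THREAD, 0) < 0)
              return -1;
      }

      return 0;
  }

One wrinkle with doing it this way is that the manager thread ends up
holding the cookie itself after SHARE_FROM, so presumably it would need
to be done from a short-lived helper process rather than a long-lived
daemon thread.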
Of course Nova is just one mgmt app, but we can assume that there exist
other apps that will credibly have the same requirements and thus risks
of regressions.

With regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|