On 5/24/22 12:33, Daniel P. Berrangé wrote:
> On Tue, May 24, 2022 at 11:50:50AM +0200, Michal Prívozník wrote:
>> On 5/23/22 18:30, Daniel P. Berrangé wrote:
>>> On Mon, May 09, 2022 at 05:02:17PM +0200, Michal Privoznik wrote:
>>>> Since the level of trust that QEMU has is the same level of trust
>>>> that helper processes have, there's no harm in placing all of them
>>>> into the same group.
>>>
>>> This assumption feels like it might be a bit of a stretch. I
>>> recall discussing this with Paolo to some extent a long time
>>> back, but let me recap my understanding.
>>>
>>> IIUC, the attack scenario is that a guest vCPU thread is scheduled
>>> on an SMT sibling with another thread that is NOT running guest OS
>>> code. "Another thread" in this context refers to many things:
>>>
>>>  - Random host OS processes
>>>  - QEMU vCPU threads from a different guest
>>>  - QEMU emulator threads from any guest
>>>  - QEMU helper process threads from any guest
>>>
>>> Consider, for example, that the QEMU emulator thread contains a password
>>> used for logging into a remote RBD/Ceph server. That is a secret
>>> credential that the guest OS should not have permission to access.
>>>
>>> Consider alternatively that the QEMU emulator is making a TLS connection
>>> to some service, and there are keys negotiated for the TLS session. While
>>> some of the data transmitted over the session is known to the guest OS,
>>> we shouldn't assume it all is.
>>>
>>> Now, in the case of QEMU emulator threads I think you can make a somewhat
>>> decent case that we don't have to worry about it. Most of the keys/passwds
>>> are used once at cold boot, so there's no attack window for vCPUs at that
>>> point. There is a small window of risk when hotplugging. If someone is
>>> really concerned about this though, they shouldn't have let QEMU have
>>> these credentials in the first place, as it is already vulnerable to a
>>> guest escape; eg use kernel RBD instead of letting QEMU directly log in
>>> to RBD.
>>>
>>> IOW, on balance of probabilities it is reasonable to let QEMU emulator
>>> threads be in the same core scheduling domain as vCPU threads.
>>>
>>> In the case of external QEMU helper processes, though, I think it is
>>> a far less clear-cut decision. There are a number of reasons why helper
>>> processes are used, but at least one significant motivating factor is
>>> security isolation between QEMU & the helper - they can only communicate
>>> and share information through certain controlled mechanisms.
>>>
>>> With this in mind I think it is risky to assume that it is safe to
>>> run QEMU and helper processes in the same core scheduling group. At
>>> the same time there are likely cases where it is also just fine to
>>> do so.
>>>
>>> If we separate helper processes from QEMU vCPUs this is not as wasteful
>>> as it sounds. Since the helper processes are running trusted code, there
>>> is no need for helper processes from different guests to be isolated.
>>> They can all just live in the default core scheduling domain.
>>>
>>> I feel like I'm talking myself into suggesting the core scheduling
>>> host knob in qemu.conf needs to be more than just a single boolean.
>>> Either have two knobs - one to turn it on/off and one to control
>>> whether helpers are split or combined - or have one knob and make
>>> it an enumeration.
>>
>> Seems reasonable. And the default should be QEMU's emulator + vCPU
>> threads in one sched group, and all helper processes in another, right?
>
> Not quite. I'm suggesting that helper processes can remain in the
> host's default core scheduling group, since the helpers are all
> executing trusted machine code.
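(Aside, for anyone joining the thread here: the "core scheduling group/domain"
we keep referring to is the kernel's core scheduling cookie, available since
Linux 5.14 via prctl(PR_SCHED_CORE, ...). Roughly, and purely as an
illustration - this is not libvirt code and helper_pid is made up - the
mechanism looks like this:

  #include <stdio.h>
  #include <sys/prctl.h>   /* on recent glibc this pulls in linux/prctl.h */
  #include <sys/types.h>
  #include <unistd.h>

  /* Fallbacks for older userspace headers; values match the kernel uapi. */
  #ifndef PR_SCHED_CORE
  # define PR_SCHED_CORE 62
  # define PR_SCHED_CORE_CREATE 1
  # define PR_SCHED_CORE_SHARE_TO 2
  #endif
  #ifndef PR_SCHED_CORE_SCOPE_THREAD_GROUP
  # define PR_SCHED_CORE_SCOPE_THREAD_GROUP 1
  #endif

  int main(void)
  {
      /* Give the calling thread group (think: QEMU) its own cookie.
       * Only tasks carrying the same cookie may share SMT siblings. */
      if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0,
                PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0) < 0) {
          perror("PR_SCHED_CORE_CREATE");
          return 1;
      }

      /* Pushing our cookie onto another process (e.g. a helper) would
       * put it into the same group; helper_pid is purely illustrative. */
      pid_t helper_pid = 12345;
      if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, helper_pid,
                PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0) < 0)
          perror("PR_SCHED_CORE_SHARE_TO");

      return 0;
  }

Tasks sharing a cookie may run concurrently on SMT siblings of one core; a
task with a different cookie, or one left in the untagged default group, is
never co-scheduled onto a sibling with them - which is exactly the
concurrency trade-off discussed below.)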
>
>>> One possible complication comes if we consider a guest that is
>>> pinned, but not on the fine-grained per-vCPU basis.
>>>
>>> eg if the guest is set to allow floating over a subset of host CPUs
>>> we need to make sure that it is possible to actually execute the
>>> guest still. ie if the entire guest is pinned to 1 host CPU but our
>>> config implies use of 2 distinct core scheduling domains, we have
>>> an unsolvable constraint.
>>
>> Do we? Since we're placing emulator + vCPUs into one group and helper
>> processes into another, these would never run at the same time, but that
>> would be the case anyway - if the emulator write()-s into a helper's socket
>> it would be blocked because the helper isn't running. This "bottleneck"
>> is a result of pinning everything onto a single CPU and exists regardless
>> of scheduling groups.
>>
>> The only case where scheduling groups would make the bottleneck worse is
>> if the emulator and vCPUs were in different groups, but we don't intend to
>> allow that.
>
> Do we actually pin the helper processes at all?

Yes, we do. Into the same cgroup as the emulator thread:
qemuSetupCgroupForExtDevices().

>
> I was thinking of a scenario where we implicitly pin helper processes to
> the same CPUs as the emulator threads and/or the QEMU process-global pinning
> mask. eg
>
> If we only had
>
>   <vcpu placement='static' cpuset="2-3" current="1">2</vcpu>
>
> Traditionally the emulator threads, I/O threads and vCPU threads will
> all float across host CPUs 2 & 3. I was assuming we also placed
> helper processes on these same 2 host CPUs. Not sure if that's right
> or not. Assuming we do, then...
>
> Let's say CPUs 2 & 3 are SMT siblings.
>
> We have helper processes in the default core scheduling
> domain and QEMU in a dedicated core scheduling domain. We
> lose 100% of concurrency between the vCPUs and helper
> processes.

So in this case users might want to have the helpers and the emulator in the
same group. Therefore, in qemu.conf we should allow something like:

  sched_core = "none"      // off, no SCHED_CORE
               "emulator"  // default, place only emulator & vCPU threads
                           // into the group
               "helpers"   // place emulator & vCPUs & helpers into the
                           // group

I agree that "helpers" is a terrible name; maybe "emulator+helpers"? Or
something completely different? Maybe:

  sched_core = []                      // off
               ["emulator"]            // emulator & vCPU threads
               ["emulator","helpers"]  // emulator + helpers

We can refine "helpers" in the future (if needed) to say "virtiofsd", "dbus",
"swtpm", allowing users to fine-tune which helper processes are part of the
group.

Michal
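PS: To make the enumeration above a bit more concrete, here is a rough,
purely illustrative sketch of how the three values could map onto the
prctl() interface shown earlier. It is not actual libvirt code; the enum
and function names are made up:

  #include <sys/prctl.h>
  #include <sys/types.h>
  #include <stddef.h>

  /* Same fallback defines as in the earlier sketch, plus SHARE_FROM. */
  #ifndef PR_SCHED_CORE
  # define PR_SCHED_CORE 62
  # define PR_SCHED_CORE_CREATE 1
  # define PR_SCHED_CORE_SHARE_TO 2
  # define PR_SCHED_CORE_SHARE_FROM 3
  #endif
  #ifndef PR_SCHED_CORE_SCOPE_THREAD_GROUP
  # define PR_SCHED_CORE_SCOPE_THREAD_GROUP 1
  #endif

  typedef enum {
      QEMU_SCHED_CORE_NONE = 0,  /* "none": no SCHED_CORE at all           */
      QEMU_SCHED_CORE_EMULATOR,  /* "emulator": emulator + vCPU threads    */
      QEMU_SCHED_CORE_HELPERS,   /* "helpers": emulator + vCPUs + helpers  */
  } qemuSchedCore;

  static int
  qemuSchedCoreApply(qemuSchedCore mode, pid_t qemuPid,
                     const pid_t *helperPids, size_t nhelpers)
  {
      size_t i;

      if (mode == QEMU_SCHED_CORE_NONE)
          return 0;

      /* New cookie covering the whole QEMU thread group
       * (emulator, I/O and vCPU threads). */
      if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, qemuPid,
                PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0) < 0)
          return -1;

      if (mode != QEMU_SCHED_CORE_HELPERS)
          return 0;

      /* Copy QEMU's cookie onto every helper process: pull it into the
       * calling task first, then push it out to each helper. */
      if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, qemuPid,
                PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0) < 0)
          return -1;

      for (i = 0; i < nhelpers; i++) {
          if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, helperPids[i],
                    PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0) < 0)
              return -1;
      }

      return 0;
  }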