On 13/07/2017 10:36, Mihai Donțu wrote:
> On Fri, 2017-07-07 at 18:52 +0200, Paolo Bonzini wrote:
>> Worse, KVM is not able to distinguish userspace that has paused the VM
>> from userspace that is doing MMIO or userspace that has a bug and hung
>> somewhere.  And even worse, there are cases where userspace wants to
>> modify registers while doing port I/O (the awful VMware RPCI port).  So
>> I'd rather avoid this.
>
> I should give more details here: we don't need to pause the vCPU-s in
> the sense widely understood but just prevent them from entering the
> guest for a short period of time. In our particular case, we need this
> when we start introspecting a VM that's already running. For this we
> kick the vCPU-s out of the guest so that our scan of the memory does
> not race with the guest kernel/applications.
>
> Another use case is when we inject applications into a running guest.
> We also kick the vCPU-s out while we atomically make changes to kernel
> specific structures.

This is not possible to do in KVM, because KVM doesn't control what
happens to the memory outside KVM_RUN (and of course it doesn't control
devices doing DMA).  You need to talk to QEMU in order to do this.

To do atomic changes to kernel specific structures, I would change the
page tables to inaccessible instead, but that also doesn't protect them
from devices doing DMA into them.

Another issue: say a VM is waiting for a reply from the introspector,
and the reply is delayed so the VM gets a signal and needs to get out to
QEMU with EINTR.  I don't think it is always possible to retry the
instruction on the next KVM_RUN, because the introspector might be
making partial changes.  Add live migration to the mix if you want to
make things even more complicated. :)

I think we need a way to mark a set of commands for atomic application.
That is, the structure of the command stream needs to be

    command 1
    command 2
    event reply 1
    transaction end marker
    command 3
    transaction end marker
    command 4
    event reply 2
    transaction end marker

>>> +8. KVMI_GET_MTRR_TYPE
>>> +---------------------
>>
>> What is this used for?  KVM ignores the guest MTRRs, so if possible I'd
>> rather avoid it.
>
> We use it to identify cacheable memory which usually indicates device
> memory, something we don't want to touch. We are also looking into
> making use of the page attributes (PAT) or other PTE-bits instead of
> MTRR, but for the time being MTRR-s are still being relied upon.

Fair enough.  But you can compute it yourself from the MTRRs, can't you?
A separate command is just adding attack surface in the hypervisor.

>>> +14. KVMI_INJECT_BREAKPOINT
>>> +--------------------------
>>> +
>>> +:Architectures: all
>>> +:Versions: >= 1
>>> +:Parameters: ↴
>>> +
>>> +::
>>> +
>>> +    struct kvmi_inject_breakpoint {
>>> +        __u16 vcpu;
>>> +        __u16 padding[3];
>>> +    };
>>> +
>>> +:Returns: ↴
>>> +
>>> +::
>>> +
>>> +    struct kvmi_error_code {
>>> +        __s32 err;
>>> +        __u32 padding;
>>> +    };
>>> +
>>> +Injects a breakpoint for the specified vCPU. This command is usually sent in
>>> +response to an event and as such the proper GPR-s will be set with the reply.
>>
>> What is a "breakpoint" in this context?
>
> A simple INT3. It's what most of our patches consist of. We keep track
> of them and handle the ones we own while passing (reinjecting) the ones
> used by the guest (debuggers or whatnot).

Why can't they be written with KVMI_READ/WRITE_PHYSICAL?  (I would keep
those two as they provide a more direct interface than map/unmap, and
they work even with introspectors that are not sibling guests of the
introspected VM).

Paolo
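
To make the transaction end marker idea above a bit more concrete, here
is a minimal sketch in C of how a receiver could treat such a stream.
Everything in it is an assumption made for illustration: the names
(msg_kind, txn_queue, txn_feed, txn_abort), the fixed queue size and the
"apply" stand-in are hypothetical and are not part of the KVMI proposal
or of KVM.  Commands and event replies are only queued, and they take
effect together when the end marker arrives.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical message kinds; none of these names exist in KVMI. */
enum msg_kind {
    MSG_COMMAND,     /* a command from the introspector */
    MSG_EVENT_REPLY, /* a reply to an event a vCPU is waiting on */
    MSG_TXN_END,     /* transaction end marker */
};

struct msg {
    enum msg_kind kind;
    uint32_t id;     /* opaque identifier, for illustration only */
};

#define TXN_MAX 16   /* arbitrary bound chosen for the sketch */

struct txn_queue {
    struct msg pending[TXN_MAX]; /* received but not yet applied */
    size_t npending;
    size_t napplied;             /* messages that have taken effect */
    size_t ntransactions;        /* end markers processed */
};

/* Stand-in for actually executing a command or completing an event. */
static void apply_one(struct txn_queue *q, const struct msg *m)
{
    assert(m->kind != MSG_TXN_END); /* markers are never queued */
    q->napplied++;
}

/*
 * Feed one message from the stream.  Anything other than the end marker
 * is queued; the end marker applies the whole queue in order.
 * Returns 0 on success, -1 if the transaction exceeds TXN_MAX.
 */
static int txn_feed(struct txn_queue *q, const struct msg *m)
{
    if (m->kind != MSG_TXN_END) {
        if (q->npending == TXN_MAX)
            return -1;
        q->pending[q->npending++] = *m;
        return 0;
    }

    for (size_t i = 0; i < q->npending; i++)
        apply_one(q, &q->pending[i]);
    q->npending = 0;
    q->ntransactions++;
    return 0;
}

/* Discard a half-received transaction (assumed policy, e.g. on disconnect). */
static void txn_abort(struct txn_queue *q)
{
    q->npending = 0;
}
```

With this shape, nothing is visible to the guest until the marker has
been seen, so a vCPU that has to leave KVM_RUN with EINTR in the middle
of a transaction would observe either none or all of its changes.
Whether a pending transaction should be kept or dropped in that case is
left open here; txn_abort only shows the "drop" option.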