On 27/07/2017 18:23, Mihai Donțu wrote:
> On Thu, 2017-07-13 at 11:15 +0200, Paolo Bonzini wrote:
>> On 13/07/2017 10:36, Mihai Donțu wrote:
>>> On Fri, 2017-07-07 at 18:52 +0200, Paolo Bonzini wrote:
>>>> Worse, KVM is not able to distinguish userspace that has paused the VM
>>>> from userspace that is doing MMIO or userspace that has a bug and hung
>>>> somewhere. And even worse, there are cases where userspace wants to
>>>> modify registers while doing port I/O (the awful VMware RPCI port). So
>>>> I'd rather avoid this.
>>>
>>> I should give more details here: we don't need to pause the vCPU-s in
>>> the sense widely understood but just prevent them from entering the
>>> guest for a short period of time. In our particular case, we need this
>>> when we start introspecting a VM that's already running. For this we
>>> kick the vCPU-s out of the guest so that our scan of the memory does
>>> not race with the guest kernel/applications.
>>>
>>> Another use case is when we inject applications into a running guest.
>>> We also kick the vCPU-s out while we atomically make changes to kernel
>>> specific structures.
>>
>> This is not possible to do in KVM, because KVM doesn't control what
>> happens to the memory outside KVM_RUN (and of course it doesn't control
>> devices doing DMA). You need to talk to QEMU in order to do this.
>
> Maybe add a new exit reason (e.g. KVM_EXIT_PAUSE) and have qemu wait on
> the already opened file descriptor to /dev/kvm for an event?

Nope. QEMU might be running and writing to memory in another thread. I
don't see how this can be reliable on other hypervisors too, actually.

>> To do atomic changes to kernel specific structures, I would change the
>> page tables to inaccessible instead, but that also doesn't protect them
>> from devices doing DMA into them.
>
> If we have qemu pull out of the guest all vCPU-s and wait for a sign
> from the KVMI subsystem, then that looks sufficient. Devices accessing
> the memory (passed-through devices, I assume) should be no problem as
> we're never interested in device memory.

You're certainly interested in bus-master DMA from those devices though.

>> Another issue: say a VM is waiting for a reply from the introspector,
>> and the reply is delayed so the VM gets a signal and needs to get out to
>> QEMU with EINTR. I don't think it is always possible to retry the
>> instruction on the next KVM_RUN, because the introspector might be
>> making partial changes. Add live migration to the mix if you want to
>> make things even more complicated. :)
>>
>> I think we need a way to mark a set of commands for atomic application.
>> That is, the structure of the command stream needs to be
>>
>>     command 1
>>     command 2
>>     event reply 1
>>     transaction end marker
>>     command 3
>>     transaction end marker
>>     command 4
>>     event reply 2
>>     transaction end marker
>
> This should be covered by a previous email exchange.

Correct.

>>>>> +8. KVMI_GET_MTRR_TYPE
>>>>> +---------------------
>>>>
>>>> What is this used for? KVM ignores the guest MTRRs, so if possible I'd
>>>> rather avoid it.
>>>
>>> We use it to identify cacheable memory which usually indicates device
>>> memory, something we don't want to touch. We are also looking into
>>> making use of the page attributes (PAT) or other PTE-bits instead of
>>> MTRR, but for the time being MTRR-s are still being relied upon.
>>
>> Fair enough. But you can compute it yourself from the MTRRs, can't you?
>> A separate command is just adding attack surface in the hypervisor.
>
> I think we can make some basic MTRR info available via GET_REGISTERS
> and do the rest in the introspection tool.

Ok.

>>>>> +Injects a breakpoint for the specified vCPU. This command is usually sent in
>>>>> +response to an event and as such the proper GPR-s will be set with the reply.
>>>>
>>>> What is a "breakpoint" in this context?
>>>
>>> A simple INT3. It's what most of our patches consist of. We keep track
>>> of them and handle the ones we own while passing (reinjecting) the ones
>>> used by the guest (debuggers or whatnot).
>>
>> Why can't they be written with KVMI_READ/WRITE_PHYSICAL? (I would keep
>> those two as they provide a more direct interface than map/unmap, and
>> they work even with introspectors that are not sibling guests of the
>> introspected VM).
>
> They can, nothing is stopping that. Also, we can keep the plain
> read/write interfaces around. It just seemed easier to implement them
> on top of an eventual mmap/munmap interface.

I prefer to keep the simple interface and drop the breakpoint one.

Paolo