Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection

Paolo Bonzini <pbonzini@xxxxxxxxxx> · Thu, 13 Jul 2017 11:15:50 +0200

On 13/07/2017 10:36, Mihai Donțu wrote:
> On Fri, 2017-07-07 at 18:52 +0200, Paolo Bonzini wrote:
>> Worse, KVM is not able to distinguish userspace that has paused the VM
>> from userspace that is doing MMIO or userspace that has a bug and hung
>> somewhere.  And even worse, there are cases where userspace wants to
>> modify registers while doing port I/O (the awful VMware RPCI port).  So
>> I'd rather avoid this.
> 
> I should give more details here: we don't need to pause the vCPU-s in
> the sense widely understood but just prevent them from entering the
> guest for a short period of time. In our particular case, we need this
> when we start introspecting a VM that's already running. For this we
> kick the vCPU-s out of the guest so that our scan of the memory does
> not race with the guest kernel/applications.
> 
> Another use case is when we inject applications into a running guest.
> We also kick the vCPU-s out while we atomically make changes to kernel
> specific structures.

This is not possible to do in KVM, because KVM doesn't control what
happens to the memory outside KVM_RUN (and of course it doesn't control
devices doing DMA).  You need to talk to QEMU in order to do this.

To do atomic changes to kernel-specific structures, I would instead make
the relevant pages inaccessible through the page tables, but that also
doesn't protect them from devices doing DMA into them.

Another issue: say a VM is waiting for a reply from the introspector,
and the reply is delayed so the VM gets a signal and needs to get out to
QEMU with EINTR.  I don't think it is always possible to retry the
instruction on the next KVM_RUN, because the introspector might be
making partial changes.  Add live migration to the mix if you want to
make things even more complicated. :)

I think we need a way to mark a set of commands for atomic application.
That is, the command stream needs to be structured like this, with
everything up to an end marker applied as one unit:

    command 1
    command 2
    event reply 1
    transaction end marker
    command 3
    transaction end marker
    command 4
    event reply 2
    transaction end marker
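
A minimal sketch of how a receiver could buffer such a stream, assuming
messages are only applied once their end marker arrives.  All names here
(msg_kind, txn_queue, txn_feed, txn_abort) are hypothetical and not part
of the proposed ABI; "applying" is reduced to recording the message id.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical message kinds, for illustration only. */
enum msg_kind { MSG_COMMAND, MSG_EVENT_REPLY, MSG_TXN_END };

struct msg {
	enum msg_kind kind;
	int id;
};

#define MAX_PENDING 16
#define MAX_APPLIED 64

struct txn_queue {
	struct msg pending[MAX_PENDING];	/* open transaction */
	size_t npending;
	int applied[MAX_APPLIED];		/* ids applied so far, in order */
	size_t napplied;
};

/*
 * Feed one message.  Commands and event replies are only buffered; the
 * end marker applies the whole buffered group at once.  Returns the
 * number of messages applied by this call, or -1 if a buffer is full.
 */
static int txn_feed(struct txn_queue *q, struct msg m)
{
	size_t i;
	int n;

	assert(q != NULL);

	if (m.kind != MSG_TXN_END) {
		if (q->npending == MAX_PENDING)
			return -1;
		q->pending[q->npending++] = m;
		return 0;
	}

	if (q->napplied + q->npending > MAX_APPLIED)
		return -1;
	for (i = 0; i < q->npending; i++)
		q->applied[q->napplied++] = q->pending[i].id;
	n = (int)q->npending;
	q->npending = 0;
	return n;
}

/* Drop an unfinished transaction, e.g. when the connection goes away. */
static void txn_abort(struct txn_queue *q)
{
	q->npending = 0;
}
```

With this shape, a vCPU that has to leave with EINTR never observes half
of a group: either the end marker has been seen and the whole group is
applied, or nothing from that group is.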

>>> +8. KVMI_GET_MTRR_TYPE
>>> +---------------------
>>
>> What is this used for?  KVM ignores the guest MTRRs, so if possible I'd
>> rather avoid it.
> 
> We use it to identify cacheable memory which usually indicates device
> memory, something we don't want to touch. We are also looking into
> making use of the page attributes (PAT) or other PTE-bits instead of
> MTRR, but for the time being MTRR-s are still being relied upon.

Fair enough.  But you can compute it yourself from the MTRRs, can't you?
A separate command is just adding attack surface in the hypervisor.
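
As a rough sketch of that computation, assuming the introspector has
already read IA32_MTRR_DEF_TYPE and the variable-range PHYSBASE/PHYSMASK
pairs by some other means: the function below follows the usual overlap
rules (UC wins, WT beats WB).  The names var_mtrr and mtrr_type are made
up for this example, fixed-range MTRRs (below 1 MiB) are left out, and
returning UC for an architecturally undefined overlap is a conservative
choice of mine, not something the hardware guarantees.

```c
#include <stddef.h>
#include <stdint.h>

#define MTRR_TYPE_UC	0
#define MTRR_TYPE_WC	1
#define MTRR_TYPE_WT	4
#define MTRR_TYPE_WP	5
#define MTRR_TYPE_WB	6

#define MTRR_DEF_TYPE_ENABLE	(1ULL << 11)	/* IA32_MTRR_DEF_TYPE.E */
#define MTRR_PHYSMASK_VALID	(1ULL << 11)	/* IA32_MTRR_PHYSMASKn.V */

/* Raw values of one IA32_MTRR_PHYSBASEn / IA32_MTRR_PHYSMASKn pair. */
struct var_mtrr {
	uint64_t base;
	uint64_t mask;
};

/* Memory type for gpa from the variable-range MTRRs only. */
static int mtrr_type(uint64_t def_type, const struct var_mtrr *var,
		     size_t nvar, uint64_t gpa)
{
	int type = -1;
	size_t i;

	/* With MTRRs disabled, all of memory is treated as UC. */
	if (!(def_type & MTRR_DEF_TYPE_ENABLE))
		return MTRR_TYPE_UC;

	for (i = 0; i < nvar; i++) {
		uint64_t mask = var[i].mask & ~0xfffULL;
		int t;

		if (!(var[i].mask & MTRR_PHYSMASK_VALID))
			continue;
		if ((gpa & mask) != (var[i].base & mask))
			continue;

		t = (int)(var[i].base & 0xff);
		if (t == MTRR_TYPE_UC)
			return MTRR_TYPE_UC;	/* UC always wins */
		if (type == -1 || type == t)
			type = t;
		else if ((type == MTRR_TYPE_WT && t == MTRR_TYPE_WB) ||
			 (type == MTRR_TYPE_WB && t == MTRR_TYPE_WT))
			type = MTRR_TYPE_WT;	/* WT beats WB */
		else
			return MTRR_TYPE_UC;	/* undefined overlap: assume UC */
	}

	/* No range matched: fall back to the default type. */
	return type == -1 ? (int)(def_type & 0xff) : type;
}
```

For example, with a default type of WB and one valid UC range covering
0xE0000000-0xFFFFFFFF, an address inside that range comes back as UC and
anything else as WB.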

>>> +14. KVMI_INJECT_BREAKPOINT
>>> +--------------------------
>>> +
>>> +:Architectures: all
>>> +:Versions: >= 1
>>> +:Parameters: ↴
>>> +
>>> +::
>>> +
>>> +	struct kvmi_inject_breakpoint {
>>> +		__u16 vcpu;
>>> +		__u16 padding[3];
>>> +	};
>>> +
>>> +:Returns: ↴
>>> +
>>> +::
>>> +
>>> +	struct kvmi_error_code {
>>> +		__s32 err;
>>> +		__u32 padding;
>>> +	};
>>> +
>>> +Injects a breakpoint for the specified vCPU. This command is usually sent in
>>> +response to an event and as such the proper GPR-s will be set with the reply.
>>
>> What is a "breakpoint" in this context?
> 
> A simple INT3. It's what most of our patches consist of. We keep track
> of them and handle the ones we own while passing on (reinjecting) the
> ones used by the guest (debuggers or whatnot).

Why can't they be written with KVMI_READ/WRITE_PHYSICAL?  (I would keep
those two as they provide a more direct interface than map/unmap, and
they work even with introspectors that are not sibling guests of the
introspected VM).
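
To illustrate what I mean, here is a sketch of an introspector arming
and disarming an INT3 using nothing but a physical read and a physical
write.  The read_physical/write_physical helpers are stand-ins that
operate on a local buffer; their signatures are assumptions for this
example and not the KVMI_READ_PHYSICAL/KVMI_WRITE_PHYSICAL wire format.
The sw_breakpoint bookkeeping is likewise hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define INT3_OPCODE 0xCC

/* Stand-in for guest physical memory, for illustration only. */
struct guest_mem {
	uint8_t *ram;
	uint64_t size;
};

static int read_physical(struct guest_mem *g, uint64_t gpa,
			 void *buf, size_t len)
{
	if (gpa > g->size || len > g->size - gpa)
		return -1;
	memcpy(buf, g->ram + gpa, len);
	return 0;
}

static int write_physical(struct guest_mem *g, uint64_t gpa,
			  const void *buf, size_t len)
{
	if (gpa > g->size || len > g->size - gpa)
		return -1;
	memcpy(g->ram + gpa, buf, len);
	return 0;
}

/* One breakpoint owned by the introspector. */
struct sw_breakpoint {
	uint64_t gpa;
	uint8_t saved;	/* original byte, restored on disarm */
	int armed;
};

/* Save the original byte at gpa and replace it with INT3. */
static int bp_arm(struct guest_mem *g, struct sw_breakpoint *bp,
		  uint64_t gpa)
{
	const uint8_t int3 = INT3_OPCODE;

	if (read_physical(g, gpa, &bp->saved, 1))
		return -1;
	if (write_physical(g, gpa, &int3, 1))
		return -1;
	bp->gpa = gpa;
	bp->armed = 1;
	return 0;
}

/* Put the original byte back. */
static int bp_disarm(struct guest_mem *g, struct sw_breakpoint *bp)
{
	if (!bp->armed)
		return -1;
	if (write_physical(g, bp->gpa, &bp->saved, 1))
		return -1;
	bp->armed = 0;
	return 0;
}
```

The list of armed breakpoints is also what lets the introspector tell
its own INT3s apart from the guest's when a #BP event arrives.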

Paolo