Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection

On 07/08/2017 16:12, Mihai Donțu wrote:
> On Mon, 2017-08-07 at 15:49 +0200, Paolo Bonzini wrote:
>> On 07/08/2017 15:25, Mihai Donțu wrote:
>>>> "Pause all VCPUs and stop all DMA" would definitely be a layering
>>>> violation, so it cannot be added.
>>>>
>>>> "Pause all VCPUs" is basically a shortcut for many "pause the VCPU with
>>>> a given id" commands.  I lean towards omitting it.
>>>
>>> The case where the introspector wants to scan the guest memory needs a
>>> KVMI_PAUSE_VM, which as discussed in a previous email, can be the
>>> actual qemu 'pause' command.
>>
>> Do you mean it needs to stop DMA as well?
> 
> No, DMA can proceed normally. I remain of the opinion that KVMI users
> must know what guest memory ranges are OK to access by looking at MTRR-
> s, PAT or guest kernel structures, or a combination of all three.

Ok, good.  Sorry if I am dense on the DMA/no-DMA cases.  (But I don't
understand your remark about guest memory ranges; the point of
bus-master DMA is that it works on any memory, and cache snooping makes
it even easier for hypothetical malware to do memory writes via
bus-master DMA).

>>> However, we would like to limit the
>>> communication channels we have with the host and not use qmp (or
>>> libvirt/etc. if qmp is not exposed). Instead, have a command that
>>> triggers a KVM_RUN exit to qemu which in turn will call the underlying
>>> pause function used by qmp. Would that be OK with you?
>>
>> You would have to send back something on completion, and then I am
>> worried about races and deadlocks.  Plus, pausing a VM at the QEMU level is
>> a really expensive operation, so I don't think it's a good idea to let
>> the introspector do this.  You can pause all VCPUs, or use memory page
>> permissions.
> 
> Pausing all vCPU-s was my first thought, I was just trying to follow
> your statement: "I lean towards omitting it". :-)

Yes, and I still do because a hypothetical "pause all VCPUs" command
still has the issue that you could get other events before the command
completes.  So I am not convinced that a specialized command actually
makes the introspector code much simpler.

I hope you understand that I want to keep the trusted base (not just the
code I maintain, though that is a secondary benefit ;)) as simple as
possible.

> It will take a bit of user-space-fu, in that after issuing N vCPU pause
> commands in a row we will have to wait for N events, which might race
> with other events (MSR, CRx etc.) which need handling, otherwise the
> pause ones will not arrive.

The same issue would be there in QEMU or KVM though.

If you can always request "pause all vCPUs" from an event handler,
avoiding deadlocks is relatively easy.  If you cannot ensure that, for
example because of work that is scheduled periodically, you can send a
KVM_PAUSE command to ensure the work is done in a safe condition.

Then you get the following pseudocode algorithm:

    // a vCPU is not executing guest code, and it's going to check
    // num_pause_vm_requests before going back to guest code
    vcpu_not_running(id) {
        unmark vCPU "id" as running
        if (num vcpus running == 0)
            cond_broadcast(no_running_vcpus)
    }

    pause_vcpu(id) {
        mark vCPU "id" as being-paused
        send KVMI_PAUSED for the vcpu
    }

    // return only when no vCPU is in KVM_RUN
    pause_vm() {
        if this vCPU is running
            if not in an event handler
                // caller should do pause_vcpu and defer the work
                return

            // we know this vCPU is not in KVM_RUN
            vcpu_not_running(this vCPU's id)

        num_pause_vm_requests++
        if (num vcpus running > 0)
            for each vCPU that is running and not being-paused
                pause_vcpu(id)
            while (num vcpus running > 0)
                cond_wait(no_running_vcpus)
    }

    // tell paused vCPUs that they can resume
    resume_vm() {
        num_pause_vm_requests--
        if (num_pause_vm_requests == 0)
            cond_broadcast(no_pending_pause_vm_requests)
        // either we're in an event handler, or a "pause" command was
        // sent for this vCPU.  in any case we're guaranteed to do an
        // event_reply sooner or later, which will again mark the vCPU
        // as running
    }

    // after an event reply, the vCPU goes back to KVM_RUN.  therefore
    // an event reply can act as a synchronization point for pause-vm
    // requests: delay the reply if there's such a request
    event_reply(id, data) {
        if (num_pause_vm_requests > 0) {
            if vCPU "id" is running
                vcpu_not_running(id)
            while (num_pause_vm_requests > 0)
                cond_wait(no_pending_pause_vm_requests)
        }
        mark vCPU "id" as running
        send event reply on KVMI socket
    }

    // this is what you do when KVM tells you that the guest is either
    // in userspace, or waiting to be woken up ("paused" event).  from
    // the introspector POV the two are the same.
    vcpu_ack_pause(id) {
        vcpu_not_running(id)
        unmark vCPU "id" as being-paused

        // deferred work presumably calls pause_vm/resume_vm, and this
        // vCPU is not running now, so this is a nice point to flush it
        if any deferred work exists, do it now
    }

and on the KVMI read handler:

    on reply to "pause" command:
        if reply says the vCPU is currently in userspace
            // we'll get a KVMI_PAUSED event as soon as the host
            // reenters KVM with KVM_RUN, but we can already say the
            // CPU is not running
            vcpu_ack_pause(id)

    on "paused" event:
        vcpu_ack_pause(id)
        event_reply(id, ...)
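
For what it's worth, here is roughly how that bookkeeping could look in
C with pthreads.  It is only a sketch of the pseudocode above, not a
proposal for the actual code: the kvmi_send_pause() and
kvmi_send_event_reply() helpers are made-up placeholders for the
introspector's socket I/O, and vCPUs are simply indexed by id.

    #include <pthread.h>
    #include <stdbool.h>

    #define MAX_VCPUS 64

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t no_running_vcpus = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t no_pending_pause_vm_requests = PTHREAD_COND_INITIALIZER;

    static bool running[MAX_VCPUS];       /* vCPU is (or may be) in KVM_RUN */
    static bool being_paused[MAX_VCPUS];  /* pause sent, not yet acked */
    static int num_running;
    static int num_pause_vm_requests;
    static int num_vcpus;                 /* set at setup time */

    /* placeholders for the real KVMI socket I/O -- not a real API */
    static void kvmi_send_pause(int id) { (void)id; }
    static void kvmi_send_event_reply(int id, const void *data)
    {
        (void)id; (void)data;
    }

    /* call with the lock held */
    static void vcpu_not_running(int id)
    {
        if (running[id]) {
            running[id] = false;
            if (--num_running == 0)
                pthread_cond_broadcast(&no_running_vcpus);
        }
    }

    /* return only when no vCPU is in KVM_RUN; "in_event_handler" means the
     * caller holds an unreplied event for vCPU "id", so that vCPU cannot
     * be in KVM_RUN right now */
    void pause_vm(int id, bool in_event_handler)
    {
        pthread_mutex_lock(&lock);
        if (running[id] && !in_event_handler) {
            /* caller should send a pause for this vCPU and defer the work */
            pthread_mutex_unlock(&lock);
            return;
        }
        vcpu_not_running(id);

        num_pause_vm_requests++;
        for (int i = 0; i < num_vcpus; i++) {
            if (running[i] && !being_paused[i]) {
                being_paused[i] = true;
                kvmi_send_pause(i);
            }
        }
        while (num_running > 0)
            pthread_cond_wait(&no_running_vcpus, &lock);
        pthread_mutex_unlock(&lock);
    }

    /* tell vCPUs blocked in event_reply() that they can resume */
    void resume_vm(void)
    {
        pthread_mutex_lock(&lock);
        if (--num_pause_vm_requests == 0)
            pthread_cond_broadcast(&no_pending_pause_vm_requests);
        pthread_mutex_unlock(&lock);
    }

    /* delay the reply while a pause-vm request is pending, then mark the
     * vCPU as running again */
    void event_reply(int id, const void *data)
    {
        pthread_mutex_lock(&lock);
        if (num_pause_vm_requests > 0) {
            vcpu_not_running(id);
            while (num_pause_vm_requests > 0)
                pthread_cond_wait(&no_pending_pause_vm_requests, &lock);
        }
        if (!running[id]) {
            running[id] = true;
            num_running++;
        }
        pthread_mutex_unlock(&lock);
        kvmi_send_event_reply(id, data);
    }

    /* "paused" event, or a pause-command reply saying the vCPU is in
     * userspace */
    void vcpu_ack_pause(int id)
    {
        pthread_mutex_lock(&lock);
        being_paused[id] = false;
        vcpu_not_running(id);
        pthread_mutex_unlock(&lock);
        /* a good point to flush deferred pause_vm()/resume_vm() work */
    }

The single mutex protects the counters and the per-vCPU flags.  An
event_reply() that arrives while a pause-vm request is pending first
marks its vCPU as not running (which is what pause_vm() is waiting for)
and only then blocks until resume_vm().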

Paolo


