Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection

On Mon, 2017-08-07 at 17:56 +0200, Paolo Bonzini wrote:
> On 07/08/2017 16:12, Mihai Donțu wrote:
> > On Mon, 2017-08-07 at 15:49 +0200, Paolo Bonzini wrote:
> > > On 07/08/2017 15:25, Mihai Donțu wrote:
> > > > > "Pause all VCPUs and stop all DMA" would definitely be a layering
> > > > > violation, so it cannot be added.
> > > > > 
> > > > > "Pause all VCPUs" is basically a shortcut for many "pause the VCPU with
> > > > > a given id" commands.  I lean towards omitting it.
> > > > 
> > > > The case where the introspector wants to scan the guest memory needs a
> > > > KVMI_PAUSE_VM, which as discussed in a previous email, can be the
> > > > actual qemu 'pause' command.
> > > 
> > > Do you mean it needs to stop DMA as well?
> > 
> > No, DMA can proceed normally. I remain of the opinion that KVMI users
> > must know which guest memory ranges are OK to access by looking at
> > MTRRs, PAT or guest kernel structures, or a combination of all three.
> 
> Ok, good.  Sorry if I am dense on the DMA/no-DMA cases.

I think it's OK to restate things, especially since my (our) view on
these matters might not match the KVM reality that you know far
better.

> (But, I don't understand your remark about guest memory ranges; the point of
> bus-master DMA is that it works on any memory, and cache snooping makes
> it even easier for hypothetical malware to do memory writes via
> bus-master DMA).

This is where I separated things in my head: I merely limited myself to
accessing memory, while leaving the reality of DMA-based attacks a
problem to be solved separately. There is some research work being
tested internally on that, but I have yet to get in touch with the
people involved. As soon as I have some details, maybe we can wire
something up in KVM.

> > > > However, we would like to limit the
> > > > communication channels we have with the host and not use qmp (or
> > > > libvirt/etc. if qmp is not exposed). Instead, have a command that
> > > > triggers a KVM_RUN exit to qemu which in turn will call the underlying
> > > > pause function used by qmp. Would that be OK with you?
> > > 
> > > You would have to send back something on completion, and then I am
> > > worried of races and deadlocks.  Plus, pausing a VM at the QEMU level is
> > > a really expensive operation, so I don't think it's a good idea to let
> > > the introspector do this.  You can pause all VCPUs, or use memory page
> > > permissions.
> > 
> > Pausing all vCPUs was my first thought; I was just trying to follow
> > your statement: "I lean towards omitting it". :-)
> 
> Yes, and I still do because a hypothetical "pause all VCPUs" command
> still has the issue that you could get other events before the command
> completes.  So I am not convinced that a specialized command actually
> makes the introspector code much simpler.
> 
> I hope you understand that I want to keep the trusted base (not just the
> code I maintain, though that is a secondary benefit ;)) as simple as
> possible.
> 
> > It will take a bit of user-space-fu, in that after issuing N vCPU pause
> > commands in a row we will have to wait for N events, which might race
> > with other events (MSR, CRx etc.) that need handling first, otherwise
> > the pause events will not arrive.
> 
> The same issue would be there in QEMU or KVM though.
> 
> If you can always request "pause all vCPUs" from an event handler,
> avoiding deadlocks is relatively easy.  If you cannot ensure that, for
> example because of work that is scheduled periodically, you can send a
> KVM_PAUSE command to ensure the work is done in a safe condition.
> 
> Then you get the following pseudocode algorithm:
> 
>     // a vCPU is not executing guest code, and it's going to check
>     // num_pause_vm_requests before going back to guest code
>     vcpu_not_running(id) {
>         unmark vCPU "id" as running
>         if (num vcpus running == 0)
>             cond_broadcast(no_running_vcpus)
>     }
> 
>     pause_vcpu(id) {
>         mark vCPU "id" as being-paused
>         send the "pause" command for the vCPU
>     }
> 
>     // return only when no vCPU is in KVM_RUN
>     pause_vm() {
>         if this vCPU is running
>             if not in an event handler
>                 // caller should do pause_vcpu and defer the work
>                 return
> 
>             // we know this vCPU is not KVM_RUN
>             vcpu_not_running(id)
> 
>         num_pause_vm_requests++
>         if (num vcpus running > 0)
>             for each vCPU that is running and not being-paused
>                 pause_vcpu(id)
>             while (num vcpus running > 0)
>                 cond_wait(no_running_vcpus)
>     }
> 
>     // tell paused vCPUs that they can resume
>     resume_vm() {
>         num_pause_vm_requests--
>         if (num_pause_vm_requests == 0)
>             cond_broadcast(no_pending_pause_vm_requests)
>         // either we're in an event handler, or a "pause" command was
>         // sent for this vCPU.  in any case we're guaranteed to do an
>         // event_reply sooner or later, which will again mark the vCPU
>         // as running
>     }
> 
>     // after an event reply, the vCPU goes back to KVM_RUN.  therefore
>     // an event reply can act as a synchronization point for pause-vm
>     // requests: delay the reply if there's such a request
>     event_reply(id, data) {
>         if (num_pause_vm_requests > 0) {
>             if vCPU "id" is running
>                 vcpu_not_running(id)
>             while (num_pause_vm_requests > 0)
>                 cond_wait(no_pending_pause_vm_requests)
>         }
>         mark vCPU "id" as running
>         send event reply on KVMI socket
>     }
> 
>     // this is what you do when KVM tells you that the guest is either
>     // in userspace, or waiting to be woken up ("paused" event).  from
>     // the introspector POV the two are the same.
>     vcpu_ack_pause(id) {
>         vcpu_not_running(id)
>         unmark vCPU "id" as being-paused
> 
>         // deferred work presumably calls pause_vm/resume_vm, and this
>         // vCPU is not running now, so this is a nice point to flush it
>         if any deferred work exists, do it now
>     }
> 
> and on the KVMI read handler:
> 
>     on reply to "pause" command:
>         if reply says the vCPU is currently in userspace
>             // we'll get a KVMI_PAUSED event as soon as the host
>             // reenters KVM with KVM_RUN, but we can already say the
>             // CPU is not running
>             vcpu_ack_pause()
> 
>     on "paused" event:
>         vcpu_ack_pause()
>         event_reply()

Thank you for this!
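
For my own reference, here is roughly how I picture the bookkeeping
above on the introspector side. It is only a sketch, assuming a pthread
mutex plus two condition variables; kvmi_send_pause_vcpu() and
kvmi_send_event_reply() are made-up placeholders for the actual socket
I/O, and the "pause_vm() called from an event handler" case is left out
for brevity:

    /*
     * Sketch of the vCPU bookkeeping described above: POSIX threads, one
     * mutex, two condition variables.  The kvmi_send_*() calls are stubs.
     */
    #include <pthread.h>
    #include <stdbool.h>

    #define MAX_VCPUS 64

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t no_running_vcpus = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t no_pending_pause_vm_requests = PTHREAD_COND_INITIALIZER;

    static bool running[MAX_VCPUS];
    static bool being_paused[MAX_VCPUS];
    static int num_running;
    static int num_pause_vm_requests;
    static int num_vcpus;

    /* placeholders for the actual KVMI socket I/O */
    static void kvmi_send_pause_vcpu(int id) { (void)id; }
    static void kvmi_send_event_reply(int id, const void *d) { (void)id; (void)d; }

    /* caller holds 'lock' */
    static void vcpu_not_running(int id)
    {
        if (running[id]) {
            running[id] = false;
            if (--num_running == 0)
                pthread_cond_broadcast(&no_running_vcpus);
        }
    }

    /* caller holds 'lock' */
    static void pause_vcpu(int id)
    {
        being_paused[id] = true;
        kvmi_send_pause_vcpu(id);
    }

    /* return only when no vCPU is in KVM_RUN (not from an event handler) */
    void pause_vm(void)
    {
        pthread_mutex_lock(&lock);
        num_pause_vm_requests++;
        for (int id = 0; id < num_vcpus; id++)
            if (running[id] && !being_paused[id])
                pause_vcpu(id);
        while (num_running > 0)
            pthread_cond_wait(&no_running_vcpus, &lock);
        pthread_mutex_unlock(&lock);
    }

    /* tell paused vCPUs (blocked in event_reply) that they can resume */
    void resume_vm(void)
    {
        pthread_mutex_lock(&lock);
        if (--num_pause_vm_requests == 0)
            pthread_cond_broadcast(&no_pending_pause_vm_requests);
        pthread_mutex_unlock(&lock);
    }

    /* delay the reply while a pause-vm request is pending, then mark the
     * vCPU as running again before it goes back to guest code */
    void event_reply(int id, const void *data)
    {
        pthread_mutex_lock(&lock);
        if (num_pause_vm_requests > 0) {
            vcpu_not_running(id);
            while (num_pause_vm_requests > 0)
                pthread_cond_wait(&no_pending_pause_vm_requests, &lock);
        }
        if (!running[id]) {
            running[id] = true;
            num_running++;
        }
        pthread_mutex_unlock(&lock);
        kvmi_send_event_reply(id, data);
    }

    /* "paused" event, or a "pause" reply saying the vCPU is in userspace */
    void vcpu_ack_pause(int id)
    {
        pthread_mutex_lock(&lock);
        vcpu_not_running(id);
        being_paused[id] = false;
        pthread_mutex_unlock(&lock);
    }

The running/not-running transitions are made idempotent so a stray event
cannot corrupt the counters; otherwise it should map to your pseudocode
almost one to one.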

-- 
Mihai Donțu



