Re: [PATCH v1 4/5] KVM: Introduce KVM_EXIT_COCO exit type

Dionna Amalie Glaze <dionnaglaze@xxxxxxxxxx> · Fri, 1 Nov 2024 13:53:26 -0700

On Mon, Oct 28, 2024 at 11:20 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Fri, Sep 13, 2024, Dionna Amalie Glaze wrote:
> > We can extend the ccp driver to, on extended guest request, lock the
> > command buffer, get the REPORTED_TCB, complete the request, unlock the
> > command buffer, and return both the response and the REPORTED_TCB at
> > the time of the request.
>
> Holding a lock across an exit to userspace seems wildly unsafe.

I wasn't suggesting that. I was suggesting adding a special ccp symbol
that performs two SEV commands under the same lock, so that we know
which REPORTED_TCB was used to derive the VCEK that signs the
attestation report in the MSG_REPORT_REQ guest request. The lock is
held only across those two commands, not across the exit. We rely on
that atomicity so that when we exit to user space to request
certificates, we get certificates for the right version.
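
To make the shape concrete, here is a rough user-space model of what I
mean. It is only a sketch: struct fake_fw, fake_report_with_tcb() and
the other fake_* names are made up for illustration and are not the
ccp driver's API, and a pthread mutex stands in for the command-buffer
lock.

```c
#include <pthread.h>
#include <stdint.h>

/* Toy stand-in for firmware state; NOT the real ccp driver. */
struct fake_fw {
	pthread_mutex_t cmd_lock;	/* models the ccp command-buffer lock */
	uint64_t reported_tcb;		/* models REPORTED_TCB */
};

struct report_result {
	uint64_t tcb_in_report;		/* TCB the fake report was produced under */
	uint64_t tcb_at_request;	/* REPORTED_TCB read under the same lock */
};

/* Hypothetical command: read REPORTED_TCB. Caller holds cmd_lock. */
static uint64_t fake_cmd_get_reported_tcb(struct fake_fw *fw)
{
	return fw->reported_tcb;
}

/* Hypothetical command: produce a report. Caller holds cmd_lock. */
static uint64_t fake_cmd_report(struct fake_fw *fw)
{
	return fw->reported_tcb;
}

/* Hypothetical commit: changes REPORTED_TCB, serialized by cmd_lock. */
static void fake_commit(struct fake_fw *fw, uint64_t new_tcb)
{
	pthread_mutex_lock(&fw->cmd_lock);
	fw->reported_tcb = new_tcb;
	pthread_mutex_unlock(&fw->cmd_lock);
}

/*
 * Run both commands under one lock, then drop the lock before
 * returning. Nothing is held by the time the caller exits to user space.
 */
static struct report_result fake_report_with_tcb(struct fake_fw *fw)
{
	struct report_result res;

	pthread_mutex_lock(&fw->cmd_lock);
	res.tcb_at_request = fake_cmd_get_reported_tcb(fw);
	res.tcb_in_report = fake_cmd_report(fw);
	pthread_mutex_unlock(&fw->cmd_lock);

	return res;
}
```

Because fake_commit() takes the same lock, it can run before or after
the pair of commands but never between them, so the two fields of the
result always agree.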

>
> Can you explain the race that you are trying to close, with the exact "bad" sequence
> of events laid out in chronological order, and an explanation of why the race can't
> be solved in userspace?  I read through your previous comment[*] (which I assume
> is the race you want to close?), but I couldn't quite piece together exactly what's
> broken.

1. The control plane delivers a firmware update. Current TCB version
goes up. The machine signals that it needs new certificates before it
can commit.
2. VM performs an extended guest request.
3. KVM exits to user space to get certificates before getting the
report from firmware.
4. [what I understand Michael Roth was suggesting] User space grabs a
file lock to see if it can read the cached certificates. It reads the
certificates and releases the lock before returning to KVM.
5. The control plane delivers the certificates to the machine and
tells it to commit. The machine grabs the certificate file lock, runs
SNP_COMMIT, and releases the file lock. This command updates both
COMMITTED_TCB and REPORTED_TCB.
6. KVM asks firmware to complete the MSG_REPORT_REQ request, but
REPORTED_TCB is now different from what it was in step 4.
7. The guest receives the wrong certificates for certifying the report
it just received.

The problem is that step 4 has to release the lock before the
attestation report is generated.
If we instead get the report first and know what REPORTED_TCB was when
serving that request, then we can exit to user space requesting the
certificates for the report in hand.
A concurrent update can still change REPORTED_TCB as in the above
scenario, but it won't interfere with the certificates, since the
machine should have certificates for both TCB_VERSIONs available until
the commit is complete.
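
The two orderings can be replayed deterministically in a toy model.
This is only a sketch under my own assumptions: a TCB version is a
plain nonzero integer, a certificate set is identified by the TCB it
certifies, and struct host and the helper names are invented for the
example. The commit is forced into the window between the two steps,
as in steps 4 through 6 above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy host state; 0 means "no certificates in this slot". */
struct host {
	uint64_t reported_tcb;		/* models firmware REPORTED_TCB */
	uint64_t current_cert_tcb;	/* certs user space serves by default */
	uint64_t prev_cert_tcb;		/* older certs kept around during an update */
};

/* Step 5: install new certs (keeping the old ones) and commit. */
static void control_plane_commit(struct host *h, uint64_t new_tcb)
{
	h->prev_cert_tcb = h->current_cert_tcb;
	h->current_cert_tcb = new_tcb;
	h->reported_tcb = new_tcb;
}

/* User space looks up certs for a specific TCB; returns 0 if it has none. */
static uint64_t certs_for_tcb(const struct host *h, uint64_t tcb)
{
	if (h->current_cert_tcb == tcb)
		return h->current_cert_tcb;
	if (h->prev_cert_tcb == tcb)
		return h->prev_cert_tcb;
	return 0;
}

/* Certificates first, report second: the ordering in steps 3 to 6. */
static bool certs_first_matches(struct host *h, uint64_t new_tcb)
{
	uint64_t certs = h->current_cert_tcb;	/* step 4 */

	control_plane_commit(h, new_tcb);	/* step 5 races in */

	uint64_t report_tcb = h->reported_tcb;	/* step 6 */

	return certs == report_tcb;
}

/* Report first, then ask user space for certs matching that report. */
static bool report_first_matches(struct host *h, uint64_t new_tcb)
{
	uint64_t report_tcb = h->reported_tcb;	/* report and its TCB in hand */

	control_plane_commit(h, new_tcb);	/* same racing commit */

	return certs_for_tcb(h, report_tcb) == report_tcb;
}
```

Starting from REPORTED_TCB 7 with a racing commit to 8, the
certificates-first ordering pairs version 7 certificates with a
version 8 report, while the report-first ordering asks for version 7
certificates and still finds them in the previous slot.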

I don't think it's workable to have step 1 grab the file lock and
step 5 release it. Waiting for a service to update stale certificates
should not block user attestation requests. Step 4 would fail to get
the lock and return VMM_BUSY, which would eventually cause
attestations to time out in sev-guest.

>
> [*] https://lore.kernel.org/all/CAAH4kHb03Una2kcvyC3W=1ZfANBWF_7a7zsSmWhr_r9g3rCDZw@xxxxxxxxxxxxxx

-- 
-Dionna Glaze, PhD, CISSP, CCSP (she/her)