On Mon, Oct 28, 2024 at 11:20 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Fri, Sep 13, 2024, Dionna Amalie Glaze wrote:
> > We can extend the ccp driver to, on extended guest request, lock the
> > command buffer, get the REPORTED_TCB, complete the request, unlock the
> > command buffer, and return both the response and the REPORTED_TCB at
> > the time of the request.
>
> Holding a lock across an exit to userspace seems wildly unsafe.

I wasn't suggesting this. I was suggesting adding a special ccp symbol
that would perform two SEV commands under the same lock, so that we
know the REPORTED_TCB that was used to derive the VCEK that signs the
attestation report in the MSG_REPORT_REQ guest request.
We use that atomicity to be sure that when we exit to user space to
request certificates, we get the right version of the certificates.

>
> Can you explain the race that you are trying to close, with the exact "bad" sequence
> of events laid out in chronological order, and an explanation of why the race can't
> be solved in userspace?  I read through your previous comment[*] (which I assume
> is the race you want to close?), but I couldn't quite piece together exactly what's
> broken.

1. The control plane delivers a firmware update. Current TCB version
   goes up. The machine signals that it needs new certificates before it
   can commit.
2. VM performs an extended guest request.
3. KVM exits to user space to get certificates before getting the
   report from firmware.
4. [what I understand Michael Roth was suggesting] User space grabs a
   file lock to see if it can read the cached certificates. It reads the
   certificates and releases the lock before returning to KVM.
5. The control plane delivers the certificates to the machine and tells
   it to commit. The machine grabs the certificate file lock, runs
   SNP_COMMIT, and releases the file lock. This command updates both
   COMMITTED_TCB and REPORTED_TCB.
6. KVM asks firmware to complete the MSG_REPORT_REQ request, but it's
   signed under a different REPORTED_TCB.
7. Guest receives the wrong certificates for certifying the report it
   just received.

The fact that 4 has to release the lock before getting the attestation
report is the problem.
If we instead get the report first and know what the REPORTED_TCB was
when serving that request, then we can exit to user space requesting
the certificates for the report in hand.
A concurrent update can still change the REPORTED_TCB as in the above
scenario, but it won't interfere with certificates, since the machine
should have certificates for both TCB_VERSIONs to provide until the
commit is complete.

I don't think it's workable to have 1 grab the file lock and for 5 to
release it. Waiting for a service to update stale certificates should
not block user attestation requests. It would make 4's failure to get
the lock return VMM_BUSY and eventually cause attestations to time out
in sev-guest.

>
> [*] https://lore.kernel.org/all/CAAH4kHb03Una2kcvyC3W=1ZfANBWF_7a7zsSmWhr_r9g3rCDZw@xxxxxxxxxxxxxx

--
-Dionna Glaze, PhD, CISSP, CCSP (she/her)
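
For illustration only, the two orderings described above can be modelled with a toy Python simulation. Everything here is hypothetical: `Firmware`, `report_and_tcb`, `racy_flow`, and `fixed_flow` are made-up names for this sketch, not ccp or KVM interfaces, and the "report" is just a dict recording which REPORTED_TCB it was signed under. The sketch assumes the host keeps certificates for both TCB versions until the commit completes, as stated in the message.

```python
import threading


class Firmware:
    """Toy stand-in for the SEV firmware; not the real ccp interface."""

    def __init__(self, reported_tcb):
        self.reported_tcb = reported_tcb
        self.cmd_lock = threading.Lock()  # models the command-buffer lock

    def snp_commit(self, new_tcb):
        # Models SNP_COMMIT bumping REPORTED_TCB (step 5).
        with self.cmd_lock:
            self.reported_tcb = new_tcb

    def current_tcb(self):
        with self.cmd_lock:
            return self.reported_tcb

    def report(self):
        # Plain report request: signed under whatever REPORTED_TCB is now.
        with self.cmd_lock:
            return {"signed_with_tcb": self.reported_tcb}

    def report_and_tcb(self):
        # Hypothetical combined helper: read REPORTED_TCB and produce the
        # report under ONE lock hold, then return both to the caller.
        with self.cmd_lock:
            tcb = self.reported_tcb
            report = {"signed_with_tcb": tcb}
        return report, tcb


def racy_flow(fw, certs, concurrent_update):
    """Steps 3-6: certificates are fetched before the report is made."""
    cert = certs[fw.current_tcb()]   # user space reads cached certs
    concurrent_update()              # commit lands in between
    report = fw.report()             # report now uses the new TCB
    return report, cert


def fixed_flow(fw, certs, concurrent_update):
    """Report first, then fetch certificates for the TCB actually used."""
    report, tcb = fw.report_and_tcb()
    concurrent_update()              # commit may still land here
    cert = certs[tcb]                # both versions are still available
    return report, cert
```

In `racy_flow` the certificate matches the old TCB while the report is signed under the new one. In `fixed_flow` no lock is held across the exit to user space; the lock only covers the two firmware commands, and the returned TCB value tells user space which certificates to supply.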