On Mon, May 13, 2024, Michael Roth wrote: > On Mon, May 13, 2024 at 04:48:25PM -0700, Sean Christopherson wrote: > > Actually, I take that back, this isn't even an optimization, it's literally a > > non-generic implementation of kvm_run.immediate_exit. > > Relying on a generic -EINTR response resulting from kvm_run.immediate_exit > doesn't seem like a very robust way to ensure the attestation request > was made to firmware. It seems fully possible that future code changes > could result in EINTR being returned for other reasons. So how do you > reliably detect that the EINTR is a result of immediate_exit being called > after the attestation request is made to firmware? We could squirrel something > away in struct kvm_run to probe for, but delivering another > KVM_EXIT_SNP_REQ_CERT with an extra flag set seems to be reasonably > userspace-friendly. And unnecessarily specific to a single exit. But it's a non-issue (except possibly on ARM). I doubt it's formally documented anywhere, but userspace absolutely relies on kvm_run.immediate_exit to be processed _after_ complete_userspace_io(). If KVM exits with -EINTR before invoking cui(), live migration will break due to taking a snapshot of vCPU state in the middle of an instruction. Given that userspace has likely built up rigid expectations for immediate_exit, I don't see any problem formally documenting KVM's behavior, i.e. signing a contract guaranteeing that KVM will complete the "back half" of emulation if immediate_exit is set and KVM_RUN return -EINTR. ARM is the only arch that is at all suspect, due to its rather massive kvm_arch_vcpu_run_pid_change() hook. At a quick glance, it seems to be ok, too. And if it's not, we need to fix that asap, because it's like a bug waiting to happen. > > If this were an optimization, i.e. KVM truly notified userspace without exiting, > > then it would need to be a lot more robust, e.g. to ensure userspace actually > > received the notification before KVM moved on. > > Right, this does rely on exiting via , not userspace polling for flags or > anything along that line. > > > > > > + __u8 flags; > > > + #define KVM_USER_VMGEXIT_REQ_CERTS_STATUS_PENDING 0 > > > + #define KVM_USER_VMGEXIT_REQ_CERTS_STATUS_DONE 1 > > > > This is also a weird reimplementation of generic functionality. KVM nullifies > > vcpu->arch.complete_userspace_io _before_ invoking the callback. So if a callback > > needs to run again on the next KVM_RUN, it can simply set complete_userspace_io > > again. In other words, literally doing nothing will get you what you want :-) > > We could just have the completion callback set complete_userspace_io > again, but then you'd always get 2 userspace exit events per attestation > request. There could be some userspaces that don't implement the > file-locking scheme, in which case they wouldn't need the 2nd notification. Then they don't set immediate_exit. > That's why the KVM_USER_VMGEXIT_REQ_CERTS_FLAGS_NOTIFY_DONE flag is provided > as an opt-in. > > The pending/done status bits are so userspace can distinguish between the > start of a certificate request and the completion side of it after it gets > bound a completed attestation request and the filelock can be released.