On Mon, 2024-04-15 at 14:17 -0700, Sean Christopherson wrote: > > But doesn't the fault handler need the vCPU state? > > Ignoring guest MTRRs, which will hopefully soon be a non-issue, no. There are > only six possible roots if TDP is enabled: > > 1. 4-level !SMM !guest_mode > 2. 4-level SMM !guest_mode > 3. 5-level !SMM !guest_mode > 4. 5-level SMM !guest_mode > 5. 4-level !SMM guest_mode > 6. 5-level !SMM guest_mode > > 4-level vs. 5-level is a guest MAXPHYADDR thing, and swapping the MMU > eliminates > the SMM and guest_mode issues. If there is per-vCPU state that makes its way > into > the TDP page tables, then we have problems, because it means that there is > per-vCPU > state in per-VM structures that isn't accounted for. > > There are a few edge cases where KVM treads carefully, e.g. if the fault is to > the vCPU's APIC-access page, but KVM manually handles those to avoid consuming > per-vCPU state. > > That said, I think this option is effectively 1b, because dropping the SMM vs. > guest_mode state has the same uAPI problems as forcibly swapping the MMU, it's > just a different way of doing so. > > The first question to answer is, do we want to return an error or "silently" > install mappings for !SMM, !guest_mode. And so this option becomes relevant > only > _if_ we want to unconditionally install mappings for the 'base" mode. Ah, I thought there was some logic around CR0.CD. > > > > - Return error on guest mode or SMM mode: Without this patch. > > > Pros: No additional patch. > > > Cons: Difficult to use. > > > > Hmm... For the non-TDX use cases this is just an optimization, right? For > > TDX > > there shouldn't be an issue. If so, maybe this last one is not so horrible. > > And the fact there are so variables to control (MAXPHADDR, SMM, and > guest_mode) > basically invalidates the argument that returning an error makes the ioctl() > hard > to use. I can imagine it might be hard to squeeze this ioctl() into QEMU's > existing code, but I don't buy that the ioctl() itself is hard to use. > > Literally the only thing userspace needs to do is set CPUID to implicitly > select > between 4-level and 5-level paging. If userspace wants to pre-map memory > during > live migration, or when jump-starting the guest with pre-defined state, simply > pre-map memory before stuffing guest state. In and of itself, that doesn't > seem > difficult, e.g. at a quick glance, QEMU could add a hook somewhere in > kvm_vcpu_thread_fn() without too much trouble (though that comes with a huge > disclaimer that I only know enough about how QEMU manages vCPUs to be > dangerous). > > I would describe the overall cons for this patch versus returning an error > differently. Switching MMU state puts the complexity in the kernel. > Returning > an error punts any complexity to userspace. Specifically, anything that KVM > can > do regarding vCPU state to get the right MMU, userspace can do too. > > Add on that silently doing things that effectively ignore guest state usually > ends badly, and I don't see a good argument for this patch (or any variant > thereof). Great.