On Fri, Jan 17, 2025, Yosry Ahmed wrote:
> On Fri, Jan 17, 2025 at 10:01 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > Yep. I suspect the issue is lack of documentation for TLB_FLUSH_GUEST and
> > TLB_FLUSH_CURRENT. I'm not entirely sure where it would be best to document
> > them. I guess maybe where they are #defined?
>
> I guess at the #define we can just mention that they result in calling
> kvm_vcpu_flush_tlb_{guest/current}() before entering the guest, if
> anything.

Yeah, a "See xx for details" redirect is probably the best option.

> The specific documentation about what they do could be above the
> functions themselves, and describing the potential MMU sync is
> naturally part of documenting kvm_vcpu_flush_tlb_guest() (kinda
> already there).
>
> The flush_tlb_guest() callback is documented in kvm_host.h, but not
> flush_tlb_current(). I was going to suggest just documenting that. But
> kvm_vcpu_flush_tlb_guest() does not only call flush_tlb_guest(), but
> it also potentially synchronizes the MMU. So only documenting the
> callbacks does not paint a full picture.
>
> FTR, I initially confused myself because all kvm_vcpu_flush_tlb_*()
> functions are more-or-less thin wrappers around the per-vendor
> callbacks -- except kvm_vcpu_flush_tlb_guest().
>
> > TLB_FLUSH_GUEST is used when a flush of the guest's TLB, from the guest's
> > perspective, is architecturally required. The one oddity with TLB_FLUSH_GUEST
> > is that it does NOT include guest-physical mappings, i.e. TLB entries that are
> > associated with an EPT root.
>
> The way I think about this is how it's documented above the per-vendor
> callback. It flushes translations created by the guest. The guest does
> not (directly) create guest-physical translations, only linear and
> combined translations.

That's not accurate either. When L1 is using nested TDP, it does create
guest-physical translations. The lack of any form of handling in
TLB_FLUSH_GUEST is a reflection of two things: EPT is weird, and nested SVM
doesn't yet support precise flushing on transitions, i.e. nested NPT handling
is missing because KVM unconditionally flushes and synchronizes.

EPT is "weird" because the _only_ time guest-physical translations are flushed
is when the "wrong" KVM MMU is loaded. The only way to flush guest-physical
translations (short of RESET :-D) is via INVEPT, and INVEPT is a root-only
(VMX terminology) instruction, i.e. can only be executed by L1. And because L1
can't itself be using EPT[*], INVEPT can never target/flush the current
context. Furthermore, INVEPT isn't strictly tied to a VMCS, e.g. deferring the
emulated flush until the next time KVM runs a vmcs12 isn't viable. Rather than
add dedicated tracking, KVM simply unloads the roots and lets the normal root
"allocation" handle the flush+sync the next time the vCPU uses the associated
MMU.

Nested NPT is different, as there is no INVNPT. Instead, there's the ASID
itself and a flushing control, both of which are properties of the VMCB. As a
result, NPT TLB flushes that are initiated by a hypervisor always take effect
at VMRUN, e.g. by bumping the ASID, or via the dedicated flushing control. So
when proper handling of TLB flushing on nested SVM transitions comes along, I
do expect kvm_vcpu_flush_tlb_guest() will grow, or maybe we'll add yet another
TLB_FLUSH_XXX flavor :-)

One thing that could be helpful would be to document that KVM doesn't use
TLB_FLUSH_GUEST to handle INVEPT, and so there's no need to sync nested TDP
MMUs.
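For reference, kvm_vcpu_flush_tlb_guest() looks roughly like the below
(paraphrased from arch/x86/kvm/x86.c; exact details, e.g. the Hyper-V PV
flush handling, vary by kernel version), which is where the "not just a thin
wrapper" behavior comes from:

static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
{
	++vcpu->stat.tlb_flush;

	if (!tdp_enabled) {
		/*
		 * With shadow paging, a guest TLB flush is equivalent to
		 * INVPCID(all) from the guest's perspective, and KVM's
		 * shadow page tables cache guest PTE information, so the
		 * shadow MMU roots must be synchronized; a hardware TLB
		 * flush alone isn't sufficient.
		 */
		kvm_mmu_sync_roots(vcpu);
		kvm_mmu_sync_prev_roots(vcpu);
	}

	/* The vendor (VMX/SVM) callback flushes the hardware TLB. */
	kvm_x86_call(flush_tlb_guest)(vcpu);
}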
[*] Even in a deprivileged scenario like pKVM, the guest kernel would become L2 from KVM's perspective.
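As for the comments at the #defines, maybe something like this strawman
(hypothetical wording, placeholder values), i.e. a one-liner plus the
"See xx for details" redirect:

/*
 * Flush TLB entries that were created by the guest, i.e. linear and
 * combined mappings. Does NOT flush guest-physical (TDP) translations.
 * See kvm_vcpu_flush_tlb_guest() for details.
 */
#define KVM_REQ_TLB_FLUSH_GUEST		...

/*
 * Flush TLB entries for the currently loaded context/root, e.g. L1 vs. L2.
 * See kvm_vcpu_flush_tlb_current() for details.
 */
#define KVM_REQ_TLB_FLUSH_CURRENT	...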