Re: [RFC PATCH] KVM: Introduce KVM VIRTIO device

Yan Zhao <yan.y.zhao@xxxxxxxxx> · Thu, 21 Dec 2023 10:44:11 +0800

On Wed, Dec 20, 2023 at 06:12:55PM -0800, Sean Christopherson wrote:
> On Wed, Dec 20, 2023, Yan Zhao wrote:
> > On Tue, Dec 19, 2023 at 12:26:45PM +0800, Yan Zhao wrote:
> > > On Mon, Dec 18, 2023 at 07:08:51AM -0800, Sean Christopherson wrote:
> > > > > > Implementation Consideration
> > > > > > ===
> > > > > > There is a previous series [1] from google to serve the same purpose to
> > > > > > let KVM be aware of virtio GPU's noncoherent DMA status. That series
> > > > > > requires a new memslot flag, and special memslots in user space.
> > > > > > 
> > > > > > We don't choose to use memslot flag to request honoring guest memory
> > > > > > type.
> > > > > 
> > > > > memslot flag has the potential to restrict the impact e.g. when using
> > > > > clflush-before-read in migration?
> > > > 
> > > > Yep, exactly.  E.g. if KVM needs to ensure coherency when freeing memory back to
> > > > the host kernel, then the memslot flag will allow for a much more targeted
> > > > operation.
> > > > 
> > > > > Of course the implication is to honor guest type only for the selected slot
> > > > > in KVM instead of applying to the entire guest memory as in previous series
> > > > > (which selects this way because vmx_get_mt_mask() is in perf-critical path
> > > > > hence not good to check memslot flag?)
> > > > 
> > > > Checking a memslot flag won't impact performance.  KVM already has the memslot
> > > > when creating SPTEs, e.g. the sole caller of vmx_get_mt_mask(), make_spte(), has
> > > > access to the memslot.
> > > > 
> > > > That isn't coincidental, KVM _must_ have the memslot to construct the SPTE, e.g.
> > > > to retrieve the associated PFN, update write-tracking for shadow pages, etc.
> > > > 
> > > Hi Sean,
> > > Do you prefer to introduce a memslot flag KVM_MEM_DMA or KVM_MEM_WC?
> > > For KVM_MEM_DMA, KVM needs to
> > > (a) search VMA for vma->vm_page_prot and convert it to page cache mode (with
> > >     pgprot2cachemode()? ), or
> > > (b) look up memtype of the PFN, by calling lookup_memtype(), similar to that in
> > >     pat_pfn_immune_to_uc_mtrr().
> > > 
> > > But pgprot2cachemode() and lookup_memtype() are not exported by x86 code now.
> > > 
> > > For KVM_MEM_WC, it requires user to ensure the memory is actually mapped
> > > to WC, right?
> > > 
> > > Then, vmx_get_mt_mask() just ignores guest PAT and programs host PAT as EPT type
> > > for the special memslot only, as below.
> > > Is this understanding correct?
> > > 
> > > static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> > > {
> > >         if (is_mmio)                                                                           
> > >                 return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;                          
> > >                                                                                                
> > >         if (gfn_in_dma_slot(vcpu->kvm, gfn)) {                                                 
> > >                 u8 type = MTRR_TYPE_WRCOMB;                                      
> > >                 //u8 type = pat_pfn_memtype(pfn);                                
> > >                 return (type << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;       
> > >         }                                                                                      
> > >                                                                                                
> > >         if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))                            
> > >                 return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;         
> > >                                                                                                
> > >         if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {                                             
> > >                 if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))               
> > >                         return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;                      
> > >                 else                                                                           
> > >                         return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) | 
> > >                                 VMX_EPT_IPAT_BIT;                                
> > >         }                                                                        
> > >                                                                                  
> > >         return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
> > > }
> > > 
> > > BTW, since the special memslot must be exposed to guest as virtio GPU BAR in
> > > order to prevent other guest drivers from access, I wonder if it's better to
> > > include some keyword like VIRTIO_GPU_BAR in memslot flag name.
> > Another choice is to add a memslot flag KVM_MEM_HONOR_GUEST_PAT, then user
> > (e.g. QEMU) does special treatment to this kind of memslots (e.g. skipping
> > reading/writing to them in general paths).
> > 
> > @@ -7589,26 +7589,29 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> >         if (is_mmio)
> >                 return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
> > 
> > +       if (in_slot_honor_guest_pat(vcpu->kvm, gfn))
> > +               return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
> 
> This is more along the lines of what I was thinking, though the name should be
> something like KVM_MEM_NON_COHERENT_DMA, i.e. not x86 specific and not contradictory
> for AMD (which already honors guest PAT).
> 
> I also vote to deliberately ignore MTRRs, i.e. start us on the path of ripping
> those out.  This is a new feature, so we have the luxury of defining KVM's ABI
> for that feature, i.e. can state that on x86 it honors guest PAT, but not MTRRs.
> 
> Like so?
Yes, this looks good to me. Will refine the patch in the recommended way.
Only one remaining question from me is if we could allow user to set the flag
to every slot.
If user is allowed to set the flag to system ram slot, it may cause problem,
since guest drivers (other than the paravirt one) may have chance to get
allocated memory in this ram slot and make its PAT setting take effect, which
is not desired.

Do we need to find a way to limit the eligible slots?
e.g. check VMAs of the slot to ensure its vm_flags include VM_IO and VM_PFNMAP.

Or, just trust user?

> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index d21f55f323ea..ed527acb2bd3 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7575,7 +7575,8 @@ static int vmx_vm_init(struct kvm *kvm)
>         return 0;
>  }
>  
> -static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> +static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio,
> +                         struct kvm_memory_slot *slot)
>  {
>         /* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
>          * memory aliases with conflicting memory types and sometimes MCEs.
> @@ -7598,6 +7599,9 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>         if (is_mmio)
>                 return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
>  
> +       if (kvm_memslot_has_non_coherent_dma(slot))
> +               return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
> +
>         if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
>                 return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> 
> I like the idea of pulling the memtype from the host, but if we can make that
> work then I don't see the need for a special memslot flag, i.e. just do it for
> *all* SPTEs on VMX.  I don't think we need a VMA for that, e.g. we should be able
> to get the memtype from the host PTEs, just like we do the page size.
Right.
Besides, after reading host PTEs, we still need to decode the bits to memtype in
platform specific way and convert the type to EPT type.

> 
> KVM_MEM_WC is a hard "no" for me.  It's far too x86 centric, and as you alluded
> to, it requires coordination from the guest, i.e. is effectively limited to
> paravirt scenarios.
Got it!
> 
> > +
> >         if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
> >                 return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> > 
> >         if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
> >                 if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
> >                         return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
> >                 else
> >                         return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) |
> >                                 VMX_EPT_IPAT_BIT;
> >         }
> > 
> >         return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
> >  }