Re: [PATCH rfc] vfio-pci: Allow write combining

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, 2 Aug 2024 08:53:08 -0300
Jason Gunthorpe <jgg@xxxxxxxx> wrote:

> On Thu, Aug 01, 2024 at 12:16:57PM -0600, Alex Williamson wrote:
> > On Thu, 1 Aug 2024 14:53:39 -0300
> > Jason Gunthorpe <jgg@xxxxxxxx> wrote:
> >   
> > > On Thu, Aug 01, 2024 at 11:33:44AM -0600, Alex Williamson wrote:  
> > > > On Thu, 1 Aug 2024 14:13:55 -0300
> > > > Jason Gunthorpe <jgg@xxxxxxxx> wrote:
> > > >     
> > > > > On Thu, Aug 01, 2024 at 10:52:18AM -0600, Alex Williamson wrote:    
> > > > > > > > vfio_region_info.flags in not currently tested for input therefore this
> > > > > > > > proposal could lead to unexpected behavior for a caller that doesn't
> > > > > > > > currently zero this field.  It's intended as an output-only field.        
> > > > > > > 
> > > > > > > Perhaps a REGION_INFO2 then?
> > > > > > > 
> > > > > > > I still think per-request is better than a global flag      
> > > > > > 
> > > > > > I don't understand why we'd need a REGION_INFO2, we already have
> > > > > > support for defining new regions.      
> > > > > 
> > > > > It is not a new region, it is a modified mmap behavior for an existing
> > > > > region.    
> > > > 
> > > > If we're returning a different offset into the vfio device file from
> > > > which to get a WC mapping, what's the difference?     
> > > 
> > > I think it is a pretty big difference.. The offset is just a "mmap
> > > cookie", it doesn't have to be 1:1 with the idea of a region.
> > >   
> > > > A vfio "region" is
> > > > describing a region or range of the vfio device file descriptor.    
> > > 
> > > I'm thinking a region is describing an area of memory that is
> > > available in the VFIO device. The offset output is just a "mmap
> > > cookie" to tell userspace how to mmap it. Having N mmap cookies for 1
> > > region is OK.  
> > 
> > Is an "mmap cookie" an offset into the vfio device file where mmap'ing
> > that offset results in a WC mapping to a specific device resource?  
> 
> Yes
> 
> > Isn't that just a region that doesn't have an index or supporting
> > infrastructure?  
> 
> No? It is a "mmap cookie" that has the requested mmap-time properties.
> 
> Today the kernel side binds the mmap offset to the index in a computed
> way, but from a API perspective userspace does REGION_INFO and gets
> back an opaque "mmap cookie" that it uses to pass to mmap to get back
> the thing describe by REGION_INFO. Userspace has no idea about any
> structure to the cookie numbers.

REGION_INFO is part of what I'm referring to as infrastructure.
Userspace knows a number of regions, ie. indexes, via DEVICE_INFO and
iterates those regions using REGION_INFO to get the mmap cookie, ie.
file offset.  The ABI is flexible here, the first few region indexes
are static mappings to vfio-pci specific device resources and
additional mappings are described as device specific resources using
the REGION_INFO capability chain.

> > > Which controls what NC/WC the mmap creates when called:
> > > 
> > > +	if (vdev->bar_write_combine[index])
> > > +		vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
> > > +	else
> > > +		vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> > > 
> > > You get the same output from REGION_INFO, same number of regions.  
> > 
> > It was your proposal that introduced REQ_WC, this is Keith's original
> > proposal.  I'm equating a REQ_WC request inventing an "mmap cookie" as
> > effectively the same as bringing a lightweight region into existence
> > because it defines a section of the vfio device file to have specific
> > mmap semantics.  
> 
> Well, again, it is not a region, it is just a record that this mmap
> cookie uses X region with Y mapping flags. The number of regions don't
> change. Critically from a driver perspective the number of regions and
> region indexes wouldn't change.

Why is this critical?  As above, for vfio-pci devices the first few
region indexes are static mappings to specific PCI resources and
additional resources beyond VFIO_PCI_NUM_REGIONS are device specific
and described by a region type in the capability chain.

None of the static regions move and it's part of the defined ABI for
userspace to iterate additional regions.

> I am not thing of making a new region, I am thinking of adjusting how
> mmap works. Like today we do this:
> 
> 	index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
> 
> So maybe instead it is something like:
> 
>       vfio_decode_mmap_cookie(vma, &index, &flags);
>       if (flags & VFIO_MMAP_WC)
>           prot = pgprot_writecombine(..)
> 
> From a driver perspective the region indexes don't change at all. This
> is what I think is important here.

The only thing that changes is that num_regions increases and the user
can go check REGION_INFO on those newly reported regions.  The way we
currently calculate the index from the vm_pgoff is implementation, not
uAPI.  We can change that if necessary.  Userspace should always get
the offset, ie. mmap cookie, from REGION_INFO and the initial set of
region indexes to device resources are fixed, they will not change.

> > > It was the other proposal from long ago that created more regions.
> > > 
> > > This is what I like and would prefer to stick with. REGION_INFO
> > > doesn't really change, we don't have two regions refering to the same
> > > physical memory, and we find some way to request NC/WC of a region at
> > > mmap time.  
> > 
> > "At mmap time" means that something in the vma needs to describe to us
> > to use the WC semantics, where I think you're proposing that the "mmap
> > cookie" provides a specific vm_pgoff which we already use to determine
> > the region index.    
> 
> Yes
> 
> > So whether or not we want to call this a region,
> > it's effectively in the same address space as regions.  Therefore "mmap
> > cookie" ~= "region offset".  
> 
> Well, that is just the current implementation. What we did in RDMA
> when we switched from hard coded mmap cookies to dynamic ones is
> use an xarray (today this should be a maple tree) to dynamically
> allocate mmap cookies whenever the driver returns something to
> userspace. During the mmap fop the pgoff is fed back through the maple
> tree to get the description of what the cookie represents.

Sure, we could do that too, the current implementation (not uAPI) just
uses some upper bits to create fixed region address spaces.  The only
thing we should need to keep consistent is the mapping of indexes to
device resources up through VFIO_PCI_NUM_REGIONS.

> So the encoding of cookies is completely disjoint from whatever the
> underlying thing is. If you want the same region to be mapped with two
> or three different prot flags you just ask for two or three cookies
> and at mmap time you can recover the region pointer and the mmap
> flags.
> 
> So VFIO could do several different things here to convay the mmap
> flags through the cookie, including somehow encoding it in a pgoff
> bit, or using a dynamic maple tree scheme.
> 
> My point is to not confuse the pgoff encoding with the driver concept
> of a region. The region is a single peice of memory, the "mmap cookie"s
> are just handles to it. Adding more data to the handle is not the same
> as adding more regions.

I don't get it.  Take for instance PCI config space.  Given the right
GPU, I can get to config space through an I/O port region, an MMIO
region (possibly multiple ways), and the config space region itself.
Therefore based on this hardware implementation there is no unique
mapping that says that config space is uniquely accessible via a single
region.  Each of these regions has different semantics.  If the layout
of the device can cause this, why do we restrict ourselves that a given
BAR can only be accessed via a signle region and we need to play games
with terminology to call it an mmap cookie rather than officially
creating a region with WC mmap semantics?

> > > > At the limit they're the same.  We could use a
> > > > DEVICE_FEATURE to ask vfio to selectively populate WC regions after
> > > > which the user could re-enumerate additional regions, or in fact to
> > > > switch on WC for a given region if we want to go that route.  Thanks,    
> > > 
> > > This is still adding more regions and reporting more stuff from
> > > REGION_INFO, that is what I would like to avoid.  
> > 
> > Why?  This reminds me of hidden registers outside of capability chains
> > in PCI config space.  Thanks,  
> 
> The mmap offsets are not (supposed to be) ABI in the VFIO ioctls. The
> encoding is entirely opaque inside the kernel already. Apps are
> supposed to use REGION_INFO to learn the value to pass to mmap. ie
> things like VFIO_PCI_OFFSET_SHIFT are not in the uAPI header.

Exactly, which gives us the flexibility to add new regions as
necessary.  The mechanism by which userspace iterates regions is
already provided in the uABI.  The initial set of static vfio-pci
regions require fixed indexes, but not fixed offsets.
VFIO_PCI_OFFSET_SHIFT is only an internal implementation detail.  Thanks

Alex





[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux