Re: [PATCH v2] PCI: IOV: read SRIOV_NUM_VF after enabling ARI

Hi Ben,

Thanks for the detailed response!

On Fri, Oct 16, 2015 at 11:56:28AM -0500, Ben Shelton wrote:
> Hi Bjorn,
> 
> > What problem does this patch solve, Ben?  I assume you have devices
> > that do change TotalVFs when ARI is enabled, and you do want the new
> > value?
> > 
> > Or is the problem something like the following:
> > 
> >   - ...
> >   - Linux PCI core sees TotalVFs = X (saved as iov->total_VFs)
> >   - Linux sets ARI Capable Hierarchy
> >   - Device changes TotalVFs to X + Y (but PCI core doesn't notice)
> >   - Driver reads TotalVFs and sees X + Y
> >   - Driver attempts pci_enable_sriov(dev, X + Y), which fails because
> >     sriov_enable() sees "X + Y > iov->total_VFs"
> 
> Here's a short snippet from the databook for the PCI Express controller we're
> using:
> 
> "Supports two sets of VF Stride, First VF Offset, InitialVFs, and TotalVFs
> registers per PF—one each for ARI and non-ARI hierarchies. Selection is
> performed by host software through the ARI Capable Hierarchy bit of the Control
> register in the PF0 SR-IOV capability structure."
> 
> The values in InitialVFs and TotalVFs are HWinit for each set of registers.

The HwInit description says "bits are read-only after initialization
and can only be reset (for write-once by firmware) with 'Power Good
Reset'."  I don't see any provision for different values based on a
control register bit, so I think this device is actually out of spec.

We should be able to deal with it, so it's not that big a deal, but we
will have to keep it in mind and probably mention it in a comment in
the code.

> So the issue this is intended to fix is the following:
> 
> - Linux PCI core sees TotalVFs = X (saved as iov->total_VFs).
> - Linux sets ARI Capable Hierarchy.
> - Device switches to exposing the second set of registers, where
>   InitialVFs = TotalVFs = Y (where Y > X).
> - User enables one or more VFs on the device, e.g. by writing a value to
>   sriov_numvfs in the sysfs.
> - Driver calls pci_enable_sriov() for the device, which then calls
>   sriov_enable().  sriov_enable() reads InitialVFs (= Y) and then checks if it's
>   greater than iov->total_VFs (= X).  Since Y > X, the comparison is true, so
>   sriov_enable() fails out and returns -EIO.

I think there are two problems here:

  1) We should be reading some registers together to make sure we get
     consistent values.  For example, we always read VFOffset and
     VFStride immediately after writing NumVFs.  I think we should
     read InitialVFs and TotalVFs together.  I don't see the point of
     reading TotalVFs in sriov_init() and reading InitialVFs in
     sriov_enable().  If we read them both in sriov_init(), I don't
     think we'd have this problem of sriov_enable() returning -EIO.

  2) To work around this non-compliant device, we should read InitialVFs
     and TotalVFs after setting the ARI bit.

Ideally, I think this would be two patches: one to move the InitialVFs
read from sriov_enable() to sriov_init(), and a second to move the
pair from before setting ARI to after.
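
To make the ordering issue concrete, here is a rough user-space model,
not kernel code.  Everything in it (struct model_pf, the read_*()
helpers, enable_current_order(), enable_proposed_order()) is made up
for illustration, and the checks are a simplified version of what
sriov_enable() does.  It assumes a device like yours that keeps one
InitialVFs/TotalVFs pair per hierarchy type and switches between them
on the ARI Capable Hierarchy bit.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical PF model: index 0 = non-ARI register set, 1 = ARI set. */
struct model_pf {
	bool ari;			/* ARI Capable Hierarchy bit */
	uint16_t initial_vfs[2];
	uint16_t total_vfs[2];
};

static uint16_t read_initial_vfs(const struct model_pf *pf)
{
	return pf->initial_vfs[pf->ari ? 1 : 0];
}

static uint16_t read_total_vfs(const struct model_pf *pf)
{
	return pf->total_vfs[pf->ari ? 1 : 0];
}

/* Simplified sanity checks, loosely modeled on sriov_enable(). */
static int check_vfs(uint16_t initial, uint16_t total, int nr_virtfn)
{
	if (initial > total)
		return -EIO;
	if (nr_virtfn < 0 || nr_virtfn > initial)
		return -EINVAL;
	return 0;
}

/* Current order: TotalVFs cached before ARI is set, InitialVFs read later. */
static int enable_current_order(struct model_pf *pf, int nr_virtfn)
{
	uint16_t total, initial;

	assert(pf);
	total = read_total_vfs(pf);	/* init time, non-ARI set */
	pf->ari = true;			/* device switches register sets */
	initial = read_initial_vfs(pf);	/* enable time, ARI set */

	return check_vfs(initial, total, nr_virtfn);
}

/* Proposed order: set ARI first, then read both fields together. */
static int enable_proposed_order(struct model_pf *pf, int nr_virtfn)
{
	uint16_t total, initial;

	assert(pf);
	pf->ari = true;
	initial = read_initial_vfs(pf);
	total = read_total_vfs(pf);

	return check_vfs(initial, total, nr_virtfn);
}
```

With X = 4 in the non-ARI set and Y = 8 in the ARI set, the first
function compares InitialVFs = 8 against a stale TotalVFs = 4 and
returns -EIO, while the second sees a consistent pair.  A compliant
device (same values in both sets) behaves the same either way.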

> > I'm a little dubious about drivers reading the SRIOV capability
> > directly, so maybe this is a symptom of deeper problems.
> 
> I agree that the driver should not be reading the capability directly, but from
> what I understand, it's intended for the device itself to do this.  From the PCI
> SR-IOV spec revision 1.1:
> 
> "ARI Capable Hierarchy is a hint to the Device that ARI has been enabled in the
> Root Port or Switch Downstream Port immediately above the Device."

Sure, of course, the device should behave differently based on how the
registers are programmed; that's the whole point of having writable
registers.  I think the particular case of the device changing HwInit
registers is out of spec, but changing things like VFOffset and
VFStride is completely expected.

Bjorn