Hi Ben,

Thanks for the detailed response!

On Fri, Oct 16, 2015 at 11:56:28AM -0500, Ben Shelton wrote:
> Hi Bjorn,
>
> > What problem does this patch solve, Ben?  I assume you have devices
> > that do change TotalVFs when ARI is enabled, and you do want the new
> > value?
> >
> > Or is the problem something like the following:
> >
> >   - ...
> >   - Linux PCI core sees TotalVFs = X (saved as iov->total_VFs)
> >   - Linux sets ARI Capable Hierarchy
> >   - Device changes TotalVFs to X + Y (but PCI core doesn't notice)
> >   - Driver reads TotalVFs and sees X + Y
> >   - Driver attempts pci_enable_sriov(dev, X + Y), which fails because
> >     sriov_enable() sees "X + Y > iov->total_VFs"
>
> Here's a short snippet from the databook for the PCI Express controller we're
> using:
>
> "Supports two sets of VF Stride, First VF Offset, InitialVFs, and TotalVFs
> registers per PF—one each for ARI and non-ARI hierarchies. Selection is
> performed by host software through the ARI Capable Hierarchy bit of the Control
> register in the PF0 SR-IOV capability structure."
>
> The values in InitialVFs and TotalVFs are HWinit for each set of registers.

The HwInit description says "bits are read-only after initialization
and can only be reset (for write-once by firmware) with 'Power Good
Reset'."  I don't see any provision for different values based on a
control register bit, so I think this device is actually out of spec.

We should be able to deal with it, so it's not that big a deal, but we
will have to keep it in mind and probably mention it in a comment in
the code.

> So the issue this is intended to fix is the following:
>
> - Linux PCI core sees TotalVFs = X (saved as iov->total_VFs).
> - Linux sets ARI Capable Hierarchy.
> - Device switches to exposing the second set of registers, where
> InitialVFs = TotalVFs = Y (where Y > X).
> - User enables one or more VFs on the device, e.g. by writing a value to
> sriov_numvfs in the sysfs.
> - Driver calls pci_enable_sriov() for the device, which then calls
> sriov_enable(). sriov_enable() reads InitialVFs (= Y) and then checks if it's
> greater than iov->total_VFs (= X). Since Y > X, the comparison is true, so
> sriov_enable() fails out and returns -EIO.

I think there are two problems here:

  1) We should be reading some registers together to make sure we get
     consistent values.  For example, we always read VFOffset and
     VFStride immediately after writing NumVFs.  I think we should
     read InitialVFs and TotalVFs together.  I don't see the point of
     reading TotalVFs in sriov_init() and reading InitialVFs in
     sriov_enable().  If we read them both in sriov_init(), I don't
     think we'd have this problem of sriov_enable() returning -EIO.

  2) To work around this non-compliant device, we should read
     InitialVFs and TotalVFs after setting the ARI bit.

Ideally, I think this would be two patches: one to move the InitialVFs
read from sriov_enable() to sriov_init(), and a second to move the
pair from before setting ARI to after.

> > I'm a little dubious about drivers reading the SRIOV capability
> > directly, so maybe this is a symptom of deeper problems.
>
> I agree that the driver should not be reading the capability directly, but from
> what I understand, it's intended for the device itself to do this. From the PCI
> SR-IOV spec revision 1.1:
>
> "ARI Capable Hierarchy is a hint to the Device that ARI has been enabled in the
> Root Port or Switch Downstream Port immediately above the Device."

Sure, of course, the device should behave differently based on how the
registers are programmed; that's the whole point of having writable
registers.  I think the particular case of the device changing HwInit
registers is out of spec, but changing things like VFOffset and
VFStride is completely expected.

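
To make the ordering in 1) and 2) concrete, here is a rough userspace
model of the two read orders.  It is only a sketch, not kernel code:
struct fake_pf, struct fake_iov, and every function name below are made
up for illustration, the model assumes ARI is always enabled, and the
enable check is simplified (it ignores VF Migration).  The real logic
lives in sriov_init() and sriov_enable().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the device from the databook excerpt: two sets of
 * HwInit values, selected by the ARI Capable Hierarchy bit.
 * Index 0 is the non-ARI set, index 1 is the ARI set.
 */
struct fake_pf {
	bool ari;
	uint16_t initial_vfs[2];
	uint16_t total_vfs[2];
};

/* Values cached by the core; stands in for the fields in struct pci_sriov. */
struct fake_iov {
	uint16_t initial_VFs;
	uint16_t total_VFs;
};

static uint16_t read_initial_vfs(const struct fake_pf *pf)
{
	return pf->initial_vfs[pf->ari ? 1 : 0];
}

static uint16_t read_total_vfs(const struct fake_pf *pf)
{
	return pf->total_vfs[pf->ari ? 1 : 0];
}

/* Ordering described in the thread: TotalVFs is cached before ARI is set. */
static void init_current(struct fake_pf *pf, struct fake_iov *iov)
{
	assert(pf && iov);
	iov->total_VFs = read_total_vfs(pf);
	pf->ari = true;		/* model assumes ARI is always enabled */
}

/* InitialVFs is read late and compared with the stale TotalVFs. */
static int enable_current(const struct fake_pf *pf, const struct fake_iov *iov,
			  uint16_t nr_virtfn)
{
	uint16_t initial = read_initial_vfs(pf);

	if (initial > iov->total_VFs || nr_virtfn > initial)
		return -EIO;
	return 0;
}

/* Proposed ordering: set ARI first, then read both values together. */
static void init_proposed(struct fake_pf *pf, struct fake_iov *iov)
{
	assert(pf && iov);
	pf->ari = true;
	iov->initial_VFs = read_initial_vfs(pf);
	iov->total_VFs = read_total_vfs(pf);
}

/* Enable only consults the cached, mutually consistent pair. */
static int enable_proposed(const struct fake_iov *iov, uint16_t nr_virtfn)
{
	if (iov->initial_VFs > iov->total_VFs || nr_virtfn > iov->initial_VFs)
		return -EIO;
	return 0;
}
```

In this model, a device that reports 4 VFs in its non-ARI set and 8 in
its ARI set makes init_current() followed by enable_current() return
-EIO even for 2 VFs, while init_proposed() caches 8/8 and
enable_proposed() accepts up to 8 VFs.
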
Bjorn