Re: [RFC PATCH] pci: add hook for architectures to disble SR-IOV at runtime

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



> sriov_init() gets called first. It gathers the resource size and resource
> flags information.  pci_bios_allocate_resource() uses that information to allocate resource
> and in case of power, verify the allocations done by the firmware. Since the code
> in pcibios_allocate_resources() does not validate the IOV BARs, the res->flags
> of those IOV BARS continue to stay enabled, which later leads to conflicts.
> The key here is to reset res->flags for any BAR when conflict is detected.

No, you should never do that.

For example, take a device with a memory BAR which has a conflict. If
you reset the flags, you essentially "hide" that BAR from Linux,
potentially allows Linux to enable memory (or IO) decoding on the
device ... but the BAR is still there with it's conflicting value, and
will potentially cause hell on your bus.

You cannot enable memory or IO decoding on a device if any corresponding
BAR has conflicts. You can't just "hide" the conflicting BAR.

SR-IOV is to some extent similar. If the IOV BAR isn't properly
allocated, it must not be hidden, but instead marked as such
(res->parent == NULL is how you detect a non-allocated BAR in generic
code).

> > We still also need to make sure pci_enable_device() doesn't rely on that
> > resource being properly allocated.
> 
> hmm.. pci_enable_device()'s job is to enable all the resources used by the
> device. SRIOV BARs; after all, is a BAR belonging to the PF, not the VF. agree?

Not exactly. It's not necessary for the PFs to operate.

The main question is whether it's -decoded- by the device, and that's
why we should probably -not- leave the device with a numVF set to non-0
until after we know we have a properly allocated VF BAR (and if not,
leave numVF to 0).
 
> > Finally, keep in mind that the way we read that resource to begin with
> > is buggy (yet another piece of code written with only x86 in mind and
> > happily stuck into a generic place):
> > 
> > We do a __pci_read_base() from the IOV code. This is broken on quite a
> > few non-x86 archs such as powerpc:
> > 
> > For us, MMIO addresses from the CPU space aren't mapped 1:1 to
> > corresponding PCI addresses. They need to be translated. This is done at
> > probe time from a header quirk for the "normal" BARs. The generic code
> > doesn't do it for IOV. It should at the very least use
> > pcibios_bus_to_resource() to convert it.
> 
> ok.  will explore this. 

Note also, that with the other kind of constraints I mentioned, such as
the need that we will have to adjust the page size using an EEH specific
algorithm on powernv (under OPAL) etc... I think the right way to do
things is:

 - sriov_init() (probe time) reads in all the information for the device
(sizes the IOV BAR, etc...), collects everything it needs in the iov
structure (including the "preferred" system page size setting), but
leaves the device back in a state where system page size is the original
setting and numVF is 0.

 - The VF BAR then may or may not be "allocated" (or re-assigned etc...)

 - Later, either when the PF tries to enable the first VF, or at some
other stage at the end of the resource survey (though the later would
need a new arch hook), the device IOV functionality gets configured if
(and only if) the VF BAR has been successfully assigned & allocated (ie
res->parent non NULL). At that point, we can then set numVF, and apply
the page size setting. HOWEVER, we shouldn't -calculate- them there, we
should just apply the values from the iov structure. That way, arch
code, during the platform/arch-specific resource survey, can decide to
use a different page size calculation algorithm, or change the number of
supported VFs etc... based on platform considerations, such  as the need
to apply a "special" stride or limit the number of VFs due to resource
constraints (max number of PEs for EEH etc...).

IE, This is the same basic scheme used for everything else on PCI, we
first read things, leaving devices mostly alone, then there's a resource
allocation/assignment phase, then the resources are actually used &
devices enabled.

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux