On Fri, 2011-10-07 at 06:29 -0400, Prarit Bhargava wrote: > Frankly this is only useful for developers and _a few_ enterprise > users at this point. Maybe the value needs to be a config value, but > that doesn't explain the increase in complexity of the code Nish is > suggesting. No, it's all our enterprise users since we are shipping SR-IOV capable adapters and Linux trying to mess around with VFs makes them not work. > What we're saying is that the real fix isn't to plunk a bunch of code > which will be dead in a few months when powerpc gets its act together > on SRIOV. Going with a bunch of complex arch-specific code for little > benefit seems overkill here. Well, powerpc itself has no problem with SR-IOV, ie, it will work on all our platforms that don't have a hypervisor :-) The case of pHyp is special since all the resources are under control of the hypervisor and I don't know if will -ever- get its act together. AFAIK, there is no plan at this point for pHyp to support guest OSes in control of PFs creating VFs at this stage but I don't know for sure. So we are in the realm of something that the platform code knows at boot time. Asking the -user- to know about these things (or even care wtf SR-IOV means or is about) is -NOT- the right approach here. Things should "just work" to the extent that these adapters should operate in PF only ode just fine. Now regarding the actual patch itself, I like neither of the proposed ones so far. The sriov=off one looks like something that could have some values for field diagnostics etc... and so might be good to apply regardless. For the pHyp issues the problem goes deeper, which is why just disabling sr-iov automatically on those platforms is the easiest solution for now. Long run, I'd rather see a bunch of the stuff that sriov_init does at probe time to move later, after resources have been allocated, to allow to -not- change the VF count or mess with the page size if the IOV resource has not been allocated and will not be. Also, the blind whacking of the page size is something that I suspect have had very very little testing. The system page size setting in SR-IOV is a strange and not well architected beast. It's -not- just the stride between the VFs. It also tends to impact the layout of various data structures (DMA descriptors etc...) manipulated by the chip (depending on the vendor), and I wouldn't be surprised that changing it just plain breaks out driver (we still don't know exactly what's going on just yet but we know that going through sriov_init, even without requesting VFs later on, already breaks some adapters for us, possibly because of that for example). Cheers, Ben. > [FWIW -- Bjorn's suggestion of pushing all of this into the sriov > struct is the way to go] > > P. > > > Ben. > > > >> > >>>> -----8<------ > >>>> > >>>> [PATCH] Add a kernel parameter that disables SRIOV > >>>> > >>>> Add sriov=[0|1] as a kernel parameter to disable or enable SRIOV. > This is > >>>> useful for debugging SRIOV issues vs. bare-metal issues. > >>>> > >>>> Signed-off-by: Prarit Bhargava<prarit@xxxxxxxxxx> > >>>> --- > >>>> Documentation/kernel-parameters.txt | 4 ++++ > >>>> drivers/pci/iov.c | 30 > +++++++++++++++++++++++++++++- > >>>> 2 files changed, 33 insertions(+), 1 deletions(-) > >>>> > >>>> diff --git a/Documentation/kernel-parameters.txt > b/Documentation/kernel-parameters.txt > >>>> index 854ed5c..235b5c5 100644 > >>>> --- a/Documentation/kernel-parameters.txt > >>>> +++ b/Documentation/kernel-parameters.txt > >>>> @@ -2385,6 +2385,10 @@ bytes respectively. Such letter suffixes > can also be entirely omitted. > >>>> spia_pedr= > >>>> spia_peddr= > >>>> > >>>> + sriov= [HW] > >>>> + Format: [0|1] > >>>> + Enable or disable SRIOV support in kernel. > >>>> + > >>>> stacktrace [FTRACE] > >>>> Enabled the stack tracer on boot up. > >>>> > >>>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c > >>>> index 42fae47..da96e3f 100644 > >>>> --- a/drivers/pci/iov.c > >>>> +++ b/drivers/pci/iov.c > >>>> @@ -408,6 +408,19 @@ static void sriov_disable(struct pci_dev > *dev) > >>>> iov->nr_virtfn = 0; > >>>> } > >>>> > >>>> +static unsigned int sriov_enabled = 1; > >>>> +static int __init do_sriov_enable(char *str) > >>>> +{ > >>>> + get_option(&str,&sriov_enabled); > >>>> + if (!sriov_enabled) > >>>> + printk(KERN_INFO "SRIOV: Disabled\n"); > >>>> + else > >>>> + printk(KERN_INFO "SRIOV: Enabled\n"); > >>>> + > >>>> + return 1; > >>>> +} > >>>> +__setup("sriov=", do_sriov_enable); > >>> > >>> Hrm, we only will print the above is sriov= is passed on the > kernel > >>> command-line, right? It seems like maybe we should print it > always. And > >>> perhaps the prefix should be "SR-IOV"? > >>> > >>>> static int sriov_init(struct pci_dev *dev, int pos) > >>>> { > >>>> int i; > >>>> @@ -419,6 +432,12 @@ static int sriov_init(struct pci_dev *dev, > int pos) > >>>> struct resource *res; > >>>> struct pci_dev *pdev; > >>>> > >>>> + if (!sriov_enabled) { > >>>> + rc = -EPERM; > >>>> + /* in theory no flags should have been set ... */ > >>>> + goto failed; > >>>> + } > >>>> + > >>> > >>> Hrm, is this needed if pci_enable_sriov() checks it already? Isn't > that > >>> the only caller? > >>> > >>>> if (dev->pcie_type != PCI_EXP_TYPE_RC_END&& > >>>> dev->pcie_type != PCI_EXP_TYPE_ENDPOINT) > >>>> return -ENODEV; > >>>> @@ -614,7 +633,10 @@ resource_size_t > pci_sriov_resource_alignment(struct pci_dev *dev, int resno) > >>>> struct resource tmp; > >>>> enum pci_bar_type type; > >>>> int reg = pci_iov_resource_bar(dev, resno,&type); > >>>> - > >>>> + > >>>> + if (!sriov_enabled) > >>>> + return 0; > >>>> + > >>>> if (!reg) > >>>> return 0; > >>>> > >>>> @@ -667,6 +689,9 @@ int pci_enable_sriov(struct pci_dev *dev, int > nr_virtfn) > >>>> { > >>>> might_sleep(); > >>>> > >>>> + if (!sriov_enabled) > >>>> + return -EPERM; > >>>> + > >>>> if (!dev->is_physfn) > >>>> return -ENODEV; > >>>> > >>>> @@ -682,6 +707,9 @@ void pci_disable_sriov(struct pci_dev *dev) > >>>> { > >>>> might_sleep(); > >>>> > >>>> + if (!sriov_enabled) > >>>> + return; > >>>> + > >>>> if (!dev->is_physfn) > >>>> return; > >>> > >>> Ah my patch needs these three checks as well, FWIW. > >>> > >>> Thanks, > >>> Nish > >>> > > > > -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html