On Thu, Oct 17, 2024 at 08:50:11AM +0100, Marc Zyngier wrote: > On Thu, 17 Oct 2024 06:23:35 +0100, > Manivannan Sadhasivam <manivannan.sadhasivam@xxxxxxxxxx> wrote: > > > > On Thu, Jul 28, 2022 at 02:17:01PM +0200, Johan Hovold wrote: > > > On Wed, Jul 27, 2022 at 02:57:16PM -0500, Bjorn Helgaas wrote: > > > > On Tue, Jul 26, 2022 at 11:56:59AM +0200, Johan Hovold wrote: > > > > > On Mon, Jul 25, 2022 at 06:35:27PM +0100, Marc Zyngier wrote: > > > > > > On Mon, 25 Jul 2022 16:18:48 +0100, > > > > > > Johan Hovold <johan@xxxxxxxxxx> wrote: > > > > > > > > > > > > Since when is unloading modules something that is expected to > > > > > > > work perfectly? I keep hearing "well, don't do that then" when > > > > > > > someone complains about unloading this module while doing this > > > > > > > or that broke something. (And it's only root that can unload > > > > > > > modules in the first place.) > > > > > > > > > > > > Well, maybe I have higher standards. For the stuff I maintain, I > > > > > > now point-blank refuse to support module unloading if this can > > > > > > result in a crash. Or worse. > > > > > > > > > > That makes sense for regular interrupt controllers where its hard to > > > > > tell that all consumers are gone, but I don't think that should > > > > > limit the usefulness of having modular PCI controller drivers where > > > > > we know that the consumers are gone after deregistering the bus > > > > > (i.e. the consumers are descendants of the controller in the device > > > > > tree). > > > > > > > > Those consumers are endpoint drivers, so I think this depends on those > > > > drivers correctly unmapping the interrupts they use, right? > > > > > > Right. For MSI this means that pci_alloc_irq_vectors() in probe should > > > be matched by pci_free_irq_vectors() on remove. > > > > > > For legacy interrupts, which can be shared, the mapping is created by > > > PCI core when binding to the first device and can only be disposed by > > > the host-bridge driver once all descendants have been removed. > > > > > > The endpoint drivers still need to disable their interrupts of course. > > > > > > Buggy endpoint-driver remove implementations can lead to all sorts of > > > crashes (e.g. after failing to deregister a class device), and if that's > > > a worry then don't unload modules (and possibly disable it completely > > > using CONFIG_MODULE_UNLOAD). > > > > > > > > > > It's useful for developers, but use it at your own risk. > > > > > > > > > > > > > > That said, I agree that if something is next to impossible to > > > > > > > get right, as may be the case with interrupt controllers > > > > > > > generally, then fine, let's disable module unloading for that > > > > > > > class of drivers. > > > > > > > > > > > > > > And this would mean disabling driver unbind for the 20+ driver > > > > > > > PCI drivers that currently implement it to some degree. > > > > > > > > > > > > That would be Bjorn's and Lorenzo's call. > > > > > > > > > > Sure, but I think it would be the wrong decision here. Especially, > > > > > since the end result will likely just be that more drivers will > > > > > become always compiled-in. > > > > > > > > Can you elaborate on this? I think Marc is suggesting that these PCI > > > > controller drivers be modular but not removable. Why would that cause > > > > more of them to be compiled-in? > > > > > > As mentioned earlier in this thread, we only appear to have some 60 > > > drivers in the entire tree that bother to try to implement that. I fear > > > that blocking the use of modules (including being able to unload them) > > > will just make people submit drivers that can only be built in. > > > > > > Not everyone cares about Android's GKI, but being able to unload a > > > module during development is very useful (and keeping that out-of-tree > > > prevents sharing the implementation and make it susceptible to even > > > further bit rot). > > > > > > So continuing to supporting modules properly is a win for everyone (e.g. > > > GKI and developers). > > > > > > > > > > > > Turns out the pcie-qcom driver does not support legacy > > > > > > > > > interrupts so there's no risk of there being any lingering > > > > > > > > > mappings if I understand things correctly. > > > > > > > > > > > > > > > > It still does MSIs, thanks to dw_pcie_host_init(). If you can > > > > > > > > remove the driver while devices are up and running with MSIs > > > > > > > > allocated, things may get ugly if things align the wrong way > > > > > > > > (if a driver still has a reference to an irq_desc or irq_data, > > > > > > > > for example). > > > > > > > > > > > > > > That is precisely the way I've been testing it and everything > > > > > > > appears to be tore down as it should. > > > > > > > > > > > > > > And a PCI driver that has been unbound should have released its > > > > > > > resources, or that's a driver bug. Right? > > > > > > > > > > > > But that's the thing: you can easily remove part of the > > > > > > infrastructure without the endpoint driver even noticing. It may > > > > > > not happen in your particular case if removing the RC driver will > > > > > > also nuke the endpoints in the process, but I can't see this is an > > > > > > absolute guarantee. The crash pointed to by an earlier email is > > > > > > symptomatic of it. > > > > > > > > > > But that was arguably due to a driver bug, which we know how to fix. > > > > > For MSIs the endpoint driver will free its interrupts and all is > > > > > good. > > > > > > > > > > The key observation is that the driver model will make sure that any > > > > > endpoint drivers have been unbound before the bus is deregistered. > > > > > > > > > > That means there are no longer any consumers of the interrupts, > > > > > which can be disposed. For MSI this is handled by > > > > > pci_free_irq_vectors() when unbinding the endpoint drivers. For > > > > > legacy interrupts, which can be shared, the PCIe RC driver needs to > > > > > manage this itself after the consumers are gone. > > > > > > > > The driver model ensures that endpoint drivers have been unbound. But > > > > doesn't the interrupt disposal depend on the correct functioning of > > > > those endpoint drivers? So if a buggy endpoint driver failed to > > > > dispose of them, we're still vulnerable? > > > > > > Just as you are if an endpoint-driver fails to clean up after itself in > > > some other way (e.g. leaves the interrupt enabled). > > > > > > > The IRQ disposal issue should hopefully fixed by this series: > > https://lore.kernel.org/linux-pci/20240715114854.4792-3-kabel@xxxxxxxxxx/ > > > > Then if the dwc driver calls pci_remove_irq_domain() instead of > > irq_domain_remove(), we can be sure that all the IRQs are disposed during the > > driver remove. > > > > So can we proceed with the series making Qcom driver modular? > > Who is volunteering to fix the drivers that will invariably explode > once we allow this? > Why should anyone volunteer first up? If the issue gets reported for a driver blowing up, then the driver has to be fixed by the maintainer or someone, just like any other bug.