On 30 July 2013 20:35, Kalle Valo <kvalo@xxxxxxxxxxxxxxxx> wrote: > Michal Kazior <michal.kazior@xxxxxxxxx> writes: > >> There was a slight race during PCI shutdown. Since >> interrupts weren't really stopped (only Copy >> Engine interrupts were disabled through device hw >> registers) it was possible for a firmware >> indication (crash) interrupt to come in after >> tasklets were synced/killed. This would cause >> memory corruption and a panic in most cases. It >> was also possible for interrupt to come before CE >> was initialized during device probing. >> >> Interrupts are required for BMI phase so they are enabled as soon as >> power_up() is called but are freed upon both power_down() and stop() >> so there's asymmetry here. As by design stop() cannot be followed by >> start() it is okay. Both power_down() and stop() should be merged >> later on to avoid confusion. > > Why are the interrupts freed both in power_down() and stop()? I don't > get that. > > What if we call disable_irq() in power_down() instead? power_down() must call free_irq(), because power_up() calls request_irq() (if you want the symmetry). If anything, the stop() should call disable_irq(), but wouldn't that mean start() should call enable_irq()? But than, irqs are needed before start().. I did think about disable_irq() but AFAIR you need to enable_irq() later on (so either way you need to keep track of the irq state or you'll get a ton of WARN_ONs from the system). I'll double check that and report back later >> Before this can be really properly fixed var/hw >> init code split is necessary. >> >> Signed-off-by: Michal Kazior <michal.kazior@xxxxxxxxx> >> --- >> >> Please note: this is based on my (still under >> review at the time of posting) previous patchests: >> device setup refactor and recovery. >> >> I'm posting this before those patchsets are merged >> so anyone interested in testing this fix (I can't >> reproduce the problem on my setup) can give it a >> try. > > This was reported by Ben, right? So this sould have a Reported-by line > attributing him. Yes. I'll fix that, provided we get through the review with the patch :) >> @@ -1783,16 +1792,24 @@ static int ath10k_pci_hif_power_up(struct ath10k *ar) >> return 0; >> >> err_ce: >> + /* XXX: Until var/hw init is split it's impossible to fix the ordering >> + * here so we must call stop_intr() here too to prevent interrupts after >> + * CE is teared down. It's okay to double call the stop_intr() >> */ > > "FIXME:" Ok. >> exit: >> + ar_pci->intr_started = ret == 0; > > A bit too clever for the sake of readibility for my taste, but I guess > it's ok. > >> --- a/drivers/net/wireless/ath/ath10k/pci.h >> +++ b/drivers/net/wireless/ath/ath10k/pci.h >> @@ -198,6 +198,7 @@ struct ath10k_pci { >> * interrupts. >> */ >> int num_msi_intrs; >> + bool intr_started; > > Adding a new state variable makes me worried. I really would prefer a > solution which would not require that. I know that. That's why I mentioned in the commit log that it is more of a workaround than a real fix. Me, I don't like this either but a real fix requires a lot of rework from what I can tell. This bug can be triggered more easily now apparently after recovery patches went in. I'm not experiencing it but I get reports of rare panics when a machine is left idle for a very long time with interfaces brought down. > Also if we call request_irq() in ath10k_pci_probe() we should also call > free_irq() in ath10k_pci_remove() for symmetry. Just doing a temporary > hack will most likely stay forever :) With the patch interrupts are temporarily enabled&disabled for probe_fw() during pci_probe() and are then not requested until mac80211 start(). Pozdrawiam / Best regards, Michał Kazior. -- To unsubscribe from this list: send the line "unsubscribe linux-wireless" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html