Hi Kalle,

On Wed, Feb 06, 2019 at 05:41:43PM -0800, Brian Norris wrote:
> The DIAG copy engine is only used via polling, but it holds a spinlock
> with softirqs disabled. Each iteration of our read/write loops can
> theoretically take 20ms (two 10ms timeout loops), and this loop can be
> run an unbounded number of times while holding the spinlock -- dependent
> on the request size given by the caller.
>
> As of commit 39501ea64116 ("ath10k: download firmware via diag Copy
> Engine for QCA6174 and QCA9377."), we transfer large chunks of firmware
> memory using this mechanism. With large enough firmware segments, this
> becomes an exceedingly long period for disabling soft IRQs. For example,
> with a 500KiB firmware segment, in testing QCA6174A, I see 200 loop
> iterations of about 50-100us each, which can total about 10-20ms.
>
> In reality, we don't really need to block softirqs for this duration.
> The DIAG CE is only used in polling mode, and we only need to hold
> ce_lock to make sure any CE bookkeeping is done without screwing up
> another CE. Otherwise, we only need to ensure exclusion between
> ath10k_pci_diag_{read,write}_mem() contexts.
>
> This patch moves to use fine-grained locking for the shared ce_lock,
> while adding a new mutex just to ensure mutual exclusion of diag
> read/write operations.
>
> Tested on QCA6174A, firmware version WLAN.RM.4.4.1-00132-QCARMSWPZ-1.
>
> Fixes: 39501ea64116 ("ath10k: download firmware via diag Copy Engine for QCA6174 and QCA9377.")
> Signed-off-by: Brian Norris <briannorris@xxxxxxxxxxxx>

It would appear that this triggers new warnings

  BUG: sleeping function called from invalid context

when handling firmware crashes. The call stack is

  ath10k_pci_fw_crashed_dump
    -> ath10k_pci_dump_memory
    ...
    -> ath10k_pci_diag_read_mem

and the problem is that we're holding the 'data_lock' spinlock with
softirqs disabled, while later trying to grab this new mutex.

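To make the failure mode concrete, here is a small userspace model of the
locking pattern. This is a sketch only: every model_* name is made up for
illustration, the "mutex" is a plain flag in a single-threaded program, and
none of it is the actual driver code or a kernel API. It only mimics the
rule that a sleeping lock must not be taken while a spinlock is held.

```c
#include <assert.h>

/* Userspace model only: these helpers are hypothetical, not kernel APIs. */

static int atomic_depth;    /* > 0 models "spinlock held, softirqs disabled" */
static int invalid_sleeps;  /* counts modelled "sleeping in invalid context" hits */

struct model_mutex {
	int held;
};

static struct model_mutex diag_mutex;

/* Models spin_lock_bh()/spin_unlock_bh() on data_lock. */
static void model_data_lock(void)
{
	atomic_depth++;
}

static void model_data_unlock(void)
{
	assert(atomic_depth > 0);
	atomic_depth--;
}

/* Models mutex_lock(): it may sleep, which is only legal outside atomic context. */
static void model_mutex_lock(struct model_mutex *m)
{
	if (atomic_depth > 0)
		invalid_sleeps++;  /* the kernel would print the BUG splat here */
	assert(!m->held);          /* single-threaded model: never contended */
	m->held = 1;
}

static void model_mutex_unlock(struct model_mutex *m)
{
	m->held = 0;
}

/* Models ath10k_pci_diag_read_mem() after the patch: takes the new mutex. */
static void model_diag_read_mem(void)
{
	model_mutex_lock(&diag_mutex);
	/* ... the polling read loop would run here ... */
	model_mutex_unlock(&diag_mutex);
}

/* Models the crash-dump path: data_lock is held across the diag read. */
static int model_fw_crashed_dump(void)
{
	int before = invalid_sleeps;

	model_data_lock();
	model_diag_read_mem();
	model_data_unlock();
	return invalid_sleeps - before;
}

/* Models a caller that holds no spinlock (e.g. firmware download). */
static int model_fw_download(void)
{
	int before = invalid_sleeps;

	model_diag_read_mem();
	return invalid_sleeps - before;
}
```

In this model, model_fw_crashed_dump() reports one invalid sleep and
model_fw_download() reports none, which matches what we see: the firmware
download path is fine, and only the crash-dump path trips the warning.
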
Unfortunately, data_lock is used in a lot of places, and it's unclear if
it can be migrated to a mutex as well. It seems like it probably can be,
but I'd have to audit a little more closely.

Any thoughts on what the short- and long-term solutions should be? I can
send a revert, to get v5.1 fixed. But it still seems like we should
avoid disabling softirqs for so long.

Brian