Hi Brian,

On Mon, 25 Mar 2019 at 21:27, Brian Norris <briannorris@xxxxxxxxxxxx> wrote:
> Hi Kalle,
>
> On Wed, Feb 06, 2019 at 05:41:43PM -0800, Brian Norris wrote:
> > The DIAG copy engine is only used via polling, but it holds a spinlock
> > with softirqs disabled. Each iteration of our read/write loops can
> > theoretically take 20ms (two 10ms timeout loops), and this loop can be
> > run an unbounded number of times while holding the spinlock -- dependent
> > on the request size given by the caller.
> >
> > As of commit 39501ea64116 ("ath10k: download firmware via diag Copy
> > Engine for QCA6174 and QCA9377."), we transfer large chunks of firmware
> > memory using this mechanism. With large enough firmware segments, this
> > becomes an exceedingly long period for disabling soft IRQs. For example,
> > with a 500KiB firmware segment, in testing QCA6174A, I see 200 loop
> > iterations of about 50-100us each, which can total about 10-20ms.
> >
> > In reality, we don't really need to block softirqs for this duration.
> > The DIAG CE is only used in polling mode, and we only need to hold
> > ce_lock to make sure any CE bookkeeping is done without screwing up
> > another CE. Otherwise, we only need to ensure exclusion between
> > ath10k_pci_diag_{read,write}_mem() contexts.
> >
> > This patch moves to use fine-grained locking for the shared ce_lock,
> > while adding a new mutex just to ensure mutual exclusion of diag
> > read/write operations.
> >
> > Tested on QCA6174A, firmware version WLAN.RM.4.4.1-00132-QCARMSWPZ-1.
> >
> > Fixes: 39501ea64116 ("ath10k: download firmware via diag Copy Engine for QCA6174 and QCA9377.")
> > Signed-off-by: Brian Norris <briannorris@xxxxxxxxxxxx>
>
> It would appear that this triggers new warnings
>
>   BUG: sleeping function called from invalid context
>
> when handling firmware crashes. The call stack is
>
>   ath10k_pci_fw_crashed_dump
>   -> ath10k_pci_dump_memory
>   ...
>   -> ath10k_pci_diag_read_mem
>
> and the problem is that we're holding the 'data_lock' spinlock with
> softirqs disabled, while later trying to grab this new mutex.

No, the spinlock is not the real problem. The real problem is that you're
trying to take a mutex on a path which is potentially atomic /
non-sleepable: ath10k_pci_napi_poll().

> Unfortunately, data_lock is used in a lot of places, and it's unclear if
> it can be migrated to a mutex as well. It seems like it probably can be,
> but I'd have to audit a little more closely.

It can't be migrated to a mutex. It's intended to synchronize the top half
with the bottom half, so it has to be an atomic, non-sleeping locking
mechanism.

What you need to do is make sure ath10k_pci_diag_read_mem() and
ath10k_pci_diag_write_mem() are never called from an atomic context.

For one, you'll need to defer ath10k_pci_fw_crashed_dump() to a worker,
maybe into ar->restart_work, which the dump function calls now. To get rid
of data_lock from ath10k_pci_fw_crashed_dump() you'll need to at least
make fw_crash_counter into an atomic_t.

This is just from a quick glance.

Michał