Hi Michal, Thanks for the quick and useful response. On Mon, Mar 25, 2019 at 2:20 PM Michał Kazior <kazikcz@xxxxxxxxx> wrote: > On Mon, 25 Mar 2019 at 21:27, Brian Norris <briannorris@xxxxxxxxxxxx> wrote: > > It would appear that this triggers new warnings > > > > BUG: sleeping function called from invalid context > > > > when handling firmware crashes. The call stack is > > > > ath10k_pci_fw_crashed_dump > > -> ath10k_pci_dump_memory > > ... > > -> ath10k_pci_diag_read_mem > > > > and the problem is that we're holding the 'data_lock' spinlock with > > softirqs disabled, while later trying to grab this new mutex. > > No, the spinlock is not the real problem. The real problem is you're > trying to hold a mutex on a path which is potentially atomic / > non-sleepable: ath10k_pci_napi_poll(). I'll admit here that I've been testing a variety of kernels here (including upstream), and some of them are prior to this commit: 3c97f5de1f28 ath10k: implement NAPI support So this was running in a tasklet, not NAPI polling. But then my understanding was still incorrect: tasklets are also an atomic (softirq) context. Doh. I guess I'd say the problem is "both". > > > Unfortunately, data_lock is used in a lot of places, and it's unclear if > > it can be migrated to a mutex as well. It seems like it probably can be, > > but I'd have to audit a little more closely. > > It can't be migrated to a mutex. It's intended to synchronize top half > with bottom half. It has to be an atomic non-sleeping lock mechanism. Ack, thanks for the correction. > What you need to do is make sure ath10k_pci_diag_read_mem() and > ath10k_pci_diag_write_mem() are never called from an atomic context. I knew that part already :) > For one, you'll need to defer ath10k_pci_fw_crashed_dump to a worker. > Maybe into ar->restart_work which the dump function calls now. Hmm, that's an idea -- although I'm not sure if I'd steal 'restart_work', or create a different work item on the same queue. But either way, we'd still also have to avoid holding 'data_lock', and at that point, I'm not sure if we're losing desirable properties of these firmware dumps -- it allows more "stuff" to keep going on while we're preparing to dump the device memory state. > To get rid of data_lock from ath10k_pci_fw_crashed_dump() you'll need > to at least make fw_crash_counter into an atomic_t. This is just from > a quick glance. Yes, we'd need at least that much. We'd also need some other form of locking to ensure exclusion between all users of ar->coredump.fw_crash_data and similar. At the moment, that's 'data_lock', but I suppose we get a similar exclusion if all the dump/restart work is on the same workqueue. I'm still not quite sure if this is 5.1-rc material, or I should just revert for 5.1. Thanks, Brian