Re: [PATCH] ath10k: pci: use mutex for diagnostic window CE polling

Brian Norris <briannorris@xxxxxxxxxxxx> · Mon, 25 Mar 2019 15:14:05 -0700

Hi Michal,

Thanks for the quick and useful response.

On Mon, Mar 25, 2019 at 2:20 PM Michał Kazior <kazikcz@xxxxxxxxx> wrote:
> On Mon, 25 Mar 2019 at 21:27, Brian Norris <briannorris@xxxxxxxxxxxx> wrote:
> > It would appear that this triggers new warnings
> >
> >   BUG: sleeping function called from invalid context
> >
> > when handling firmware crashes. The call stack is
> >
> >   ath10k_pci_fw_crashed_dump
> >     -> ath10k_pci_dump_memory
> >     ...
> >       -> ath10k_pci_diag_read_mem
> >
> > and the problem is that we're holding the 'data_lock' spinlock with
> > softirqs disabled, while later trying to grab this new mutex.
>
> No, the spinlock is not the real problem. The real problem is you're
> trying to hold a mutex on a path which is potentially atomic /
> non-sleepable: ath10k_pci_napi_poll().

I'll admit here that I've been testing a variety of kernels here
(including upstream), and some of them are prior to this commit:

3c97f5de1f28 ath10k: implement NAPI support

So this was running in a tasklet, not NAPI polling. But then my
understanding was still incorrect: tasklets are also an atomic
(softirq) context. Doh.

I guess I'd say the problem is "both".

>
> > Unfortunately, data_lock is used in a lot of places, and it's unclear if
> > it can be migrated to a mutex as well. It seems like it probably can be,
> > but I'd have to audit a little more closely.
>
> It can't be migrated to a mutex. It's intended to synchronize top half
> with bottom half. It has to be an atomic non-sleeping lock mechanism.

Ack, thanks for the correction.

> What you need to do is make sure ath10k_pci_diag_read_mem() and
> ath10k_pci_diag_write_mem() are never called from an atomic context.

I knew that part already :)

> For one, you'll need to defer ath10k_pci_fw_crashed_dump to a worker.
> Maybe into ar->restart_work which the dump function calls now.

Hmm, that's an idea -- although I'm not sure if I'd steal
'restart_work', or create a different work item on the same queue. But
either way, we'd still also have to avoid holding 'data_lock', and at
that point, I'm not sure if we're losing desirable properties of these
firmware dumps -- it allows more "stuff" to keep going on while we're
preparing to dump the device memory state.

> To get rid of data_lock from ath10k_pci_fw_crashed_dump() you'll need
> to at least make fw_crash_counter into an atomic_t. This is just from
> a quick glance.

Yes, we'd need at least that much. We'd also need some other form of
locking to ensure exclusion between all users of
ar->coredump.fw_crash_data and similar. At the moment, that's
'data_lock', but I suppose we get a similar exclusion if all the
dump/restart work is on the same workqueue.

I'm still not quite sure if this is 5.1-rc material, or I should just
revert for 5.1.

Thanks,
Brian