Re: [PATCH 3/3] soundwire: qcom: add wake up interrupt support

Srinivas Kandagatla <srinivas.kandagatla@xxxxxxxxxx> · Wed, 23 Feb 2022 16:22:01 +0000

On 23/02/2022 00:31, Pierre-Louis Bossart wrote:

On 2/22/22 16:52, Srinivas Kandagatla wrote:

On 22/02/2022 19:26, Pierre-Louis Bossart wrote:

+static irqreturn_t qcom_swrm_wake_irq_handler(int irq, void *dev_id)
+{
+    struct qcom_swrm_ctrl *swrm = dev_id;
+    int ret = IRQ_HANDLED;
+    struct sdw_slave *slave;
+
+    clk_prepare_enable(swrm->hclk);
+
+    if (swrm->wake_irq > 0) {
+        if (!irqd_irq_disabled(irq_get_irq_data(swrm->wake_irq)))
+            disable_irq_nosync(swrm->wake_irq);
+    }
+
+    /*
+     * resume all the slaves which must have potentially generated this
+     * interrupt, this should also wake the controller at the same
time.
+     * this is much safer than waking controller directly that will
deadlock!
+     */
There should be no difference if you first resume the controller and
then attached peripherals, or do a loop where you rely on the pm_runtime
framework.

The notion that there might be a dead-lock is surprising, you would need
to elaborate here.Issue is, if wakeup interrupt resumes the controller
first which can
trigger an slave pending interrupt (ex: Button press event) in the
middle of resume that will try to wake the slave device which in turn
will try to wake parent in the middle of resume resulting in a dead lock.

This was the best way to avoid dead lock.

Not following, sorry. if you use pm_runtime functions and it so happens
that the resume already started, then those routines would wait for the
resume to complete.
yes that is true,

TBH, I was trying to reproduce the issue since morning to collect some 
traces but no luck so far, I hit these issues pretty much rarely. Now 
code has changed since few months back am unable to reproduce this 
anymore. Or it might be just the state of code I had while writing this up.

But when I hit the issue, this is how it looks like:

1. IRQ Wake interrupt resume parent.

2. parent is in middle of resuming

3. Slave Pend interrupt in controller fired up

4. because of (3) child resume is requested and then the parent resume 
blocked on (2) to finish.

5. from (2) we also trying to resume child.

6. (5) is blocked on (4) to finish which is blocked on (2) to finish

we are dead locked. Only way for me to avoid dead lock was to wake the 
child up after IRQ wake interrupts.

here is the stack trace of blocked-tasks from sysrq

root@linaro-gnome:~# [  182.327220] sysrq: Show Blocked State
[  182.331063] task:irq/20-soundwir state:D stack:    0 pid:  445 ppid: 
    2 flags:0x00000008
[  182.339655] Call trace:
[  182.342176]  __switch_to+0x168/0x1b8
[  182.345864]  __schedule+0x2a8/0x880
[  182.349459]  schedule+0x54/0xf0
[  182.352700]  rpm_resume+0xc4/0x550
[  182.356211]  rpm_resume+0x348/0x550
[  182.359805]  rpm_resume+0x348/0x550
[  182.363400]  __pm_runtime_resume+0x48/0xb8
[  182.367616]  sdw_handle_slave_status+0x1f8/0xf80
[  182.372371]  qcom_swrm_irq_handler+0x5c4/0x6f0
[  182.376942]  irq_thread_fn+0x2c/0xa0
[  182.380626]  irq_thread+0x16c/0x288
[  182.384221]  kthread+0x11c/0x128
[  182.387549]  ret_from_fork+0x10/0x20
[  182.391231] task:irq/187-swr_wak state:D stack:    0 pid:  446 ppid: 
    2 flags:0x00000008
[  182.399819] Call trace:
[  182.402339]  __switch_to+0x168/0x1b8
[  182.406019]  __schedule+0x2a8/0x880
[  182.409614]  schedule+0x54/0xf0
[  182.412854]  rpm_resume+0xc4/0x550
[  182.416363]  rpm_resume+0x348/0x550
[  182.419957]  rpm_resume+0x348/0x550
[  182.423552]  __pm_runtime_resume+0x48/0xb8
[  182.427767]  swrm_runtime_resume+0x98/0x3d0
[  182.432079]  pm_generic_runtime_resume+0x2c/0x48
[  182.436832]  __rpm_callback+0x44/0x190
[  182.440693]  rpm_callback+0x6c/0x78
[  182.444289]  rpm_resume+0x2f0/0x550
[  182.447883]  __pm_runtime_resume+0x48/0xb8
[  182.452099]  qcom_swrm_wake_irq_handler+0x20/0x128
[  182.457033]  irq_thread_fn+0x2c/0xa0
[  182.460712]  irq_thread+0x16c/0x288
[  182.464306]  kthread+0x11c/0x128
[  182.467634]  ret_from_fork+0x10/0x20

As am unable to reproduce this issue anymore so I will remove the code 
dealing with slaves directly for now till we are able to really 
reproduce the issue.

In other words, there can be multiple requests to resume, but only the
*first* request will trigger a transition and others will just increase
a refcount.

In addition, the pm_runtime framework guarantees that the peripheral
device can only start resuming when the parent controller device is
fully resumed.

While I am at it, one thing that kept us busy as well is the
relationship between system suspend and pm_runtime suspend. In the
generic case a system suspend will cause a pm_runtime resume before you
can actually start the system suspend, but you might be able to skip
this step. In the Intel case when the controller and its parent device
were suspended we had to pm_runtime resume everything because some
registers were no longer accessible.
Interesting, thanks for the hints, will keep that in mind.

--srini