Re: [PATCH] soc: qcom: qmi: Signal the txn completion after releasing the mutex

Sricharan Ramabadhran <quic_srichara@xxxxxxxxxxx> · Wed, 2 Aug 2023 13:37:26 +0530

On 8/2/2023 5:11 AM, Chris Lew wrote:


On 8/1/2023 4:13 AM, Sricharan Ramabadhran wrote:
Hi,

On 8/1/2023 6:06 AM, Chris Lew wrote:


On 7/31/2023 8:19 AM, Pavan Kondeti wrote:
On Mon, Jul 31, 2023 at 06:37:55PM +0530, Praveenkumar I wrote:
txn is in #1 stack

Worker #1                                       Worker #2
********                    *********

qmi_txn_wait(txn)                               qmi_handle_message
    |                                                  |
    |                                                  |
  wait_for_complete(txn->complete)                    ....
    |                                             
mutex_lock(txn->lock)
    |                                                  |
  mutex_lock(txn->lock)                                |
    .....                                         complete(txn->lock)
    | mutex_unlock(txn->lock)
    |
  mutex_unlock(txn->lock)

In this case above, while #2 is doing the mutex_unlock(txn->lock),
in between releasing lock and doing other lock related wakeup, #2 gets
scheduled out. As a result #1, acquires the lock, unlocks, also
frees the txn also (where the lock resides)

Now #2, gets scheduled again and tries to do the rest of the lock
related wakeup, but lock itself is invalid because txn itself is gone.

Fixing this, by doing the mutex_unlock(txn->lock) first and then
complete(txn->lock) in #2

Fixes: 3830d0771ef6 ("soc: qcom: Introduce QMI helpers")
Cc: stable@xxxxxxxxxxxxxxx
Signed-off-by: Sricharan Ramabadhran <quic_srichara@xxxxxxxxxxx>
Signed-off-by: Praveenkumar I <quic_ipkumar@xxxxxxxxxxx>
---
  drivers/soc/qcom/qmi_interface.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/soc/qcom/qmi_interface.c 
b/drivers/soc/qcom/qmi_interface.c
index 78d7361fdcf2..92e29db97359 100644
--- a/drivers/soc/qcom/qmi_interface.c
+++ b/drivers/soc/qcom/qmi_interface.c
@@ -505,12 +505,13 @@ static void qmi_handle_message(struct 
qmi_handle *qmi,
                  pr_err("failed to decode incoming message\n");
              txn->result = ret;
-            complete(&txn->completion);
          } else  {
              qmi_invoke_handler(qmi, sq, txn, buf, len);
          }
          mutex_unlock(&txn->lock);
+        if (txn->dest && txn->ei)
+            complete(&txn->completion);
      } else {
          /* Create a txn based on the txn_id of the incoming 
message */
          memset(&tmp_txn, 0, sizeof(tmp_txn));

What happens in a remote scenario where the waiter gets timed out at 
the
very same time you are releasing the mutex but before calling
complete()? The caller might end up freeing txn structure and it 
results
in the same issue you are currently facing.

Thanks,
Pavan

I think downstream we had various attempts of moving the signal 
around trying to avoid this, but hit scenarios like the one Pavan 
described.

We eventually settled on removing the txn->lock and treating the 
qmi->txn_lock as a big lock. This remedied the issue where the 
txn->lock goes out of scope since qmi->txn_lock is tied to the qmi 
handle.


  ok agree. Using qmi->txn_lock looks a better approach.
  That said, this race between mutex lock/unlock looks odd though.
  If i remember we saw the issue only with CONFIG_DEBUG_LOCK_ALLOC.
  Was that the same case for you guys as well ?

  Otherwise, ideally handling all members of the object inside lock
  should be the right solution (ie moving the wait_for_complete(txn)
  inside the mutex_lock in qmi_txn_wait. That should take care of the
  scenario that Pavan described too.


No, we saw the issue even without CONFIG_DEBUG_LOCK_ALLOC. The 
callstacks always ended up showing that the mutex could be acquired 
before mutex_unlock() completely finished.

It didn't seem wise to poke at the mutex implementation so we went with 
the txn_lock.

 ok, that's strange. That effectively means, mutex_lock/unlock are not
 working/protecting the critical section ? Then qmi->txn_lock also would
 result in a similar issue ? I guess, in this case, during issue, txn
 (which holds the lock) was going out of context, while still the txn
 was in used in other thread. That effectively shows up a mutex issue
 maybe. While the downstream change to use qmi->txn_lock would fix the
 mutex issue, will have to check if the txn object itself is protected
 correctly.

Regards,
 Sricharan