Re: bus: mhi: parse_xfer_event running transfer completion callbacks more than once for a given buffer

Hemant Kumar <hemantk@xxxxxxxxxxxxxx> · Thu, 2 Sep 2021 14:49:37 -0700

Hi Paul,

On 8/31/2021 10:17 PM, Paul Davey wrote:
On Fri, 2021-08-27 at 16:51 +1200, Paul Davey wrote:
On Thu, 2021-08-26 at 09:54 -0600, Jeffrey Hugo wrote:
On 8/23/2021 12:47 AM, Paul Davey wrote:
Hi Hemant, Jeffery

I have some more information after some testing.

Do you have a log which prints the TRE being processed?
Basically i
am
trying understand this : by the time you get double free
issue,
is
there
any pattern with respect to the TRE that is being processed.
For
example
when host processed the given TRE for the first time with
RP1,
stale
TRE
was posted by Event RP2 right after RP1

->RP1 [TRE1]
->RP2 [TRE1]

or occurrence of stale TRE event is random?

[...]

Secondly we saw the following pattern in completion events:

mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2 ptr:
7c4004e0
mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2 ptr:
7c400520
mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2 ptr:
7c4004c0
mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2 ptr:
7c4004b0
mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2 ptr:
7c4004a0

Here we can see that instead of a completion event for 7c4004d0 we
have
one for 7c400520 which is significantly ahead of the other point and
from the list of TREs I store in mhi_gen_tre I suspect that 7c400520
is
the next TRE to be used in the TRE ring at this time, as the other
information shows it would be the oldest entry in that list.  I am
not
sure what could have caused this but this is a different case to the
modem repeating the same TRE in a completion event.

I have considered the code further, and while I have seen cases of
identical TRE completion events occurring, I do not think these result
in the double free case because if the event is actually the same as
the last one then the new dev_rp which parse_xfer_event will attempt to
advance the local_rp to will already be equal to the local_rp and the
whole loop will be skipped in the first place.  The troublesome
behaviour comes from the case I describe above where a jump to a
"future" TRE's completion event seems to occur followed by a
continuation of the order, this results in the tre_ring rp being
advanced to that future TRE and then the next completion event
following the previous ordered pattern would be before that rp location
and the loop will run through the entire tre_ring to reach the new rp
location.

I do have another question though, the driver code seems to in some
cases take the mhi_chan->lock when updating the doorbell register, but
not when queueing new transfers.  What is the actual purpose of this
lock and why does it seem so inconsistently used?  Is there any chance
that some of my problems may be the result of queueing new transfers
racing somehow with completion event processing?
Are you pointing to MHI_EV_CC_OOB and MHI_EV_CC_DB_MODE completion code 
? To prove that race condition you disable burst mode as an experiment

In pci_generic.c
#define MHI_CHANNEL_CONFIG_HW_UL(ch_num, ch_name, el_count, ev_ring)

- .doorbell = MHI_DB_BRST_ENABLE
+ .doorbell = MHI_DB_BRST_DISABLE

I agree that locking is an issue in that case as we are taking 
read/write channel lock in parse_xfer_event and using only pm_lock in 
queue API.

Thanks,
Hemant

Thanks,
Paul

--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora 
Forum, a Linux Foundation Collaborative Project