Re: bus: mhi: parse_xfer_event running transfer completion callbacks more than once for a given buffer

Jeffrey Hugo <quic_jhugo@xxxxxxxxxxx> · Mon, 16 Aug 2021 07:48:29 -0600

On 8/13/2021 5:10 PM, Hemant Kumar wrote:
One more thing to add

On 8/13/2021 3:55 PM, Hemant Kumar wrote:
Hi Paul,

On 8/6/2021 2:43 AM, Loic Poulain wrote:
+ MHI people

On Fri, 6 Aug 2021 at 06:20, Paul Davey <Paul.Davey@xxxxxxxxxxxxxxxxxxx> wrote:

Hi linux-arm-msm list,

We have been using the mhi driver with a Sierra EM9191 5G modem module
and have seen an occasional issue where the kernel crashes with
messages showing "BUG: Bad page state". We debugged this further and
found it was due to mhi_net_ul_callback freeing the same skb multiple
times. Further debugging tracked this down to a case where
parse_xfer_event computed a dev_rp from the passed event's ev_tre
which does not lie within the region of valid "in flight" transfers
for the tre_ring.  See the patch below for how this was detected.

I believe that receiving such an event results in the loop which runs
completion events for the transfers to re-run some completion
callbacks as it walks all the way around the ring again to reach the
invalid dev_rp position.
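
A minimal sketch of the failure mode described above, not the actual
MHI code: it models the ring with plain indices, and the names
demo_ring, demo_idx_in_flight and demo_walk_count are hypothetical. It
assumes the in-flight elements are those in [rp, wp) with wrap-around,
and that the completion loop advances until it reaches a stop position.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified ring model: indices instead of pointers. */
struct demo_ring {
	size_t len; /* number of elements in the ring */
	size_t rp;  /* oldest in-flight element */
	size_t wp;  /* next free slot */
};

/*
 * Return true if idx refers to an in-flight element, i.e. it lies in
 * the half-open range [rp, wp), taking wrap-around into account.
 */
static bool demo_idx_in_flight(const struct demo_ring *ring, size_t idx)
{
	assert(ring->len > 0);
	if (idx >= ring->len)
		return false;
	if (ring->rp <= ring->wp)
		return idx >= ring->rp && idx < ring->wp;
	return idx >= ring->rp || idx < ring->wp;
}

/*
 * Count how many elements a "walk from rp until stop" loop would
 * complete. A stop position behind rp makes the walk wrap around the
 * ring, so it completes more elements than are actually in flight.
 */
static size_t demo_walk_count(const struct demo_ring *ring, size_t stop)
{
	size_t pos = ring->rp;
	size_t n = 0;

	assert(stop < ring->len);
	while (pos != stop) {
		pos = (pos + 1) % ring->len;
		n++;
	}
	return n;
}
```

With len = 8, rp = 2 and wp = 5, three elements are in flight. A stop
position of 5 completes exactly those three, while a stale stop
position of 1 completes seven, which is where repeated callbacks would
come from in this model.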
Do you have a log which prints the TRE being processed? Basically, I am
trying to understand this: by the time you hit the double free issue, is
there any pattern with respect to the TRE that is being processed? For
example, when the host processed the given TRE for the first time with
RP1, was the stale TRE posted by event RP2 right after RP1,

->RP1 [TRE1]
->RP2 [TRE1]

or is the occurrence of the stale TRE event random?
If you can log all the events you are processing, we can check, when a
second event arrives for an already processed TRE, whether the transfer
length is the same as for the originally processed TRE or different. If
the length is different, does it match the TRE which was queued but not
processed yet? You can print the mhi_queue_skb TRE content while
queuing the skb. How easy is it to reproduce this issue? Does it show
up in a high throughput use case or is it random? Are there any
specific steps to reproduce it?
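
One way the logging suggested above could be structured, as a rough
sketch only: remember the transfer length of each completed TRE slot
and classify a later event for the same slot. Everything here
(demo_log, demo_log_event, demo_log_requeue, the fixed ring size) is
hypothetical and not part of the MHI driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_RING_LEN 8 /* assumed ring size, for illustration only */

enum demo_verdict {
	DEMO_FIRST,         /* first completion seen for this slot */
	DEMO_DUP_SAME_LEN,  /* repeated event, same transfer length */
	DEMO_DUP_OTHER_LEN, /* repeated event, different length */
};

struct demo_log {
	int seen[DEMO_RING_LEN];     /* slot already completed? */
	uint32_t len[DEMO_RING_LEN]; /* length of first completion */
};

/* Record a completion event for slot idx and classify it. */
static enum demo_verdict demo_log_event(struct demo_log *log, size_t idx,
					uint32_t xfer_len)
{
	assert(idx < DEMO_RING_LEN);
	if (!log->seen[idx]) {
		log->seen[idx] = 1;
		log->len[idx] = xfer_len;
		return DEMO_FIRST;
	}
	return log->len[idx] == xfer_len ? DEMO_DUP_SAME_LEN
					 : DEMO_DUP_OTHER_LEN;
}

/* Clear the slot when a new buffer is queued into it. */
static void demo_log_requeue(struct demo_log *log, size_t idx)
{
	assert(idx < DEMO_RING_LEN);
	log->seen[idx] = 0;
}
```

In a real trace one would print the verdict together with the event
contents; a DEMO_DUP_OTHER_LEN result would be the case where the
length should be compared against TREs that are queued but not yet
processed.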

I would wonder, what is the codebase being tested?  Are the latest MHI 
patches included?  When we saw something similar on AIC100, it was 
addressed by the sanity check changes I upstreamed.