Re: bus: mhi: parse_xfer_event running transfer completion callbacks more than once for a given buffer

Paul Davey <Paul.Davey@xxxxxxxxxxxxxxxxxxx> · Sun, 15 Aug 2021 23:30:28 +0000

Hi Hemant,

On Fri, 2021-08-13 at 15:55 -0700, Hemant Kumar wrote:
> Hi Paul,
> 
> On 8/6/2021 2:43 AM, Loic Poulain wrote:
> > + MHI people
> > 
> > On Fri, 6 Aug 2021 at 06:20, Paul Davey <
> > Paul.Davey@xxxxxxxxxxxxxxxxxxx> wrote:
> > > 
> > > Hi linux-arm-msm list,
> > > 
> > > [..]
> 
> Do you have a log which prints the TRE being processed? Basically i
> am 
> trying understand this : by the time you get double free issue, is
> there 
> any pattern with respect to the TRE that is being processed. For
> example
> when host processed the given TRE for the first time with RP1, stale
> TRE 
> was posted by Event RP2 right after RP1
> 
> ->RP1 [TRE1]
> ->RP2 [TRE1]
> 
> or occurrence of stale TRE event is random?

I have not logged all the TRE events yet.  The incidence of processing
an event where the dev_rp inferred from the event (ev_tre + 1) is not
within the software tre_ring->rp and tre_ring->wp seems to be random,
or at least inconsistent, but I will need to collect more debug output
to tell what the sequence of events looks like.

I suspect the double free mostly stems from the fact that if the
computed dev_rp in this function is not between tre_ring->rp and
tre_ring->wp then the only way for the loop to reach the termination
case is to run through the whole ring, running the completion callback
again for buffers that have already been completed.
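
To make the wrap-around concrete, here is a minimal user-space sketch.
It is not the driver code: the ring is modelled with indices rather
than pointers, and sketch_ring, ring_distance, dev_rp_in_flight and
completions_for are hypothetical names made up for illustration.  It
assumes the loop has the shape "while (local_rp != dev_rp)" and that a
valid dev_rp lies no further from rp than wp does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a transfer ring. */
struct sketch_ring {
	size_t len; /* number of elements in the ring */
	size_t rp;  /* oldest element still in flight */
	size_t wp;  /* next free slot */
};

/* Number of steps needed to walk forward from 'from' to 'to'. */
static size_t ring_distance(const struct sketch_ring *r, size_t from, size_t to)
{
	assert(from < r->len && to < r->len);
	return (to + r->len - from) % r->len;
}

/* True if dev_rp is no further ahead of rp than wp is (wrap-safe). */
static bool dev_rp_in_flight(const struct sketch_ring *r, size_t dev_rp)
{
	return ring_distance(r, r->rp, dev_rp) <=
	       ring_distance(r, r->rp, r->wp);
}

/*
 * Mimics a "while (local_rp != dev_rp)" loop and returns how many
 * elements it would complete before terminating.
 */
static size_t completions_for(const struct sketch_ring *r, size_t dev_rp)
{
	size_t local_rp = r->rp;
	size_t n = 0;

	assert(dev_rp < r->len);
	while (local_rp != dev_rp) {
		local_rp = (local_rp + 1) % r->len;
		n++;
	}
	return n;
}
```

With len = 8, rp = 3 and wp = 6, a dev_rp of 5 completes two elements,
while a stale dev_rp of 2 makes the loop visit seven elements even
though only three are in flight.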

> If you can log all the events you are processing, so that we can
> check 
> when second event arrives for already processed TRE, is the transfer 
> length same as originally processed TRE or it is different. In case
> it 
> is different length, is the length matching to the TRE which was
> queue 
> but not processed yet. You can print the mhi_queue_skb TRE content
> while 
> queuing skb. How easy to reproduce this issue ? Is this showing up
> in 
> high throughput use case or it is random? any specific step to
> reproduce 
> this issue?

I can try to collect a history of the TREs that can be logged when the
event occurs.  

The issue seems somewhat resistant to reproduction, and I am unsure of
all the factors required to reproduce it.  It occurs during high
throughput testing; we are using the Sierra module's dataloopback mode
to test.

The test sets the module into dataloopback mode and then sends a
crafted UDP stream into an ethernet interface on the device.  The
destination IP matches the IP address on the mhi network interface, and
the source address has a static ARP entry on that input interface so
the returning traffic will be output again.

A colleague had done some experimentation to see how to make the issue
more likely and it seemed that the combination of the following did so:

 * Setting the IP_HW0_MBIM channel ring lengths to 3000 instead of 128
   while leaving the associated event rings at length 2048.
 * Setting the MHI_MBIM_DEFAULT_MRU to 7500 rather than 3500.

Also, while I thought using the check given in the original email to
avoid processing xfer events with a dev_rp outside the "in-flight"
region would avoid the issue, we have since seen an issue despite this.

In addition to the above, I was zeroing out most fields of the buf_info
struct for the TRE before calling the transfer callback, checking that
the cb_buf addr was non-NULL before actually calling the callback, and
logging if it was ever NULL.  We have seen this even with the default
ring sizes and an MRU of 32768, though it is always the upload side
ring that seems to experience it.
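
The guard described above looks roughly like the following user-space
sketch.  It is only an illustration under assumptions: sketch_buf_info
carries just two fields, and complete_once, sketch_xfer_cb and
sketch_count_cb are hypothetical names, not the driver's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down stand-in for the per-TRE buffer info. */
struct sketch_buf_info {
	void *cb_buf; /* buffer handed back to the client */
	size_t len;   /* transfer length */
};

typedef void (*sketch_xfer_cb)(void *buf, size_t len);

/* Example callback that only counts how often it is invoked. */
static int sketch_cb_calls;

static void sketch_count_cb(void *buf, size_t len)
{
	(void)buf;
	(void)len;
	sketch_cb_calls++;
}

/*
 * Runs the callback at most once per queued buffer.  Returns false
 * (where a real driver might log) if the entry was already consumed.
 */
static bool complete_once(struct sketch_buf_info *info, sketch_xfer_cb cb)
{
	struct sketch_buf_info snapshot;

	assert(info && cb);
	snapshot = *info;
	if (!snapshot.cb_buf)
		return false; /* already completed: skip the callback */

	/* Clear the entry before the callback so a repeat is detectable. */
	memset(info, 0, sizeof(*info));
	cb(snapshot.cb_buf, snapshot.len);
	return true;
}
```

This only hides the symptom for a slot that has not been requeued in
the meantime; it does not explain why the stale event arrives.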

> > > [..]
> 
> In theory this is not suppose to happen. once a xfer completion event
> is 
> posted on event ring TRE belongs to Host MHI, Device is not suppose
> to
> work on this TRE any more.

Is there any way it could post these events "out of order"?

> > > [..]
> 
> This assumption is as per MHI spec.
> 
> I am checking internally if there is any know issue on device side.
> This 
> model seems to be Qualcomm® Snapdragon™ X55 ?
> 
I believe this module uses this SoC, yes.

Thanks,
Paul