On Thu, 2021-08-26 at 09:54 -0600, Jeffrey Hugo wrote:
> On 8/23/2021 12:47 AM, Paul Davey wrote:
> > Hi Hemant, Jeffery
> >
> > I have some more information after some testing.
> >
> > > > Do you have a log which prints the TRE being processed?
> > > > Basically i am trying understand this : by the time you get
> > > > double free issue, is there any pattern with respect to the
> > > > TRE that is being processed. For example when host processed
> > > > the given TRE for the first time with RP1, stale TRE was
> > > > posted by Event RP2 right after RP1
> > > >
> > > > ->RP1 [TRE1]
> > > > ->RP2 [TRE1]
> > > >
> > > > or occurrence of stale TRE event is random?
> >
> > I have now collected some information by adding buffers which
> > record some of the information desired and searching or printing
> > this information only when the issue is detected in order to avoid
> > constant verbose debug information and potential slowdowns.
> >
> > From this information I can report that when this issue happens
> > two consecutive transfer completion events occur with the same TRE
> > pointer in them, I did not record events which are not transfer
> > completion events or the event ring RP during processing.
> >
> > So the event is as follows:
> >
> > mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2 ptr: 77c94780
> > mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2 ptr: 77c94780
>
> This isn't good. I would suspect that the device is glitching then,
> which should be fixed on the device side, but that doesn't help you
> here and now.
>
> I'm thinking your change is probably a good idea based on this, but I
> have additional questions.
>
> Can you check the address of the completion events in the shared
> memory (basically the event ring) when you see this?
> I want to rule out the possibility that host is double processing
> the same event, and this is truly a case of the device duplicating
> an event.
>
> I hope that makes sense to you.

I have added recording of the event TRE address to my debug
collection, so that should answer this question when the results from
that test come back. The test will run over the weekend.

Additionally, I have another update with some results that have come
in since my last mail. Two incidents occurred. I am not sure about
their relative timing, but the TRE addresses suggest they did not
happen immediately after each other.

First, my check for whether cb_buf in buf_info is NULL went off, but
nothing else untoward seemed to be happening. I have added a check to
parse_xfer_event that verifies the wp in the buf_info matches the TRE
address being processed, to try to catch the rings somehow getting out
of sync, but aside from that I can only think of concurrency-related
issues causing this problem. Perhaps I should also check whether the
ring in question is full, in case that gives any insight.

Secondly, we saw the following pattern in completion events:

mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2 ptr: 7c4004e0
mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2 ptr: 7c400520
mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2 ptr: 7c4004c0
mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2 ptr: 7c4004b0
mhi mhi0: (IP_HW0_MBIM-Up) Completion Event code: 2 length: 5e2 ptr: 7c4004a0

Here we can see that instead of a completion event for 7c4004d0 we
have one for 7c400520, which is significantly ahead of the other
pointers. From the list of TREs I store in mhi_gen_tre, I suspect that
7c400520 is the next TRE to be used in the TRE ring at this time, as
the other information shows it would be the oldest entry in that list.
I am not sure what could have caused this, but it is a different case
from the modem repeating the same TRE in a completion event.

Thanks,
Paul
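
For reference, the two failure patterns discussed in this thread (a
repeated TRE pointer, and a completion that skips ahead in the ring)
could be told apart with a check along the following lines. This is
only a sketch under stated assumptions, not code from the MHI driver:
the names tre_checker, tre_checker_init, tre_checker_feed and the
tre_verdict values are hypothetical, the 0x10 stride is inferred from
the spacing of the logged pointers, ring wraparound is ignored, and
events are expected in chronological order.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical classification of one transfer completion event. */
enum tre_verdict {
	TRE_OK,           /* pointer is the next expected TRE */
	TRE_DUPLICATE,    /* same pointer as the previous event */
	TRE_OUT_OF_ORDER, /* neither the expected TRE nor a repeat */
};

/* Hypothetical tracker state; not a structure from the MHI driver. */
struct tre_checker {
	uint64_t expected; /* TRE address the next completion should carry */
	uint64_t last;     /* TRE address carried by the previous completion */
	uint64_t stride;   /* TRE size in bytes (assumed 0x10 from the logs) */
	int have_last;     /* zero until the first event has been seen */
};

static void tre_checker_init(struct tre_checker *c, uint64_t first,
			     uint64_t stride)
{
	assert(stride != 0);
	c->expected = first;
	c->last = 0;
	c->stride = stride;
	c->have_last = 0;
}

/*
 * Classify one completion pointer, fed in chronological order.
 * Ring wraparound is deliberately not handled in this sketch.
 */
static enum tre_verdict tre_checker_feed(struct tre_checker *c, uint64_t ptr)
{
	enum tre_verdict v;

	if (c->have_last && ptr == c->last) {
		/* A repeat consumes no new ring slot: keep 'expected'. */
		v = TRE_DUPLICATE;
	} else {
		v = (ptr == c->expected) ? TRE_OK : TRE_OUT_OF_ORDER;
		/* One slot was due either way, so move to the next one. */
		c->expected += c->stride;
	}
	c->last = ptr;
	c->have_last = 1;
	return v;
}
```

Assuming the second log is listed newest first (the order is not
stated in the mail), feeding it oldest first from 7c4004a0 would flag
only 7c400520 as out of order, standing in for 7c4004d0, while the
first log with 77c94780 twice would be flagged as a duplicate.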