Re: [PATCH] xHCI: fix bug in xhci_clear_command_ring()

Julian Sikorski <belegdol@xxxxxxxxx> · Thu, 01 Dec 2011 13:57:51 +0100

Hi,

OK, the dead port episode seems to have been an isolated, unrelated
incident. I have suspended and resumed the machine many times, plugging
the drive into both ports, and the controller seems rock solid with this
new patch. The fact that I was able to suspend every time confirms that
it keeps responding. As usual, /var/log/messages is attached.

Julian

On 01.12.2011 02:14, Julian Sikorski wrote:
> I have mixed results. Here is what I did:
> 
> - plugged the drive in
> - disconnected it
> - suspended/resumed
> - reconnected it
> - used it for 90 minutes
> Everything was fine, which is better than with the unpatched kernel. I
> then continued:
> - suspended it with the drive connected (around 01:51:52)
> - resumed, the drive still worked
> Unfortunately, the second port stopped responding (01:57:05). Another
> one or two suspend-resume cycles did not bring it back to life, but the
> first port was still working fine.
> This may well be a different problem, since normally after a failure
> the system would not suspend at all. This time just one port seems to
> be acting up. Oddly enough, nothing was ever connected to it during
> this session. I will keep testing since something might still be going
> on (it is definitely more stable, but let's hold off on the final
> call).
> In the meantime, please have a look at /var/log/messages, maybe there is
> something interesting in it.
> 
> Regards,
> Julian
> 
> 
> On 30.11.2011 19:29, Sarah Sharp wrote:
>> Good catch!
>>
>> Is there any chance that Julian's instability after system resume is
>> related to this bug?  If you forced a reset resume, the xHCI driver
>> would have reallocated the command ring with a proper link TRB.  Without
>> the reset resume, the zeroed command ring wouldn't have a link TRB and
>> the host controller would have eventually walked off the end of the
>> command ring.  That might explain why the host controller stopped
>> responding to the stop endpoint command without the reset resume, but
>> only after a very long time (half an hour).
>>
>> Julian, can you revert Andiry's patch to add the reset resume, add this
>> patch instead, and see if it fixes your instability issues?  If so, I
>> think this is a better fix.
>>
>> Sarah Sharp
>>
>> On Wed, Nov 30, 2011 at 04:37:41PM +0800, Andiry Xu wrote:
>>> When the system enters suspend, the xHCI driver clears the command ring
>>> by writing zero to all the TRBs. However, this also writes zero to the
>>> Link TRB, and the ring is mangled. This may cause the driver to access
>>> a wrong memory address, with unpredictable results.
>>>
>>> When clearing the command ring, keep the last Link TRB intact and only
>>> clear its cycle bit. This should fix the "command ring full" issue
>>> reported by Oliver Neukum.
>>>
>>> This should be backported to stable kernels as old as 2.6.37, since
>>> commit 89821320 "xhci: Fix command ring replay after resume" was merged.
>>>
>>> Signed-off-by: Andiry Xu <andiry.xu@xxxxxxx>
>>> ---
>>>  drivers/usb/host/xhci.c |    5 ++++-
>>>  1 files changed, 4 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
>>> index aa94c01..a1afb7c 100644
>>> --- a/drivers/usb/host/xhci.c
>>> +++ b/drivers/usb/host/xhci.c
>>> @@ -711,7 +711,10 @@ static void xhci_clear_command_ring(struct xhci_hcd *xhci)
>>>  	ring = xhci->cmd_ring;
>>>  	seg = ring->deq_seg;
>>>  	do {
>>> -		memset(seg->trbs, 0, SEGMENT_SIZE);
>>> +		memset(seg->trbs, 0,
>>> +			sizeof(union xhci_trb) * (TRBS_PER_SEGMENT - 1));
>>> +		seg->trbs[TRBS_PER_SEGMENT - 1].link.control &=
>>> +			cpu_to_le32(~TRB_CYCLE);
>>>  		seg = seg->next;
>>>  	} while (seg != ring->deq_seg);
>>>  
>>> -- 
>>> 1.7.4.1
>>>
>>>
> 
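
For anyone following the thread without the driver source at hand, here
is a minimal user-space sketch of what the patched loop does. It is not
the kernel code: the struct layouts, the segment size and the helper
names (ring_init, ring_clear, ring_links_intact) are assumptions made
for illustration, and the cpu_to_le32() endianness handling from the
patch is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-ins for the driver's definitions (assumed values). */
#define TRBS_PER_SEGMENT 64
#define TRB_CYCLE        (1u << 0)     /* cycle bit in the control word */
#define TRB_TYPE_MASK    (0x3fu << 10) /* TRB type field */
#define TRB_TYPE_LINK    (6u << 10)    /* type value used for a link TRB */

struct trb {
	uint32_t field[4];             /* field[3] is the control word */
};

struct segment {
	struct trb trbs[TRBS_PER_SEGMENT];
	struct segment *next;
};

struct ring {
	struct segment *deq_seg;
};

/* Chain nsegs segments into a circle; the last TRB of each is a link TRB. */
static void ring_init(struct ring *r, struct segment *segs, int nsegs)
{
	assert(nsegs > 0);
	for (int i = 0; i < nsegs; i++) {
		memset(&segs[i], 0, sizeof(segs[i]));
		segs[i].next = &segs[(i + 1) % nsegs];
		segs[i].trbs[TRBS_PER_SEGMENT - 1].field[3] =
			TRB_TYPE_LINK | TRB_CYCLE;
	}
	r->deq_seg = &segs[0];
}

/* Mirror of the patched loop: zero all TRBs except the trailing link TRB,
 * and only clear the cycle bit of that link TRB. */
static void ring_clear(struct ring *r)
{
	struct segment *seg = r->deq_seg;

	do {
		memset(seg->trbs, 0,
		       sizeof(struct trb) * (TRBS_PER_SEGMENT - 1));
		seg->trbs[TRBS_PER_SEGMENT - 1].field[3] &= ~TRB_CYCLE;
		seg = seg->next;
	} while (seg != r->deq_seg);
}

/* Return 1 if every segment still ends in a link TRB with cycle bit 0. */
static int ring_links_intact(const struct ring *r)
{
	const struct segment *seg = r->deq_seg;

	do {
		uint32_t control = seg->trbs[TRBS_PER_SEGMENT - 1].field[3];

		if ((control & TRB_TYPE_MASK) != TRB_TYPE_LINK)
			return 0;
		if (control & TRB_CYCLE)
			return 0;
		seg = seg->next;
	} while (seg != r->deq_seg);
	return 1;
}
```

Calling ring_init() on two segments, writing to an ordinary TRB and then
calling ring_clear() should leave that TRB zeroed while
ring_links_intact() still returns 1. Zeroing the whole segment instead,
as the old memset(seg->trbs, 0, SEGMENT_SIZE) did, would wipe the link
TRB as well.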

Attachment:
messages.xz

Description: application/xz