Re: xhci_hcd crash on linux 4.7.0

Alex Damian <alex.r.damian@xxxxxxxxx> · Wed, 12 Oct 2016 17:13:05 +0100

Hello,

To follow up on the original bug report: I am still experiencing
memory corruption problems in the xhci stack.

One thing I noticed is that the corruption always occurs on a secondary
CPU (i.e. the stack trace starts in cpu_startup_entry), and it always
happens while the CPU is handling an interrupt.

It seems to me that a mutex or something similar is not being locked
correctly, but I have no experience with this part of the code, so I
have no idea where to look.

Any pointers, ideas, or suggestions?

Cheers,
Alex

On Thu, Aug 25, 2016 at 2:22 PM, Mathias Nyman
<mathias.nyman@xxxxxxxxxxxxxxx> wrote:
> On 29.07.2016 17:41, Alex Damian wrote:
>>
>> On Fri, Jul 29, 2016 at 2:53 PM, Greg KH <greg@xxxxxxxxx> wrote:
>>>
>>> On Fri, Jul 29, 2016 at 10:58:03AM +0100, Alex Damian wrote:
>>>>
>>>> Hi Greg,
>>>>
>>>> I managed to reproduce with an untainted kernel, see dmesg paste below.
>>>> The stack seems to be corrupted as well?
>>>>
>>>> I referred to it as a crash since after a couple of these issues, the
>>>> machine hard freezes - I set up a serial console via a USB cable, but
>>>> I don't get the kernel oops out of the machine. The network is also
>>>> dead before getting any data. I could not think of any other way to
>>>> get a console out of a Macbook - any ideas ?
>>>>
>>>> There is a progressive level of deterioration going on below, this is
>>>> why I'm adding multiple pastes. See the obviously invalid pointer
>>>> 0000000000000001 in 3rd paste below. Also, see the protection fault in
>>>> the last paste. To me, something is trampling all over memory, and it
>>>> is usb-related.
>>>
>>>
>>> Not good, thanks for reproducing it without the closed kernel drivers.
>>>
>>> If you disable the list debug kernel option, do you have any problems
>>> with the machine?  We aren't having any other reports of issues like
>>> this at the moment, which makes me worry that it's something unique to
>>> your situation/hardware.
>>
>>
>> I strongly suspect it's related to the macbook 12,1 hardware. I haven't
>> been able to reproduce this with other machines, including other
>> macbook versions with the same peripherals.
>>
>> This machine has never been stable in this particular peripheral
>> configuration. I had Apple run all HW diagnostics on the machine, I ran
>> the memcheck to verify that the RAM is ok - all results are clean. The
>> machine is very stable under Mac OSX.
>>
>>> And you don't know that it's a USB problem, only that USB is the one
>>> that is showing the issue.  Anyone could be writing over memory.
>>
>>
>> True. However it seems particularly related to the USB mouse - that's
>> how I manage to reproduce the error.
>>
>>>
>>> Also, any chance you can use 'git bisect' to track down an offending
>>> commit?  I'm assuming that this used to work properly and something
>>> recently caused the issue, correct?
>>
>>
>> The earliest kernels I've tested are in the 3.3 range. All kernels
>> before 4.7 just lock up. 4.7 is the first kernel where I have
>> meaningful dmesg errors before locking up. As such, there is very
>> little that I can do to bisect :(.
>>
>
> Going through xhci related issues that occurred during my vacation.
>
> There is one command list related issue fixed in 4.8-rc3, any chance you
> could try it?
> Alternatively, just add the following patch to 4.7:
> 33be126 xhci: always handle "Command Ring Stopped" events
>
> Enabling xhci debug could reveal something.
> echo -n 'module xhci_hcd =p' > /sys/kernel/debug/dynamic_debug/control
>
> -Mathias
>
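
For anyone else following the thread, here is a rough sketch of how the
debug step suggested above could be scripted. It assumes debugfs is
mounted at /sys/kernel/debug and that the kernel was built with
CONFIG_DYNAMIC_DEBUG; the dyndbg_query helper and the xhci-debug.log
file name are made up for illustration and are not part of any kernel
tooling.

```shell
#!/bin/sh
# Sketch only: turn on xhci_hcd dynamic debug output, then save the log.
# Assumption: debugfs is mounted at /sys/kernel/debug and the kernel has
# CONFIG_DYNAMIC_DEBUG enabled.

CONTROL=/sys/kernel/debug/dynamic_debug/control

# Hypothetical helper: formats a dynamic debug query for one module,
# without a trailing newline (same effect as 'echo -n').
dyndbg_query() {
    printf 'module %s %s' "$1" "$2"
}

# Only touch the control file if it exists and is writable (needs root).
if [ -w "$CONTROL" ]; then
    dyndbg_query xhci_hcd =p > "$CONTROL"
    echo "xhci_hcd debug enabled; reproduce the problem, then run:"
    echo "  dmesg > xhci-debug.log"
else
    echo "cannot write $CONTROL (not root, or debugfs not mounted?)" >&2
fi
```

Writing the same query with -p instead of =p should switch the extra
output off again once the log has been captured.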