Re: [PATCH] USB:bugfix a controller halt error

liulongfang <liulongfang@xxxxxxxxxx> · Thu, 27 Jul 2023 15:03:57 +0800

On 2023/7/26 22:20, Alan Stern wrote:
> On Wed, Jul 26, 2023 at 02:58:18PM +0800, liulongfang wrote:
>> On 2023/7/21 22:57, Alan Stern Wrote:
>>> On Fri, Jul 21, 2023 at 06:00:15PM +0800, liulongfang wrote:
>>>> On systems that use ECC memory. The ECC error of the memory will
>>>> cause the USB controller to halt. It causes the usb_control_msg()
>>>> operation to fail.
>>>
>>> How often does this happen in real life?  (Besides, it seems to me that 
>>> if your system is getting a bunch of ECC memory errors then you've got 
>>> much worse problems than a simple USB failure!)
>>>
>>
>> This problem is on ECC memory platform.
>> In the test scenario, the problem is 100% reproducible.
>>
>>> And why do you worry about ECC memory failures in particular?  Can't 
>>> _any_ kind of failure cause the usb_control_msg() operation to fail?
>>>
>>>> At this point, the returned buffer data is an abnormal value, and
>>>> continuing to use it will lead to incorrect results.
>>>
>>> The driver already contains code to check for abnormal values.  The 
>>> check is not perfect, but it should prevent things from going too 
>>> badly wrong.
>>>
>>
>> If it is ECC memory error. These parameter checks would also
>> actually be invalid.
>>
>>>> Therefore, it is necessary to judge the return value and exit.
>>>>
>>>> Signed-off-by: liulongfang <liulongfang@xxxxxxxxxx>
>>>
>>> There is a flaw in your reasoning.
>>>
>>> The operation carried out here is deliberately unsafe (for full-speed 
>>> devices).  It is made before we know the actual maxpacket size for ep0, 
>>> and as a result it might return an error code even when it works okay.  
>>> This shouldn't happen, but a lot of USB hardware is unreliable.
>>>
>>> Therefore we must not ignore the result merely because r < 0.  If we do 
>>> that, the kernel might stop working with some devices.
>>>
>> It may be that the handling solution for ECC errors is different from that
>> of the OS platform. On the test platform, after usb_control_msg() fails,
>> reading the memory data of buf will directly lead to kernel crash:
> 
> All right, then here's a proposal for a different way to solve the 
> problem: Change the kernel's handler for the ECC error notification.  
> Have it clear the affected parts of memory, so that the kernel can go 
> ahead and use them without crashing.
> 
> It seems to me that something along these lines must be necessary in 
> any case.  Unless the bad memory is cleared somehow, it would never be 
> usable again.  The kernel might deallocate it, then reallocate for 
> another purpose, and then crash when the new user tries to access it.  
> 
> In fact, this scenario could still happen even with your patch, which 
> means the patch doesn't really fix the problem.
> 

This patch is only used to prevent data in the buffer from being accessed.
As long as the data is not accessed, the kernel does not crash.

thanks,
Longfang.

> Alan Stern
> 
> .
>