On 2023/7/26 22:20, Alan Stern wrote: > On Wed, Jul 26, 2023 at 02:58:18PM +0800, liulongfang wrote: >> On 2023/7/21 22:57, Alan Stern Wrote: >>> On Fri, Jul 21, 2023 at 06:00:15PM +0800, liulongfang wrote: >>>> On systems that use ECC memory. The ECC error of the memory will >>>> cause the USB controller to halt. It causes the usb_control_msg() >>>> operation to fail. >>> >>> How often does this happen in real life? (Besides, it seems to me that >>> if your system is getting a bunch of ECC memory errors then you've got >>> much worse problems than a simple USB failure!) >>> >> >> This problem is on ECC memory platform. >> In the test scenario, the problem is 100% reproducible. >> >>> And why do you worry about ECC memory failures in particular? Can't >>> _any_ kind of failure cause the usb_control_msg() operation to fail? >>> >>>> At this point, the returned buffer data is an abnormal value, and >>>> continuing to use it will lead to incorrect results. >>> >>> The driver already contains code to check for abnormal values. The >>> check is not perfect, but it should prevent things from going too >>> badly wrong. >>> >> >> If it is ECC memory error. These parameter checks would also >> actually be invalid. >> >>>> Therefore, it is necessary to judge the return value and exit. >>>> >>>> Signed-off-by: liulongfang <liulongfang@xxxxxxxxxx> >>> >>> There is a flaw in your reasoning. >>> >>> The operation carried out here is deliberately unsafe (for full-speed >>> devices). It is made before we know the actual maxpacket size for ep0, >>> and as a result it might return an error code even when it works okay. >>> This shouldn't happen, but a lot of USB hardware is unreliable. >>> >>> Therefore we must not ignore the result merely because r < 0. If we do >>> that, the kernel might stop working with some devices. >>> >> It may be that the handling solution for ECC errors is different from that >> of the OS platform. On the test platform, after usb_control_msg() fails, >> reading the memory data of buf will directly lead to kernel crash: > > All right, then here's a proposal for a different way to solve the > problem: Change the kernel's handler for the ECC error notification. > Have it clear the affected parts of memory, so that the kernel can go > ahead and use them without crashing. > > It seems to me that something along these lines must be necessary in > any case. Unless the bad memory is cleared somehow, it would never be > usable again. The kernel might deallocate it, then reallocate for > another purpose, and then crash when the new user tries to access it. > > In fact, this scenario could still happen even with your patch, which > means the patch doesn't really fix the problem. > This patch is only used to prevent data in the buffer from being accessed. As long as the data is not accessed, the kernel does not crash. thanks, Longfang. > Alan Stern > > . >