On 2023/7/27 23:57, Alan Stern wrote: > On Thu, Jul 27, 2023 at 05:31:41PM +0200, Oliver Neukum wrote: >> On 27.07.23 16:42, Alan Stern wrote: >>> On Thu, Jul 27, 2023 at 03:03:57PM +0800, liulongfang wrote: >>>> On 2023/7/26 22:20, Alan Stern wrote: >> >>>>> It seems to me that something along these lines must be necessary in >>>>> any case. Unless the bad memory is cleared somehow, it would never be >>>>> usable again. The kernel might deallocate it, then reallocate for >>>>> another purpose, and then crash when the new user tries to access it. >>>>> >>>>> In fact, this scenario could still happen even with your patch, which >>>>> means the patch doesn't really fix the problem. >> >> I suppose in theory you could have something like a bad blocks list >> just for RAM, but that would really hurt. You'd have to do something >> about every DMA operation in every driver in theory. >> >> Error handling would basically be an intentional memory leak. > > I started out thinking this way, but maybe that's not how it works. > Perhaps simply overwriting the part of memory that got the ECC error > would clear the error state. (This may depend on the kind of error, > one-time vs. permanent.) > > If that's the case, and if the memory buffer was deallocated without > being accessed and then later reallocated, things would be okay. The > routine that reallocated the buffer wouldn't try to read from it before > initializing it somehow. > >>>> This patch is only used to prevent data in the buffer from being accessed. >>>> As long as the data is not accessed, the kernel does not crash. >>> >>> I still don't understand. You haven't provided nearly enough >>> information. You should start by answering the questions that Oliver >>> asked. Then answer this question: >>> >>> The code you are concerned about is this: >>> >>> r = usb_control_msg(udev, usb_rcvaddr0pipe(), >>> USB_REQ_GET_DESCRIPTOR, USB_DIR_IN, >>> USB_DT_DEVICE << 8, 0, >>> buf, GET_DESCRIPTOR_BUFSIZE, >>> initial_descriptor_timeout); >>> switch (buf->bMaxPacketSize0) { >>> >>> You're worried that if an ECC memory error occurs during the >>> usb_control_msg transfer, the kernel will crash when the "switch" >>> statement tries to read the value of buf->bMaxPacketSize0. That's a >>> reasonable thing to worry about. >> >> Albeit unlikely. If the hardware and implementation are reasonable >> you'd return a specific error code from the HCD and clean up the >> RAM in your ecc driver. >> >> The fix for USB would then conceptually be something like >> >> retryio: >> r = usb_control_msg() >> if (r == -EMEMORYCORRUPTION) >> goto retryio; > > Yes, we could do this, but it's not necessary. Let's say that the HCD > returns -EMEMORYCORRUPTION and the ecc driver cleans up the RAM > (probably by resetting its contents to 0, but possibly leaving garbage > there instead). Then when the following code in hub_port_init() tests > buf->bMaxPacketSize0, it will see an invalid value and will retry the > transfer. > > Or, with low probability, it will see a valid but incorrect value. If > that happens then later transfers using ep0 will fail, causing the hub > driver to reiterate the outer loop in hub_port_connect(). Eventually > the device will be correctly initialized and enumerated. > > Alan Stern > OK, thanks. Longfang. > . >