Hi Sean, Thanks again for reviewing my patch submission. This patch updates mceusb RX and TX HALT error handling and recovery reporting, and provides placeholder for a later patch. The possible later patch would have mceusb call usb_reset_device() in place of the new "... requires USB Reset ..." message. I say "possible" because I'm running into multiple issues invoking usb_reset_device() from within mceusb. I'll also mention the difficulty obtaining even a single fault replication (~ 1/month) and test cycle. Additional comments below... ..A Sun On 6/6/2019 5:53 AM, Sean Young wrote: > On Sat, Jun 01, 2019 at 07:35:09PM -0400, A Sun wrote: >> Update dev_err() messages to report status (including success) for each >> step of USB RX HALT and TX HALT error recovery. If error recovery fails, >> show the message: >> stuck RX HALT state requires USB Reset Device to clear >> or >> stuck TX HALT state requires USB Reset Device to clear >> >> >> This patch just affects RX and TX HALT error handling and recovery >> reporting. >> >> A possible future patch may enable the mceusb driver to attempt >> itself usb_reset_device() when necessary. >> >> --- >> >> As seen on very rare occasions with mceusb device 2304:0225 >> [59388.696941] mceusb 1-1.1.2:1.0: Error: urb status = -32 (RX HALT) >> [59388.698838] mceusb 1-1.1.2:1.0: rx clear halt error -32 >> the device can get into RX or TX HALT state where usb_clear_halt() >> also fails and also returns -EPIPE (HALT/STALL). After which, >> all further mceusb device control and data I/O fails with HALT/STALL. >> Cause and problem replication conditions are still unknown. >> >> As seen in https://ubuntuforums.org/showthread.php?t=2406882 >> ... >> mceusb 2-1.4:1.0: Error: urb status = -32 (RX HALT) >> ... >> for an unknown mceusb device, usb_clear_halt() apparently can return >> false success. After which, RX HALT immediately reoccurs resulting in > > I'm having trouble understanding this. usb_clear_halt() returns an int. > Do you mean usb_clear_halt() returns 0? > >> a RX halt and clear halt infinite loop and message flooding. Again, >> cause and problem replication conditions remain unknown. > > So usb_clear_halt() returns 0 yet after the next usb_submit_urb() > the completion function gets -EPIPE again? Yes, for the infinite loop with the unknown mceusb device, RX completion gets -EPIPE, usb_clear_halt() probably returned 0 (success), but after the usb_submit_urb(), the RX completion again gets -EPIPE. > Are you trying to address this problem in this patch or just the error > reporting? If you're trying to change the error reporting then there is > no reason to move code around like you have. This patch is mainly just error reporting. A future patch would replace the "stuck HALT state" placeholder with code. The move code would be for the placeholder. The infinite loop case isn't resolved. But a false usb_clear_halt() status 0 (success) return would be printed/shown by this patch to confirm the infinite loop theory above. > >> >> After observing a rx clear halt fault with mceusb 2304:0225, further >> troubleshooting reveals usb_reset_device() is able to restore device >> functionality. >> $ lsusb >> Bus 001 Device 010: ID 2304:0225 Pinnacle Systems, Inc. Remote Kit Infrared Transceiver >> ... >> $ sudo ./usbreset /dev/bus/usb/001/010 # ioctl(fd, USBDEVFS_RESET, 0); >> Resetting USB device /dev/bus/usb/001/010 >> Reset successful >> $ dmesg >> ... >> [272392.540679] usb 1-1.2.7: reset full-speed USB device number 10 using dwc_otg >> [272392.676616] Registered IR keymap rc-rc6-mce >> ... >> [272392.678313] rc rc0: lirc_dev: driver ir-lirc-codec (mceusb) registered at minor = 0 > > Note the ir-lirc-codec. This is a pre-4.16 kernel. > >> [272392.678616] mceusb 1-1.2.7:1.0: long-range (0x1) receiver active >> [272392.940671] mceusb 1-1.2.7:1.0: Registered Pinnacle Systems PCTV Remote USB with mce emulator interface version 1 >> [272392.940687] mceusb 1-1.2.7:1.0: 2 tx ports (0x3 cabled) and 2 rx sensors (0x1 active) > > >> ... >> >> Additional findings are "modprobe -r mceusb" and "modprobe mceusb" >> are unable to clear stuck RX or TX HALT state. >> Attempts to call usb_reset_endpoint() and usb_reset_configuration() >> from within the mceusb driver also did not clear stuck HALTs. >> >> A possible future patch could have the mceusb call usb_reset_device() >> on itself, when necessary. Limiting the number of HALT error recovery >> attempts may also be necessary to prevent halt/clear-halt loops. >> >> Unresolved for now, deadlock complications occur if mceusb's worker >> thread, mceusb_deferred_kevent(), calls usb_reset_device(), >> which calls mceusb_dev_disconnect(), which calls cancel_work_sync(), >> which waits on the still active worker thread. > > I think you can call usb_lock_device_for_reset() to prevent that problem. Deadlock still occurs in test: mceusb_deferred_kevent() ->usb_reset_device() ->mceusb_dev_disconnect() ->cancel_work_sync() where cancel_work_sync() blocks because mceusb_deferred_kevent() is active. > It would be nice to see this implemented too. > Any additional tips to implement would help! How to validate and test a rare condition with a larger audience? >> >> Signed-off-by: A Sun <as1033x@xxxxxxxxxxx> >> --- >> drivers/media/rc/mceusb.c | 35 +++++++++++++++++++++++++---------- >> 1 file changed, 25 insertions(+), 10 deletions(-) >> >> diff --git a/drivers/media/rc/mceusb.c b/drivers/media/rc/mceusb.c >> index efffb1795..5d81ccafc 100644 >> --- a/drivers/media/rc/mceusb.c >> +++ b/drivers/media/rc/mceusb.c >> @@ -765,7 +765,7 @@ static void mceusb_defer_kevent(struct mceusb_dev *ir, int kevent) >> { >> set_bit(kevent, &ir->kevent_flags); >> if (!schedule_work(&ir->kevent)) >> - dev_err(ir->dev, "kevent %d may have been dropped", kevent); >> + dev_dbg(ir->dev, "kevent %d already scheduled", kevent); >> else >> dev_dbg(ir->dev, "kevent %d scheduled", kevent); >> } >> @@ -1404,19 +1404,26 @@ static void mceusb_deferred_kevent(struct work_struct *work) >> container_of(work, struct mceusb_dev, kevent); >> int status; >> >> + dev_info(ir->dev, "kevent handler called (flags 0x%lx)", >> + ir->kevent_flags); >> + >> if (test_bit(EVENT_RX_HALT, &ir->kevent_flags)) { >> usb_unlink_urb(ir->urb_in); >> status = usb_clear_halt(ir->usbdev, ir->pipe_in); >> + dev_err(ir->dev, "rx clear halt status = %d", status); >> if (status < 0) { >> - dev_err(ir->dev, "rx clear halt error %d", >> - status); >> - } >> - clear_bit(EVENT_RX_HALT, &ir->kevent_flags); >> - if (status == 0) { >> + /* >> + * Unable to clear RX stall/halt. >> + * Will need to call usb_reset_device(). >> + */ >> + dev_err(ir->dev, >> + "stuck RX HALT state requires USB Reset Device to clear"); > > Here you say if the usb_clear_halt() returns < 0 then we're in the bad > code path. But in this code path, we're not re-submitting the urb so > we wouldn't end up in an infinite loop. > The theory for the infinite loop case is usb_clear_halt() returned 0. The infinite loop isn't resolved with this patch. > >> + } else { >> + clear_bit(EVENT_RX_HALT, &ir->kevent_flags); > > Why have you moved this line? > Clear bit but only if usb_clear_halt() recovery succeeds. A future usb_device_reset() operation affects both RX and TX and its success code path must clear both EVENT_RX_HALT and EVENT_TX_HALT and do its own usb_submit_urb() with message "rx reset device submit urb status = %d" Hence, the moved line. >> status = usb_submit_urb(ir->urb_in, GFP_KERNEL); >> if (status < 0) { >> dev_err(ir->dev, >> - "rx unhalt submit urb error %d", >> + "rx unhalt submit urb error = %d", >> status); >> } >> } > > I don't understand how this code change has changed anything. In the existing > code and the patch code the urb only gets re-submitted if usb_clear_halt() > succeeds. > This patch is mainly just error reporting revisions. A future patch would replace the "stuck HALT state" message and placeholder with code to perform USB reset. >> @@ -1424,9 +1431,17 @@ static void mceusb_deferred_kevent(struct work_struct *work) >> >> if (test_bit(EVENT_TX_HALT, &ir->kevent_flags)) { >> status = usb_clear_halt(ir->usbdev, ir->pipe_out); >> - if (status < 0) >> - dev_err(ir->dev, "tx clear halt error %d", status); >> - clear_bit(EVENT_TX_HALT, &ir->kevent_flags); >> + dev_err(ir->dev, "tx clear halt status = %d", status); >> + if (status < 0) { >> + /* >> + * Unable to clear TX stall/halt. >> + * Will need to call usb_reset_device(). >> + */ >> + dev_err(ir->dev, >> + "stuck TX HALT state requires USB Reset Device to clear"); >> + } else { >> + clear_bit(EVENT_TX_HALT, &ir->kevent_flags); >> + } >> } >> } >> >> -- >> 2.11.0 >