Re: "usbip: Implement SG support to vhci-hcd and stub driver" causes a deadlock

Suwan Kim <suwan.kim027@xxxxxxxxx> · Wed, 11 Dec 2019 12:07:54 +0900

On Tue, Dec 10, 2019 at 04:32:21PM +0100, Marek Marczykowski-Górecki wrote:
> On Tue, Dec 10, 2019 at 11:25:35PM +0900, Suwan Kim wrote:
> > On Mon, Dec 09, 2019 at 03:19:59PM +0100, Marek Marczykowski-Górecki wrote:
> > > On Mon, Dec 09, 2019 at 03:35:43PM +0900, Suwan Kim wrote:
> > > > On Mon, Dec 09, 2019 at 04:37:40AM +0100, Marek Marczykowski-Górecki wrote:
> > > > > On Mon, Dec 09, 2019 at 11:01:30AM +0900, Suwan Kim wrote:
> > > > > > On Fri, Dec 06, 2019 at 09:57:42PM +0100, Marek Marczykowski-Górecki wrote:
> > > > > > > [  212.890519] usb 1-1: recv xbuf, 42
> > > > > > 
> > > > > > This message is printed by receive error and before that, driver
> > > > > > canceled URB transmission. we need to know the exact situation
> > > > > > before this message.
> > > > > 
> > > > > I've added some more messages and found recv_size is 0.
> > > > 
> > > > That is the bug point. "size" is urb->actual_length that means
> > > > amount of data actually received from device. And "copy" is
> > > > amount of data received from usbip server. So, in this situation,
> > > > vhci-hcd received all the data from usbip server even if there
> > > > are more sg entries left. So, "copy == 0" means vhci-hcd receives
> > > > all data from the server and we should check "if (copy == 0)" in
> > > > for_each_sg() loop of usbip_recv_xbuff() to exit the loop and not
> > > > to add error event.
> > > 
> > > That makes sense. But I think there is also another issue here: hang in
> > > case of an error. Here it was EINVAL, but there are probably other
> > > reasons why usbip_recv can fail, like network error or misbehaving
> > > server. This definitely should not cause the client to fail this way...
> > > And also, the actual error code is lost.
> > 
> > I agree. I have been taking a look at it and trying to reproduce
> > the same issue on my machine. I guess race condition between hub
> > irq thread and driver (vhci_urb_enqueue or hcd?)
> > But I'm not sure...
> 
> I've tried it some more time and one time I've got a traceback pointing
> at lock_release() called from vhci_urb_dequeue, not lock_acquire(). This
> get me thinking it may not be deadlock on a spinlock, but some infinite
> loop. Looking at the source, I think it's about usb_hcd_flush_endpoint
> looping indefinitely because vhci_urb_dequeue() exit early on this:

That really makes sense. Very good debugging. I can also see the
infinite loop now.

When there is transaction error, vhci rx thread adds error event
and event handler tries to shutdown the connection and also changes
the port status of vhci that kicks hub_irq. Then hub_irq cleans up
the device resources including pending URBs (vhci_urb_dequeue).
But as you pointed out, "priv" is already released in vhci rx
thread.

> 
> 
>     spin_lock_irqsave(&vhci->lock, flags);
> 
>     priv = urb->hcpriv;
>     if (!priv) {
>         /* URB was never linked! or will be soon given back by
>          * vhci_rx. */
>         spin_unlock_irqrestore(&vhci->lock, flags);
>         return -EIDRM;
>     }
> 
> Adding a print there confirms it.
> 
> And I think it's because of vhci_recv_ret_submit():
> 
> 
>     spin_lock_irqsave(&vdev->priv_lock, flags);
>     urb = pickup_urb_and_free_priv(vdev, pdu->base.seqnum);
> 
> // ****** priv freed here
> 
>     spin_unlock_irqrestore(&vdev->priv_lock, flags);
> 
>     if (!urb) {
>         pr_err("cannot find a urb of seqnum %u max seqnum %d\n",
>             pdu->base.seqnum,
>             atomic_read(&vhci_hcd->seqnum));
>         usbip_event_add(ud, VDEV_EVENT_ERROR_TCP);
>         return;
>     }
>     
>     /* unpack the pdu to a urb */
>     usbip_pack_pdu(pdu, urb, USBIP_RET_SUBMIT, 0);
> 
>     /* recv transfer buffer */
>     if (usbip_recv_xbuff(ud, urb) < 0)
> // ***** exit early here
>         return;
> 
>     /* recv iso_packet_descriptor */
>     if (usbip_recv_iso(ud, urb) < 0)
>         return;
> 
> I'm not really sure what should happen, but I think some cleanup in case
> of usbip_recv_xbuff() failure is missing. And probably in case of
> usbip_recv_iso() failure too.

HCD should giveback URB to driver calling usb_hcd_giveback_urb().
But in the case of transaction error, vhci_recv_ret_submit()
terminates without giveback URB. So, I think the error path in
usbip_recv_xbuff and usbip_recv_iso should unlink URB from ep and
insert proper error code in urb->status that indicates error
status to driver and handle giveback URB to driver.

Then hub_irq doesn't need to flush error URB.
That will be helpful to graceful connection shutdown.

static void vhci_recv_ret_submit(struct vhci_device *vdev,
				 struct usbip_header *pdu)
...
...
	if (usbip_recv_xbuff(ud, urb) < 0) {
		urb->status = -EPIPE;
		goto error;	// goto error path
	}

	/* recv iso_packet_descriptor */
	if (usbip_recv_iso(ud, urb) < 0) {
		urb->status = -EPIPE;
		goto error;	// goto error path
	}
...
...
error://error path
	spin_lock_irqsave(&vhci->lock, flags);
	usb_hcd_unlink_urb_from_ep(vhci_hcd_to_hcd(vhci_hcd), urb);
	spin_unlock_irqrestore(&vhci->lock, flags);

	usb_hcd_giveback_urb(vhci_hcd_to_hcd(vhci_hcd), urb, urb->status);

	usbip_dbg_vhci_rx("Leave\n");
}

Regards,
Suwan Kim

> -- 
> Best Regards,
> Marek Marczykowski-Górecki
> Invisible Things Lab
> A: Because it messes up the order in which people normally read text.
> Q: Why is top-posting such a bad thing?