Re: [PATCH v2] usbnet: fix kernel crash after disconnect

David Miller <davem@xxxxxxxxxxxxx> · Mon, 20 May 2019 20:09:22 -0400 (EDT)

From: Kloetzke Jan <Jan.Kloetzke@xxxxxxx>
Date: Thu, 16 May 2019 07:10:30 +0000

> Am Montag, den 06.05.2019, 10:17 +0200 schrieb Oliver Neukum:
>> On So, 2019-05-05 at 00:45 -0700, David Miller wrote:
>> > From: Kloetzke Jan <Jan.Kloetzke@xxxxxxx>
>> > Date: Tue, 30 Apr 2019 14:15:07 +0000
>> > 
>> > > @@ -1431,6 +1432,11 @@ netdev_tx_t usbnet_start_xmit (struct sk_buff *skb,
>> > >               spin_unlock_irqrestore(&dev->txq.lock, flags);
>> > >               goto drop;
>> > >       }
>> > > +     if (WARN_ON(netif_queue_stopped(net))) {
>> > > +             usb_autopm_put_interface_async(dev->intf);
>> > > +             spin_unlock_irqrestore(&dev->txq.lock, flags);
>> > > +             goto drop;
>> > > +     }
>> > 
>> > If this is known to happen and is expected, then we should not warn.
>> > 
>> 
>> yes this is the point. Can ndo_start_xmit() and ndo_stop() race?
>> If not, why does the patch fix the observed issue and what
>> prevents the race? Something is not clear here.
> 
> Dave, could you shed some light on Olivers question? If the race can
> happen then we can stick to v1 because the WARN_ON is indeed pointless.
> Otherwise it's not clear why it made the problem go away for us and v2
> may be the better option...

Yes I think they can race.   ->ndo_stop() executes and stops the queue,
then we get an RCU grace period so that all parallel executions of
->ndo_start_xmit() complete.

But I wonder, this can probably cause problems because some drivers have
"stop queue and re-check" logic, f.e. in drivers/net/tg3.c we have:

	if (unlikely(tg3_tx_avail(tnapi) <= (MAX_SKB_FRAGS + 1))) {
		netif_tx_stop_queue(txq);

		/* netif_tx_stop_queue() must be done before checking
		 * checking tx index in tg3_tx_avail() below, because in
		 * tg3_tx(), we update tx index before checking for
		 * netif_tx_queue_stopped().
		 */
		smp_mb();
		if (tg3_tx_avail(tnapi) > TG3_TX_WAKEUP_THRESH(tnapi))
			netif_tx_wake_queue(txq);
	}

which in the racey scenerio would undo ->ndo_stop()'s work which is
completely unexpected.

Hmmm...