2.6.11-rc5 and 2.6.12: cannot transmit anything

Denis Vlasenko <vda@xxxxxxxxxxxxx> · Mon, 25 Jul 2005 08:17:37 +0300

[resend. Did not reach mailing lists, most probably due
to KMail's unstoppable desire to use base64 encoding :)]

Hi folks,

I reported earlier that around linux-2.6.11-rc5 my home box sometimes
does not want to send anything over ethernet. That report is repeated below
the sig.

I finally managed to nail down where this happens.
I instrumented sch_generic.c to trace what happens with packets
to be sent over interface named "if". 

On a 'good' boot, I see

2005-07-12_17:26:29.72158 kern.info: qdisc_restart: start
2005-07-12_17:26:29.72164 kern.info: qdisc_restart: skb!=NULL
2005-07-12_17:26:29.72166 kern.info: qdisc_restart: if !netif_queue_stopped...
2005-07-12_17:26:29.72167 kern.info: qdisc_restart: ...hard_start_xmit

in the log; on a 'bad' one I see only "qdisc_restart: start".

Below are the first report and the instrumented part of sch_generic.c.
--
vda

Subject: linux-2.6.11-rc5: mysterious loss of tx

My home box has onboard via-rhine NIC.

Several days ago my father called me and said that
it does not send anything (tcpdump shows only rx'ed pkts
despite pings being attempted etc). I did not investigate
then.

Yesterday I saw it myself. I bumped up ethtool msglvl.
Looks like via-rhine's hard_start_xmit was not called at all
from network core code! (I did not see debug printks from
rhine's hard_start_xmit routine)

Whatever I tried (ifconfig down/up, reinit IP config from scratch),
nothing helped. No tx whatsoever was attempted by kernel, it seems.

Reboot 'fixed' things.

It never happened on the same hardware before I switched to rc5.

int qdisc_restart(struct net_device *dev)
{
        struct Qdisc *q = dev->qdisc;
        struct sk_buff *skb;
int track = (dev->name[0]=='i' && dev->name[1]=='f' && dev->name[2]=='\0');

//'via rhine bug':
//I see ONLY "qdisc_restart: start",
//but not any of below msgs.
//On 'good' boots, it looks like this:
//...
//2005-07-12_17:26:29.72158 kern.info: qdisc_restart: start
//2005-07-12_17:26:29.72164 kern.info: qdisc_restart: skb!=NULL
//2005-07-12_17:26:29.72166 kern.info: qdisc_restart: if !netif_queue_stopped...
//2005-07-12_17:26:29.72167 kern.info: qdisc_restart: ...hard_start_xmit
//...
if(track) { printk("qdisc_restart: start\n"); }
        /* Dequeue packet */
        if ((skb = q->dequeue(q)) != NULL) {
if(track) { printk("qdisc_restart: skb!=NULL\n"); }
                unsigned nolock = (dev->features & NETIF_F_LLTX);
                /*
                 * When the driver has LLTX set it does its own locking
                 * in start_xmit. No need to add additional overhead by
                 * locking again. These checks are worth it because
                 * even uncongested locks can be quite expensive.
                 * The driver can do trylock like here too, in case
                 * of lock congestion it should return -1 and the packet
                 * will be requeued.
                 */
                if (!nolock) {
                        if (!spin_trylock(&dev->xmit_lock)) {
                        collision:
if(track) { printk("qdisc_restart: collision\n"); }
                                /* So, someone grabbed the driver. */

                                /* It may be transient configuration error,
                                   when hard_start_xmit() recurses. We detect
                                   it by checking xmit owner and drop the
                                   packet when deadloop is detected.
                                */
                                if (dev->xmit_lock_owner == smp_processor_id()) {
                                        kfree_skb(skb);
                                        if (net_ratelimit())
                                                printk(KERN_DEBUG "Dead loop on netdevice %s, fix it urgently!\n", dev->name);
                                        return -1;
                                }
                                __get_cpu_var(netdev_rx_stat).cpu_collision++;
                                goto requeue;
                        }
                        /* Remember that the driver is grabbed by us. */
                        dev->xmit_lock_owner = smp_processor_id();
                }

                {
                        /* And release queue */
                        spin_unlock(&dev->queue_lock);

//vda
if(track) { printk("qdisc_restart: if !netif_queue_stopped...\n"); }
                        if (!netif_queue_stopped(dev)) {
                                int ret;
                                if (netdev_nit)
                                        dev_queue_xmit_nit(skb, dev);
if(track) { printk("qdisc_restart: ...hard_start_xmit\n"); }
                                ret = dev->hard_start_xmit(skb, dev);
                                if (ret == NETDEV_TX_OK) {
                                        if (!nolock) {
                                                dev->xmit_lock_owner = -1;
                                                spin_unlock(&dev->xmit_lock);
                                        }
                                        spin_lock(&dev->queue_lock);
                                        return -1;
                                }
                                if (ret == NETDEV_TX_LOCKED && nolock) {
                                        spin_lock(&dev->queue_lock);
                                        goto collision; 
                                }
                        }

                        /* NETDEV_TX_BUSY - we need to requeue */
                        /* Release the driver */
                        if (!nolock) { 
                                dev->xmit_lock_owner = -1;
                                spin_unlock(&dev->xmit_lock);
                        }
                        spin_lock(&dev->queue_lock);
                        q = dev->qdisc;
                }

                /* Device kicked us out :(
                   This is possible in three cases:

                   0. driver is locked
                   1. fastroute is enabled
                   2. device cannot determine busy state
                      before start of transmission (f.e. dialout)
                   3. device is buggy (ppp)
                 */

requeue:
                q->ops->requeue(skb, q);
                netif_schedule(dev);
                return 1;
        }
        BUG_ON((int) q->q.qlen < 0);
        return q->q.qlen;
}
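
Reading the trace against the code above: "skb!=NULL" is printed only when
q->dequeue(q) returns a packet, so a log that stops at "qdisc_restart: start"
suggests that dequeue returned NULL on the 'bad' boot. The sketch below is a
small userspace model of that mapping, not kernel code. The names track_dev,
last_stage and enum trace_stage are made up for illustration, and the model
assumes a non-LLTX driver (it ignores the NETDEV_TX_LOCKED path after
hard_start_xmit).

```c
#include <assert.h>
#include <string.h>

/* Hypothetical labels for the last trace message one call would print. */
enum trace_stage {
        STAGE_START,       /* only "qdisc_restart: start" */
        STAGE_COLLISION,   /* "skb!=NULL", then "collision" */
        STAGE_QUEUE_CHECK, /* "if !netif_queue_stopped..." but no xmit */
        STAGE_XMIT         /* "...hard_start_xmit" */
};

/* Same test as the `track` variable: true only for the name "if". */
int track_dev(const char *name)
{
        assert(name != NULL);
        return strcmp(name, "if") == 0;
}

/*
 * Model of the instrumented branches (assumption: non-LLTX driver).
 *   have_skb       - q->dequeue(q) returned a packet
 *   lock_collision - spin_trylock(&dev->xmit_lock) failed
 *   queue_stopped  - netif_queue_stopped(dev) was true
 */
enum trace_stage last_stage(int have_skb, int lock_collision,
                            int queue_stopped)
{
        if (!have_skb)
                return STAGE_START;
        if (lock_collision)
                return STAGE_COLLISION;
        if (queue_stopped)
                return STAGE_QUEUE_CHECK;
        return STAGE_XMIT;
}
```

In this model the 'good' boot corresponds to last_stage(1, 0, 0) and the 'bad'
boot to last_stage(0, 0, 0). A stopped queue or a lock collision would each
have left a different last message in the log.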
