[BUG] 2.4.23-pre3 ifconfig hanging

"Michael G. Janicki" <mjanicki@chartconnect.com> · Mon, 8 Sep 2003 09:19:05 -0700 (PDT)




Symptom:

  After an interface (lo and eth0 tested) has been brought down with
ifconfig, further attempts to bring the interface down with ifconfig
hang, and ifconfig then hangs for all interfaces. Context switches jump
noticeably and ifconfig sits waiting in an ioctl(). This occurs with
kernel 2.4.23-pre3.  Kernel 2.4.23-pre2 does not exhibit this behavior.

  I've come across one or two others on lkml who have experienced this
with pre3, but there was not much information there to go on, so I
thought I'd try the net list.


> uname -a
Linux wolf 2.4.23-pre3 #1 Thu Sep 4 16:54:12 PDT 2003 i686 unknown

> ifconfig --version
net-tools 1.60
ifconfig 1.42 (2001-04-13)


Reproduce:

ifconfig lo up
ifconfig lo down
ifconfig lo up
ifconfig lo down (ifconfig is now hung for all interfaces)

(this occurs for eth0 as well)

-------------
When this occurs, ifconfig sits here:

execve("/sbin/ifconfig", ["ifconfig", "lo", "down"], [/* 33 vars */]) = 0
brk(0)                                  = 0x8055914
open("/etc/ld.so.preload", O_RDONLY)    = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY)      = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=52547, ...}) = 0
old_mmap(NULL, 52547, PROT_READ, MAP_PRIVATE, 3, 0) = 0x40015000
close(3)                                = 0
open("/lib/libc.so.6", O_RDONLY)        = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0h\222\1"..., 1024) = 1024
fstat64(3, {st_mode=S_IFREG|0755, st_size=5029105, ...}) = 0
old_mmap(NULL, 1191168, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x40022000
mprotect(0x4013b000, 40192, PROT_NONE)  = 0
old_mmap(0x4013b000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x119000) = 0x4013b000
old_mmap(0x40141000, 15616, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x40141000
close(3)                                = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40145000
munmap(0x40015000, 52547)               = 0
brk(0)                                  = 0x8055914
brk(0x805593c)                          = 0x805593c
brk(0x8056000)                          = 0x8056000
uname({sys="Linux", node="wolf", ...})  = 0
access("/proc/net", R_OK)               = 0
access("/proc/net/unix", R_OK)          = 0
socket(PF_UNIX, SOCK_DGRAM, 0)          = 3
socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 4
access("/proc/net/if_inet6", R_OK)      = -1 ENOENT (No such file or directory)
access("/proc/net/ax25", R_OK)          = -1 ENOENT (No such file or directory)
access("/proc/net/nr", R_OK)            = -1 ENOENT (No such file or directory)
access("/proc/net/ipx", R_OK)           = -1 ENOENT (No such file or directory)
access("/proc/net/appletalk", R_OK)     = -1 ENOENT (No such file or directory)
access("/proc/net/x25", R_OK)           = -1 ENOENT (No such file or directory)
ioctl(4, 0x8913, 0xbffff76c)            = 0
ioctl(4, 0x8914 <unfinished ...>

-------------

When lo is working:
> cat /proc/net/sockstat
sockets: used 2
TCP: inuse 0 orphan 0 tw 0 alloc 0 mem 0
UDP: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0

When lo is hung:
> cat /proc/net/sockstat
sockets: used 4
TCP: inuse 0 orphan 0 tw 0 alloc 0 mem 0
UDP: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0

-------------
Other notes:

When ifconfig hangs, context switches on this box in single-user mode
go from single digits to a steady 200+.
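
One possible reading of that number, offered as an assumption rather
than a measurement: the wait loop in the patch quoted further down
sleeps with schedule_timeout(1), i.e. one timer tick at a time. If HZ
is 100 (the usual value for 2.4 on i386) and each wakeup costs two
context switches on an otherwise idle box (idle task to ifconfig and
back), a task stuck in that loop would account for roughly 200
switches per second. The helper name below is hypothetical.

```c
#include <assert.h>

/* Rough estimate of the context switches per second caused by one task
 * that repeatedly sleeps for ticks_per_sleep timer ticks.
 * switches_per_wakeup is an assumption (2 = switch in, switch out). */
long estimated_switches_per_sec(long hz, long ticks_per_sleep,
                                long switches_per_wakeup)
{
    assert(hz > 0 && ticks_per_sleep > 0);
    return (hz / ticks_per_sleep) * switches_per_wakeup;
}
```

With hz = 100, ticks_per_sleep = 1 and switches_per_wakeup = 2 the
estimate is 200, which is in line with the steady 200+ seen here.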

Speculation:
(aka: I'm not very familiar with the network code, so this may be
a completely inaccurate assessment.)

As the problem seems to have been introduced in 2.4.23-pre3, I
thought I'd look at the patch to see if anything jumped out.  The
only thing that jumps out at me is that in net/core/dev.c a
while (test_bit(...)) loop has been replaced with a call to a common
function which instead does while (test_and_set_bit(...)).

[from patch-2.4.22-pre2-pre3.tar.bz2]

+++ linux-2.4.23-pre3/include/linux/netdevice.h 2003-09-03 15:18:12.000000000 -0
@@ -802,6 +802,38 @@
        local_irq_restore(flags);
 }

+static inline void netif_poll_disable(struct net_device *dev)
+{
+       while (test_and_set_bit(__LINK_STATE_RX_SCHED, &dev->state)) {
+               /* No hurry. */
+               current->state = TASK_INTERRUPTIBLE;
+               schedule_timeout(1);
+       }
+}


+++ linux-2.4.23-pre3/net/core/dev.c    2003-09-03 15:18:13.000000000 -0700
@@ -851,11 +851,7 @@
         * engine, but this requires more changes in devices. */

        smp_mb__after_clear_bit(); /* Commit netif_running(). */
-       while (test_bit(__LINK_STATE_RX_SCHED, &dev->state)) {
-               /* No hurry. */
-               current->state = TASK_INTERRUPTIBLE;
-               schedule_timeout(1);
-       }
+       netif_poll_disable(dev);


  Placing the original while loop back into net/core/dev.c seems to
clear up the problem, but that doesn't seem like a proper fix.  I'm
posting what I've seen while tracking this down so that those in the
know can have a look.
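
The behavioral difference between the two loops can be modelled in
userspace. The sketch below is not kernel code. The model_* helpers
are single-threaded stand-ins for the kernel bit operations,
RX_SCHED_BIT stands in for __LINK_STATE_RX_SCHED, and max_spins
replaces the schedule_timeout(1) sleep so the model terminates. It
also assumes that nothing clears the bit between the first and second
"down", which is consistent with the hang but has not been confirmed
against the pre3 source.

```c
#include <assert.h>
#include <stdbool.h>

#define RX_SCHED_BIT 0  /* stand-in for __LINK_STATE_RX_SCHED */

/* Non-atomic userspace stand-ins for the kernel helpers. */
static bool model_test_bit(int nr, const unsigned long *word)
{
    return (*word >> nr) & 1UL;
}

static bool model_test_and_set_bit(int nr, unsigned long *word)
{
    bool old = (*word >> nr) & 1UL;

    *word |= 1UL << nr;   /* the bit is left set as a side effect */
    return old;
}

/* Old loop: only observes the bit. Returns spins waited, -1 if stuck. */
int wait_old(unsigned long *state, int max_spins)
{
    int spins = 0;

    while (model_test_bit(RX_SCHED_BIT, state))
        if (++spins >= max_spins)
            return -1;
    return spins;
}

/* New loop (netif_poll_disable): takes the bit and keeps it. */
int wait_new(unsigned long *state, int max_spins)
{
    int spins = 0;

    while (model_test_and_set_bit(RX_SCHED_BIT, state))
        if (++spins >= max_spins)
            return -1;
    return spins;
}

/* Two consecutive "down" operations with nothing clearing the bit in
 * between. Returns the result of the second wait. */
int second_down_result(int (*wait)(unsigned long *, int))
{
    unsigned long state = 0;
    int first = wait(&state, 1000);

    assert(first == 0);   /* the first down never waits in either version */
    return wait(&state, 1000);
}
```

In this model second_down_result(wait_old) returns 0, while
second_down_result(wait_new) returns -1: the first call leaves the bit
set, so the second call waits on a bit that only it could have set.
That matches the up/down/up/down reproduction above, where the first
down succeeds and the second one hangs.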


-- 
Michael G. Janicki <mjanicki@chartconnect.com>

