This is a continuation of my previous mail:

>---------- Forwarded message ----------
>From: kapil dakhane <kdakhane@xxxxxxxxx>
>Date: Mon, Nov 30, 2009 at 6:02 PM
>Subject: soft lockup in inet_csk_get_port
>To: netdev@xxxxxxxxxxxxxxx
>Cc: netfilter@xxxxxxxxxxxxxxx
>
>
>Hello,
>
>I am trying to analyze the capacity of linux network stack on x6270
>which has 16 Hyper threads on two 8-core Intel(r) Xeon(r) CPU.

That report resulted in this patch:

>---------- Forwarded message ----------
>From: Eric Dumazet <eric.dumazet@xxxxxxxxx>
>Date: Wed, Dec 2, 2009 at 7:08 AM
>Subject: [PATCH net-next-2.6] tcp: connect() race with timewait reuse
>To: David Miller <davem@xxxxxxxxxxxxx>
>Cc: kdakhane@xxxxxxxxx, netdev@xxxxxxxxxxxxxxx, netfilter@xxxxxxxxxxxxxxx, Evgeniy Polyakov <zbr@xxxxxxxxxxx>
>

The test is exactly the same as before, except for the following changes:

1. The Linux kernel is now a snapshot of the net-next git tree maintained by Dave Miller. The snapshot was downloaded on Jan 28; the tarball name is net-next-2.6-d74340d.tar.gz, and uname shows 2.6.33-rc5 as the kernel version. It has all the fixes from the above-mentioned patch.

2. The platform is now an HS22, an IBM BladeCenter blade with an add-on 10 Gb Ethernet card from Broadcom, "Broadcom Corporation NetXtreme II BCM57710 10-Gigabit PCIe [Everest]". The CPU is the same as in the previous tests, "Intel(R) Xeon(R) CPU X5570 @ 2.93GHz". The test routes both ingress and egress traffic through this card with the help of VLANs. As in the previous tests, traffic was transparently captured and transparently forwarded.

3. The webproxy application now has its business logic enabled, as opposed to just data forwarding as in the previous tests.
Tuning parameters have remained the same:

core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 387089
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 262144
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 387089
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

net.ipv4.tcp_keepalive_intvl = 5
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_time = 180
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_max_syn_backlog = 8192
net.ipv4.tcp_max_tw_buckets = 512000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_syncookies = 0
net.core.netdev_max_backlog = 5000

mpstat output shows that CPU 9 is stuck in an infinite loop. This was observed after the test was terminated.
10:22:36 AM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle         intr/s
10:22:38 AM  all    0.00    0.00    0.00    0.06    0.00    6.25    0.00   93.69       16533.50
10:22:38 AM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00           0.00
10:22:38 AM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00           2.00
10:22:38 AM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00           0.00
10:22:38 AM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00           0.50
10:22:38 AM    4    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00           0.00
10:22:38 AM    5    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00           3.50
10:22:38 AM    6    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00           0.00
10:22:38 AM    7    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00           0.00
10:22:38 AM    8    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00  2147483647.00
10:22:38 AM    9    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00           0.00
10:22:38 AM   10    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00           0.00
10:22:38 AM   11    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00           0.00
10:22:38 AM   12    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00           0.00
10:22:38 AM   13    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00  2147483647.00
10:22:38 AM   14    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00           0.00
10:22:38 AM   15    0.00    0.00    0.00    1.00    0.00    0.00    0.00   99.00           3.50
10:22:38 AM   16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00           0.00
10:22:38 AM   17    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00           0.00

Feb 17 10:23:25 fusion-ch01-bl05 kernel: BUG: soft lockup - CPU#9 stuck for 61s! [webproxy:11957]
Feb 17 10:23:25 fusion-ch01-bl05 kernel: Modules linked in: xt_TPROXY xt_tcpudp xt_MARK xt_socket nf_conntrack nf_defrag_ipv4 nf_tproxy_core iptable_mangle ip_tables x_tables autofs4 hidp rfcomm l2cap crc16 bluetooth rfkill lockd sunrpc 8021q ipv6 dm_multipath scsi_dh video output sbs sbshc battery acpi_memhotplug ac parport_pc lp parport cdc_ether usbnet sg mii bnx2x serio_raw button tpm_tis tpm rtc_cmos rtc_core tpm_bios rtc_lib mdio bnx2 i2c_i801 i2c_core pcspkr dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod shpchp mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: freq_table]
Feb 17 10:23:25 fusion-ch01-bl05 kernel: CPU 9
Feb 17 10:23:25 fusion-ch01-bl05 kernel: Pid: 11957, comm: webproxy Tainted: G M W 2.6.33-rc5 #1 49Y5114 /IBM System x -[7870AC1]-
Feb 17 10:23:25 fusion-ch01-bl05 kernel: RIP: 0010:[<ffffffff8129c590>] [<ffffffff8129c590>] inet_csk_bind_conflict+0x5f/0xa6
Feb 17 10:23:25 fusion-ch01-bl05 kernel: RSP: 0018:ffff880c17929e30 EFLAGS: 00000202
Feb 17 10:23:25 fusion-ch01-bl05 kernel: RAX: ffffffff81461a01 RBX: ffff880c5d3205a0 RCX: ffff880bd45ef1e8
Feb 17 10:23:25 fusion-ch01-bl05 kernel: RDX: ffff880bd45ef1c0 RSI: 0000000000000000 RDI: ffff880674053300
Feb 17 10:23:25 fusion-ch01-bl05 kernel: RBP: ffffffff810031ce R08: 000000000001b20d R09: ffff880a8ecba128
Feb 17 10:23:25 fusion-ch01-bl05 kernel: R10: ffff880674053301 R11: ffffffff81132b11 R12: ffff880c17929ee8
Feb 17 10:23:25 fusion-ch01-bl05 kernel: R13: ffff880c00000000 R14: 0000000000000000 R15: ffff880c17929d90
Feb 17 10:23:25 fusion-ch01-bl05 kernel: FS: 00007fa55a6c6720(0000) GS:ffff880028340000(0000) knlGS:0000000000000000
Feb 17 10:23:25 fusion-ch01-bl05 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 17 10:23:25 fusion-ch01-bl05 kernel: CR2: 00007fa5269c8000 CR3: 0000000c1bd12000 CR4: 00000000000006e0
Feb 17 10:23:25 fusion-ch01-bl05 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 17 10:23:25 fusion-ch01-bl05 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Feb 17 10:23:25 fusion-ch01-bl05 kernel: Process webproxy (pid: 11957, threadinfo ffff880c17928000, task ffff880c17b700c0)
Feb 17 10:23:25 fusion-ch01-bl05 kernel: Stack:
Feb 17 10:23:25 fusion-ch01-bl05 kernel: ffffffff8129c789 0000000000000000 0000000500000000 00000000ffffffff
Feb 17 10:23:25 fusion-ch01-bl05 kernel: <0> 0000000000000000 0000000000000000 ffff880674053300 00000000ffffffea
Feb 17 10:23:25 fusion-ch01-bl05 kernel: <0> 0000000000051005 0000000000000001 ffff880c17929ec8 00007fa55e4923f0
Feb 17 10:23:25 fusion-ch01-bl05 kernel: Call Trace:
Feb 17 10:23:25 fusion-ch01-bl05 kernel: [<ffffffff8129c789>] ? inet_csk_get_port+0x1b2/0x29e
Feb 17 10:23:25 fusion-ch01-bl05 kernel: [<ffffffff812b9368>] ? inet_bind+0x10c/0x1c1
Feb 17 10:23:25 fusion-ch01-bl05 kernel: [<ffffffff8126497b>] ? sys_bind+0x6e/0x9e
Feb 17 10:23:25 fusion-ch01-bl05 kernel: [<ffffffff8106de32>] ? audit_syscall_entry+0x1b9/0x1e4
Feb 17 10:23:25 fusion-ch01-bl05 kernel: [<ffffffff8100286b>] ? system_call_fastpath+0x16/0x1b
Feb 17 10:23:25 fusion-ch01-bl05 kernel: Code: 40 62 10 eb 04 f6 42 54 01 75 44 8b 77 20 85 f6 74 0b 8b 42 20 85 c0 74 04 39 c6 75 32 45 84 d2 74 0d 80 7a 1f 00 74 07 8a 42 1e <3c> 0a 75 20 8a 42 1e 3c 06 74 08 8b 82 24 02 00 00 eb 03 8b 42
Feb 17 10:23:25 fusion-ch01-bl05 kernel: Call Trace:
Feb 17 10:23:25 fusion-ch01-bl05 kernel: [<ffffffff8129c789>] ? inet_csk_get_port+0x1b2/0x29e
Feb 17 10:23:25 fusion-ch01-bl05 kernel: [<ffffffff812b9368>] ? inet_bind+0x10c/0x1c1
Feb 17 10:23:25 fusion-ch01-bl05 kernel: [<ffffffff8126497b>] ? sys_bind+0x6e/0x9e
Feb 17 10:23:25 fusion-ch01-bl05 kernel: [<ffffffff8106de32>] ? audit_syscall_entry+0x1b9/0x1e4
Feb 17 10:23:25 fusion-ch01-bl05 kernel: [<ffffffff8100286b>] ? system_call_fastpath+0x16/0x1b
Feb 17 10:23:25 fusion-ch01-bl05 kernel: [bnx2x_timer:4677(eth3)]drv_pulse (0x3104) != mcp_pulse (0x3854)
Feb 17 10:23:26 fusion-ch01-bl05 kernel: [bnx2x_timer:4677(eth3)]drv_pulse (0x3105) != mcp_pulse (0x3854)
Feb 17 10:23:27 fusion-ch01-bl05 kernel: [bnx2x_timer:4677(eth3)]drv_pulse (0x3106) != mcp_pulse (0x3854)
Feb 17 10:23:28 fusion-ch01-bl05 kernel: [bnx2x_timer:4677(eth3)]drv_pulse (0x3107) != mcp_pulse (0x3854)

These messages keep repeating every 60 seconds.

To me it looks like there are more code paths that lead to the same corruption as in the previous issue.

Kapil
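
For anyone who wants to screen mpstat output like the sample in this mail for a CPU pinned in softirq, here is a minimal Python sketch. It is not part of the test setup. It assumes the 12-column layout shown above (time, AM/PM, CPU, %user, %nice, %sys, %iowait, %irq, %soft, %steal, %idle, intr/s); other sysstat versions print different columns. The function name stuck_softirq_cpus and the 99% threshold are hypothetical choices made for illustration.

```python
def stuck_softirq_cpus(mpstat_text, soft_threshold=99.0):
    """Return the CPU ids whose %soft is at or above soft_threshold.

    Assumes rows shaped like the sample in this mail:
    time, AM/PM, CPU, %user, %nice, %sys, %iowait, %irq, %soft,
    %steal, %idle, intr/s (12 whitespace-separated fields).
    """
    stuck = []
    for line in mpstat_text.splitlines():
        fields = line.split()
        # Skip the header row, the "all" summary row, and anything
        # that does not match the assumed 12-field layout.
        if len(fields) != 12 or not fields[2].isdigit():
            continue
        soft = float(fields[8])  # %soft column in the assumed layout
        if soft >= soft_threshold:
            stuck.append(int(fields[2]))
    return stuck


# A few rows copied from the mpstat output in this mail.
SAMPLE = """\
10:22:36 AM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
10:22:38 AM all 0.00 0.00 0.00 0.06 0.00 6.25 0.00 93.69 16533.50
10:22:38 AM 8 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 2147483647.00
10:22:38 AM 9 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
"""

print(stuck_softirq_cpus(SAMPLE))
```

On the rows copied here it reports only CPU 9, which matches the CPU named in the soft lockup message.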