On Wed, Feb 27, 2013 at 10:26 PM, Ronciak, John <john.ronciak@xxxxxxxxx> wrote:
> Are both NIC's new? They are of the same family so maybe the "10e6" NIC
> was somehow damaged. If that NIC is the only card having problem in that
> exact slot it would guess that it's that NIC that is bad.

both NIC are new. And we have 10 numbers each of those cards. we tested
all the ten 8086:10e6 nic, but same problem happens. How can you confirm
this is a real hw problem ?

-Ratheesh

> Cheers,
> John
>
>
>> -----Original Message-----
>> From: ratheesh kannoth [mailto:ratheesh.ksz@xxxxxxxxx]
>> Sent: Wednesday, February 27, 2013 8:51 AM
>> To: Ronciak, John
>> Cc: e1000-devel@xxxxxxxxxxxxxxxxxxxxx; linux-pci@xxxxxxxxxxxxxxx
>> Subject: Re: [E1000-devel] pcie error
>>
>> On Wed, Feb 27, 2013 at 10:07 PM, Ronciak, John
>> <john.ronciak@xxxxxxxxx> wrote:
>> > Looks like you have a HW problem. Is this a new motherboard?
>> > Something you built? Can you take out all the devices from the system
>> > (possibly using the BIOS to m/b based devices) and see if the problem
>> > is still happening?
>>
>> This is a new motherboard. But we have tried a similar pci express nic
>> card of 8086:10c9. But it works fine. But when we try with nic of
>> 8086:10e6 , this problem happens.
>>
>> the pci express error gets propagated to root node ? and fails there ?.
>>
>> Which hw is having problem ? the pci card or mother board ? how can i
>> conclude ?
>>
>>
>> Thanks
>>
>> >> -----Original Message-----
>> >> From: ratheesh kannoth [mailto:ratheesh.ksz@xxxxxxxxx]
>> >> Sent: Wednesday, February 27, 2013 8:30 AM
>> >> To: Ronciak, John
>> >> Cc: e1000-devel@xxxxxxxxxxxxxxxxxxxxx; linux-pci@xxxxxxxxxxxxxxx
>> >> Subject: Re: [E1000-devel] pcie error
>> >>
>> >> Hi John,
>> >>
>> >> Thanks a lot for your reply.
>> >>
>> >> I have added a pci-express nic card in the pci -express system slot .
>> >> This nic card is 8086:10e6 based. I could see the error when i send
>> >> traffic thru this port and kernel panic. when i looked at
>> >> /var/log/messages , i could see
>> >>
>> >> aer_isr_one_error->can't find device of ID0000
>> >> aer_isr_one_error->can't find device of ID0000
>> >> aer_isr_one_error->can't find device of ID0000
>> >> aer_isr_one_error->can't find device of ID0000
>> >> .....
>> >> ....
>> >> +------ PCI-Express Device Error ------+
>> >> Error Severity : Uncorrected (Non-Fatal)
>> >> PCIE Bus Error type : Transaction Layer
>> >> Completion Timeout : Multiple
>> >> Requester ID : 0028
>> >> VendorID=8086h, DeviceID=d13ah, Bus=00h, Device=05h, Function=00h
>> >> igb: ge1_0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
>> >> igb: ge1_1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
>> >>
>> >>
>> >> [ kernel panic console message ]
>> >>
>> >> HARDWARE ERROR
>> >> CPU 7: Machine Check Exception: 4 Bank 8: 0000000000000000
>> >> TSC 0
>> >> This is not a software problem!
>> >> Run through mcelog --ascii to decode and contact your hardware vendor
>> >> Kernel panic - not syncing: Machine check
>> >> ------------[ cut here ]------------
>> >> WARNING: at kernel/smp.c:329 smp_call_function_many+0x40/0x1e5()
>> >> Hardware name: 342?
>> >> Modules linked in: nf_conntrack_ipv4 nf_defrag_ipv4 xt_state
>> >> nf_conntrack xt_tcpudp iptable_filter ip_tables x_tables bnx2 e100
>> >> mii igb_cids ixgbe_cids e1000_cids cids_shared bpctl_mod cidmodcap
>> >> cpp_base(P) linux_user_bde(P) linux_kernel_bde(P)
>> >> Pid: 3491, comm: sensorApp Tainted: P 2.6.29.1 #14
>> >> Call Trace:
>> >> <#MC> [<ffffffff8023a34f>] warn_slowpath+0xd3/0x10f
>> >> [<ffffffff80220733>] ? default_spin_lock_flags+0x9/0xe
>> >> [<ffffffff8023aa9a>] ? release_console_sem+0x199/0x1ce
>> >> [<ffffffff8050dff7>] ? printk+0x67/0x70
>> >> [<ffffffff80220733>] ? default_spin_lock_flags+0x9/0xe
>> >> [<ffffffff8025827f>] smp_call_function_many+0x40/0x1e5
>> >> [<ffffffff80211507>] ? stop_this_cpu+0x0/0x2c
>> >> [<ffffffff8023aa9a>] ? release_console_sem+0x199/0x1ce
>> >> [<ffffffff80258444>] smp_call_function+0x20/0x24
>> >> [<ffffffff8021b37a>] native_smp_send_stop+0x22/0x49
>> >> [<ffffffff8050dee6>] panic+0xa8/0x152
>> >> [<ffffffff8023a4b7>] ? oops_enter+0xe/0x10
>> >> [<ffffffff805112dc>] ? oops_begin+0x7e/0x8c
>> >> [<ffffffff80216da4>] ? print_mce+0xe8/0xec
>> >> [<ffffffff80216e15>] mce_log+0x0/0x7f
>> >> [<ffffffff802171d7>] do_machine_check+0x302/0x3d7
>> >> [<ffffffff8051076b>] machine_check+0x1b/0x20
>> >> <<EOE>> <4>---[ end trace 877905393052419b ]---
>> >> Rebooting in 1 seconds..
>> >>
>> >>
>> >> 1. is there any way to narrow down the system error ?
>> >> 2. any clue or hint is really appreciated.
>> >>
>> >> -Ratheesh
>> >>
>> >>
>> >> On Wed, Feb 27, 2013 at 9:48 PM, Ronciak, John
>> >> <john.ronciak@xxxxxxxxx> wrote:
>> >> > The "d13a" device is not a networking device. So I'm not sure what
>> >> > you cut from the logs but the igb messages have nothing to do with
>> >> > this device. According to the Device ID's repository the "d13a"
>> >> > device is a "Core Processor PCI Express Root Port 3".
>> >> >
>> >> > So this isn't a networking device error but some sort of system
>> >> > error.
>> >> >
>> >> > Cheers,
>> >> > John
>> >> >
>> >> >
>> >> >> -----Original Message-----
>> >> >> From: ratheesh kannoth [mailto:ratheesh.ksz@xxxxxxxxx]
>> >> >> Sent: Wednesday, February 27, 2013 2:40 AM
>> >> >> To: e1000-devel@xxxxxxxxxxxxxxxxxxxxx; linux-pci@xxxxxxxxxxxxxxx
>> >> >> Subject: [E1000-devel] pcie error
>> >> >>
>> >> >> I am getting an error when i send traffic thru 8086:10e6 device
>> >> >>
>> >> >> +------ PCI-Express Device Error ------+
>> >> >> Error Severity : Uncorrected (Non-Fatal)
>> >> >> PCIE Bus Error type : Transaction Layer
>> >> >> Completion Timeout : Multiple
>> >> >> Requester ID : 0028
>> >> >> VendorID=8086h, DeviceID=d13ah, Bus=00h, Device=05h, Function=00h
>> >> >> igb: ge1_0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
>> >> >> igb: ge1_1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
>> >> >>
>> >> >> I have added output of lspci -m and lspci -vvt .
>> >> >>
>> >> >> 1. How can we confirm this is s/w or hw problem ?
>> >> >> 2. Any clue or hint on how to debug is really appreciated ?
>> >> >>
>> >> >>
>> >> >> bash-3.2# lspci -m
>> >> >> 00:00.0 "Class 0600" "Vendor 8086" "Device d130" -r11 "Unknown vendor 105b" "Device 0d61"
>> >> >> 00:03.0 "Class 0604" "Vendor 8086" "Device d138" -r11 "" ""
>> >> >> 00:05.0 "Class 0604" "Vendor 8086" "Device d13a" -r11 "" ""
>> >> >> 00:08.0 "Class 0880" "Vendor 8086" "Device d155" -r11 "Unknown vendor 005b" "Device 0061"
>> >> >> 00:08.1 "Class 0880" "Vendor 8086" "Device d156" -r11 "Unknown vendor 005b" "Device 0061"
>> >> >> 00:08.2 "Class 0880" "Vendor 8086" "Device d157" -r11 "Unknown vendor 005b" "Device 0061"
>> >> >> 00:08.3 "Class 0880" "Vendor 8086" "Device d158" -r11 "Unknown vendor 005b" "Device 0061"
>> >> >> 00:10.0 "Class 0880" "Vendor 8086" "Device d150" -r11 "Unknown vendor 005b" "Device 0061"
>> >> >> 00:10.1 "Class 0880" "Vendor 8086" "Device d151" -r11 "Unknown vendor 005b" "Device 0061"
>> >> >> 00:1a.0 "Class 0c03" "Vendor 8086" "Device 3b3c" -r06 -p20 "Unknown vendor 105b" "Device 0d61"
>> >> >> 00:1c.0 "Class 0604" "Vendor 8086" "Device 3b42" -r06 "" ""
>> >> >> 00:1c.4 "Class 0604" "Vendor 8086" "Device 3b4a" -r06 "" ""
>> >> >> 00:1c.5 "Class 0604" "Vendor 8086" "Device 3b4c" -r06 "" ""
>> >> >> 00:1d.0 "Class 0c03" "Vendor 8086" "Device 3b34" -r06 -p20 "Unknown vendor 105b" "Device 0d61"
>> >> >> 00:1e.0 "Class 0604" "Vendor 8086" "Device 244e" -ra6 -p01 "" ""
>> >> >> 00:1f.0 "Class 0601" "Vendor 8086" "Device 3b16" -r06 "Unknown vendor 105b" "Device 0d61"
>> >> >> 00:1f.2 "Class 0104" "Vendor 8086" "Device 2822" -r06 "Unknown vendor 105b" "Device 0d61"
>> >> >> 00:1f.3 "Class 0c05" "Vendor 8086" "Device 3b30" -r06 "Unknown vendor 105b" "Device 0d61"
>> >> >> 01:00.0 "Class 0604" "Vendor 10b5" "Device 8618" -rba "" ""
>> >> >> 02:01.0 "Class 0604" "Vendor 10b5" "Device 8618" -rba "" ""
>> >> >> 02:03.0 "Class 0604" "Vendor 10b5" "Device 8618" -rba "" ""
>> >> >> 02:05.0 "Class 0604" "Vendor 10b5" "Device 8618" -rba "" ""
>> >> >> 02:07.0 "Class 0604" "Vendor 10b5" "Device 8618" -rba "" ""
>> >> >> 02:09.0 "Class 0604" "Vendor 10b5" "Device 8618" -rba "" ""
>> >> >> 02:0b.0 "Class 0604" "Vendor 10b5" "Device 8618" -rba "" ""
>> >> >> 02:0d.0 "Class 0604" "Vendor 10b5" "Device 8618" -rba "" ""
>> >> >> 02:0f.0 "Class 0604" "Vendor 10b5" "Device 8618" -rba "" ""
>> >> >> 03:00.0 "Class 0200" "Vendor 8086" "Device 10d3" "Unknown vendor 8086" "Device 0000"
>> >> >> 04:00.0 "Class 0200" "Vendor 8086" "Device 10d3" "Unknown vendor 8086" "Device 0000"
>> >> >> 05:00.0 "Class 0200" "Vendor 8086" "Device 10d3" "Unknown vendor 8086" "Device 0000"
>> >> >> 06:00.0 "Class 0200" "Vendor 8086" "Device 10d3" "Unknown vendor 8086" "Device 0000"
>> >> >> 07:00.0 "Class 0200" "Vendor 8086" "Device 10d3" "Unknown vendor 8086" "Device 0000"
>> >> >> 08:00.0 "Class 0200" "Vendor 8086" "Device 10d3" "Unknown vendor 8086" "Device 0000"
>> >> >> 09:00.0 "Class 0200" "Vendor 8086" "Device 10d3" "Unknown vendor 8086" "Device 0000"
>> >> >> 0a:00.0 "Class 0200" "Vendor 8086" "Device 10d3" "Unknown vendor 8086" "Device 0000"
>> >> >> 0b:00.0 "Class 0604" "Vendor 10b5" "Device 8624" -rbb "" ""
>> >> >> 0c:04.0 "Class 0604" "Vendor 10b5" "Device 8624" -rbb "" ""
>> >> >> 0c:05.0 "Class 0604" "Vendor 10b5" "Device 8624" -rbb "" ""
>> >> >> 0c:08.0 "Class 0604" "Vendor 10b5" "Device 8624" -rbb "" ""
>> >> >> 0c:09.0 "Class 0604" "Vendor 10b5" "Device 8624" -rbb "" ""
>> >> >> 0e:00.0 "Class 0604" "Vendor 10b5" "Device 8518" -rac "" ""
>> >> >> 0f:01.0 "Class 0604" "Vendor 10b5" "Device 8518" -rac "" ""
>> >> >> 0f:02.0 "Class 0604" "Vendor 10b5" "Device 8518" -rac "" ""
>> >> >> 10:00.0 "Class 0200" "Vendor 8086" "Device 10e6" -r01 "Unknown vendor 1374" "Device 0b60"
>> >> >> 10:00.1 "Class 0200" "Vendor 8086" "Device 10e6" -r01 "Unknown vendor 1374" "Device 0b60"
>> >> >> 11:00.0 "Class 0200" "Vendor 8086" "Device 10e6" -r01 "Unknown vendor 1374" "Device 0b60"
>> >> >> 11:00.1 "Class 0200" "Vendor 8086" "Device 10e6" -r01 "Unknown vendor 1374" "Device 0b60"
>> >> >> 12:00.0 "Class 0b40" "Vendor 1000" "Device 0a05" -r01 "Unknown vendor 1000" "Device 0a09"
>> >> >> 14:00.0 "Class 1000" "Vendor 177d" "Device 0010" -r01 "Unknown vendor 177d" "Device 0001"
>> >> >> 15:00.0 "Class 0200" "Vendor 8086" "Device 10d3" "Unknown vendor 8086" "Device 0000"
>> >> >> 16:00.0 "Class 0604" "Vendor 1a03" "Device 1150" -r02 "" ""
>> >> >> 17:00.0 "Class 0300" "Vendor 1a03" "Device 2000" -r10 "Unknown vendor 1a03" "Device 2000"
>> >> >>
>> >> >>
>> >> >> bash-3.2# lspci -tvv
>> >> >> -[0000:00]-+-00.0 Device 8086:d130
>> >> >>            +-03.0-[0000:01-0a]----00.0-[0000:02-0a]--+-01.0-[0000:03]----00.0 Device 8086:10d3
>> >> >>            |                                         +-03.0-[0000:04]----00.0 Device 8086:10d3
>> >> >>            |                                         +-05.0-[0000:05]----00.0 Device 8086:10d3
>> >> >>            |                                         +-07.0-[0000:06]----00.0 Device 8086:10d3
>> >> >>            |                                         +-09.0-[0000:07]----00.0 Device 8086:10d3
>> >> >>            |                                         +-0b.0-[0000:08]----00.0 Device 8086:10d3
>> >> >>            |                                         +-0d.0-[0000:09]----00.0 Device 8086:10d3
>> >> >>            |                                         \-0f.0-[0000:0a]----00.0 Device 8086:10d3
>> >> >>            +-05.0-[0000:0b-13]----00.0-[0000:0c-13]--+-04.0-[0000:0d]--
>> >> >>            |                                         +-05.0-[0000:0e-11]----00.0-[0000:0f-11]--+-01.0-[0000:10]--+-00.0 Device 8086:10e6
>> >> >>            |                                         |                                         |                 \-00.1 Device 8086:10e6
>> >> >>            |                                         |                                         \-02.0-[0000:11]--+-00.0 Device 8086:10e6
>> >> >>            |                                         |                                                           \-00.1 Device 8086:10e6
>> >> >>            |                                         +-08.0-[0000:12]----00.0 Device 1000:0a05
>> >> >>            |                                         \-09.0-[0000:13]--
>> >> >>            +-08.0 Device 8086:d155
>> >> >>            +-08.1 Device 8086:d156
>> >> >>            +-08.2 Device 8086:d157
>> >> >>            +-08.3 Device 8086:d158
>> >> >>            +-10.0 Device 8086:d150
>> >> >>            +-10.1 Device 8086:d151
>> >> >>            +-1a.0 Device 8086:3b3c
>> >> >>            +-1c.0-[0000:14]----00.0 Device 177d:0010
>> >> >>            +-1c.4-[0000:15]----00.0 Device 8086:10d3
>> >> >>            +-1c.5-[0000:16-17]----00.0-[0000:17]----00.0 Device 1a03:2000
>> >> >>            +-1d.0 Device 8086:3b34
>> >> >>            +-1e.0-[0000:18]--
>> >> >>            +-1f.0 Device 8086:3b16
>> >> >>            +-1f.2 Device 8086:2822
>> >> >>            \-1f.3 Device 8086:3b30
>> >> >>
>> >> >>
>> >> >> Thanks,
>> >> >> Ratheesh
>> >> >>
>> >> >> _______________________________________________
>> >> >> E1000-devel mailing list
>> >> >> E1000-devel@xxxxxxxxxxxxxxxxxxxxx
>> >> >> https://lists.sourceforge.net/lists/listinfo/e1000-devel
>> >> >> To learn more about Intel® Ethernet, visit
>> >> >> http://communities.intel.com/community/wired
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
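
A note on reading the AER report quoted in this thread: the "Requester ID : 0028" field is a packed bus/device/function value, and it can be checked by hand against the "Bus=00h, Device=05h, Function=00h" line and the lspci tree. The sketch below is only an illustration. The helper names decode_requester_id and behind_bridge are made up for this note and are not part of lspci or the kernel. It assumes the usual 16-bit layout (bus in bits 15:8, device in bits 7:3, function in bits 2:0) and takes the bus range 0b-13 for 00:05.0 from the "lspci -tvv" output posted above.

```python
# Sketch only: decode a PCIe Requester ID and check which buses sit behind a
# bridge. Helper names are hypothetical; the values come from the quoted logs.

def decode_requester_id(rid):
    """Split a 16-bit Requester ID into (bus, device, function)."""
    bus = (rid >> 8) & 0xFF       # bits 15:8
    device = (rid >> 3) & 0x1F    # bits 7:3
    function = rid & 0x07         # bits 2:0
    return bus, device, function


def behind_bridge(bus, secondary, subordinate):
    """True if 'bus' lies in the bridge's [secondary, subordinate] range."""
    return secondary <= bus <= subordinate


if __name__ == "__main__":
    bus, dev, fn = decode_requester_id(int("0028", 16))
    print("%02x:%02x.%d" % (bus, dev, fn))  # prints 00:05.0

    # 00:05.0 is shown as -[0000:0b-13] in the lspci tree, so the 10e6 ports
    # on buses 10 and 11 are below it, while the 10d3 port on bus 03 is not.
    for nic_bus in (0x10, 0x11, 0x03):
        print("%02x" % nic_bus, behind_bridge(nic_bus, 0x0B, 0x13))
```

This only confirms that the reporting root port (00:05.0) is the one above the 8086:10e6 ports. It does not by itself say whether the card, the switches in between, or the board is at fault.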