We've recently been getting a lot of kernel oopses on one of our servers. It's acting as both an NFS server and client, usually running our latest NFSv4 code, so our first impulse was to assume the fault was ours.

But we eventually noticed that one of the disks in the RAID 1 array that we're exporting had actually failed without our realizing it. Replacing the disk seemed to fix the problems.

Of course we expect bad things to happen in that situation, but I assume a failed disk shouldn't cause kernel crashes.

More details appended below, with sample oopses from our logs; let us know if any more information would be useful. Unfortunately, we need this machine for other work so we probably can't afford to swap the bad disk back in to reproduce the problem, but maybe this is of use to someone?

--b.

----- Forwarded message from Kevin Coffman <kwc@xxxxxxxxxxxxxx> -----

Date: Thu, 27 Jul 2006 13:50:58 -0400
From: Kevin Coffman <kwc@xxxxxxxxxxxxxx>
To: "J. Bruce Fields" <bfields@xxxxxxxxxxxx>
Subject: screamer disk error fallout
Cc: Olga Kornievskaia <aglo@xxxxxxxxxxxxxx>, Andy Adamson <andros@xxxxxxxxxxxxxx>, Kevin Coffman <kwc@xxxxxxxxxxxxxx>

The solution seems to have been replacing a failed disk in the RAID 1 array, /dev/sdb.

RAID controller: Adaptec 2005S
Disk drives in array: Seagate ST373405LCV

Kernel config is attached. Let me know what other info would be helpful.

$ df -k
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda1               497829    302572    169555  65% /
/dev/sda7               497829      8288    463839   2% /home
none                    254724         0    254724   0% /dev/shm
/dev/sda9               295564      8250    272054   3% /tmp
/dev/sda2              4127108   2054508   1862952  53% /usr
/dev/sda3              4127108    178284   3739176   5% /usr/local
/dev/sda5              4127076    249056   3668376   7% /usr/src
/dev/sda8               497829    129403    342724  28% /var
/dev/sdb1             69436796  25518876  40333824  39% /export/home
/dev/sdc1             70557052     32816  66940140   1% /export/home/OSG-ITB/Data
/dev/sdd1             17639220     32816  16710384   1% /export/home/OSG-ITB/Temp-shared
/bakeathon              497829    302572    169555  65% /export/bakeathon
troy:/vol/home       429496736 205484720 224012016  48% /nfs/home
novi:/vol/backup     943718400 816932992 126785408  87% /nfs/backup
$
$ /sbin/lspci
00:00.0 Host bridge: Broadcom CNB20HE Host Bridge (rev 23)
00:00.1 PCI bridge: Broadcom CNB20LE Host Bridge (rev 01)
00:00.2 Host bridge: Broadcom CNB20HE Host Bridge (rev 01)
00:00.3 Host bridge: Broadcom CNB20HE Host Bridge (rev 01)
00:03.0 RAID bus controller: Adaptec (formerly DPT) SmartRAID V Controller (rev 01)
00:04.0 Ethernet controller: Intel Corporation 82557/8/9 [Ethernet Pro 100] (rev 08)
00:06.0 Ethernet controller: Intel Corporation 82557/8/9 [Ethernet Pro 100] (rev 08)
00:0f.0 ISA bridge: Broadcom CSB5 South Bridge (rev 93)
00:0f.1 IDE interface: Broadcom CSB5 IDE Controller (rev 93)
00:0f.2 USB Controller: Broadcom OSB4/CSB5 OHCI USB Controller (rev 05)
00:0f.3 Host bridge: Broadcom CSB5 LPC bridge
01:00.0 VGA compatible controller: ATI Technologies Inc Rage XL AGP 2X (rev 27)
02:02.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5701 Gigabit Ethernet (rev 15)
$

A sampling of the oops

******************************************************************************************

dereference at virtual address 00000528
 printing eip:
c1055efa
*pde = 1fe3a001
Oops: 0000 [#1]
SMP
CPU: 0
EIP: 0060:[<c1055efa>] Not tainted VLI
EFLAGS: 00010206 (2.6.17-CITI_NFS4_ALL-1 #1)
EIP is at sync_buffer+0xc/0x33
eax: 00000524 ebx: db073b10 ecx: db073b24 edx: c8f53a4c
esi: db073b10 edi: c1e25888 ebp: c1055eee esp: db073acc
ds: 007b es: 007b ss: 0068
Process nfsd (pid: 2877, threadinfo=db073000 task=db072ab0)
Stack: c152a198 db073b10 c8f53a4c db073b0c 00000002 c152a235 00000002 c1055eee
       c1e25888 00000000 00000000 00000000 00000000 00000000 00000000 00000000
       00000110 c8f53a4c 00000002 00000001 db072ab0 c102ba58 c1e25898 c1e25898
Call Trace:
 <c152a198> __wait_on_bit_lock+0x2a/0x52
 <c152a235> out_of_line_wait_on_bit_lock+0x75/0x7d
 <c1055eee> sync_buffer+0x0/0x33
 <c102ba58> wake_bit_function+0x0/0x3c
 <c105603c> __lock_buffer+0x21/0x24
 <c10c7d70> journal_invalidatepage+0x8f/0x338
 <c10bb490> ext3_invalidatepage+0x0/0x2d
 <c1054c1d> do_invalidatepage+0x16/0x18
 <c103e62b> truncate_complete_page+0x18/0x3a
 <c103e6f1> truncate_inode_pages_range+0xa4/0x266
 <c103e8bc> truncate_inode_pages+0x9/0xd
 <c10bb92a> ext3_delete_inode+0x13/0xba
 <c10bb917> ext3_delete_inode+0x0/0xba
 <c1068e99> generic_delete_inode+0x90/0xfc
 <c106895f> iput+0x64/0x66
 <c1068457> d_delete+0x3c/0xcb
 <c105fdd2> vfs_unlink+0x96/0xb5
 <c1117977> nfsd_unlink+0x1a2/0x1fa
 <c11226d2> nfsd4_proc_compound+0xe83/0x15ad
 <c14d4afb> ipt_do_table+0x2b7/0x2e0
 <c1238118> copy_to_user+0x4a/0x5e
 <c1477d2c> memcpy_toiovec+0x27/0x4a
 <c104fdbd> cache_free_debugcheck+0x1f7/0x1ff
 <c1473cd6> release_sock+0x10/0x9b
 <c14a93ba> tcp_recvmsg+0x622/0x72b
 <c1473a1e> sock_common_recvmsg+0x2f/0x45
 <c1471f8b> sock_recvmsg+0xc9/0xe4
 <c102ba2b> autoremove_wake_function+0x0/0x2d
 <c1017096> activate_task+0x5a/0xa0
 <c10173f2> try_to_wake_up+0x316/0x320
 <c1016442> __wake_up_common+0x2f/0x53
 <c1018091> __wake_up+0x2a/0x3d
 <c1508d8d> svc_sock_enqueue+0x1db/0x219
 <c150a229> svc_tcp_recvfrom+0x672/0x6dd
 <c152b2f0> _spin_unlock_irq+0x5/0x7
 <c152991f> schedule+0xa1b/0xa7c
 <c150d729> sunrpc_cache_lookup+0x4b/0xf9
 <c1124a4d> nfsd4_decode_compound+0x2fe/0xce1
 <c1125430> nfs4svc_decode_compoundargs+0x0/0x50
 <c1114e8f> nfsd_dispatch+0xbb/0x170
 <c150829d> svc_process+0x3b2/0x60d
 <c1115292> nfsd+0x190/0x2ea
 <c1115102> nfsd+0x0/0x2ea
 <c10016c5> kernel_thread_helper+0x5/0xb
Code: 30 85 c0 75 07 89 d8 e8 90 ff ff ff f0 0f ba 33 03 89 d8 5b 5e 5f e9 d9 f0 ff ff 5b 5e 5f c3 f0 83 04 24 00 8b 40 1c 85 c0 74 1f <8b> 40 04 8b 80 e4 00 00 00 85 c0 74 12 8b 40 58 85 c0 74 0b 8b
EIP: [<c1055efa>] sync_buffer+0xc/0x33 SS:ESP 0068:db073acc
BUG: nfsd/2877, lock held at task exit time!
 [db2be340] {inode_init_once}
.. held by: nfsd: 2877 [db072ab0, 116]
... acquired at: nfsd_unlink+0xd0/0x1fa
BUG: unable to handle kernel paging request at virtual address 6b6b6b6b
 printing eip:
6b6b6b6b
*pde = 6b6b6b6b
Oops: 0000 [#2]
SMP
CPU: 1
EIP: 0060:[<6b6b6b6b>] Not tainted VLI
EFLAGS: 00010012 (2.6.17-CITI_NFS4_ALL-1 #1)
EIP is at 0x6b6b6b6b
eax: db073b18 ebx: db073b18 ecx: 00000000 edx: 00000003
esi: 6b6b6b6b edi: 00000001 ebp: c1850e8c esp: c1850e6c
ds: 007b es: 007b ss: 0068
Process swapper (pid: 0, threadinfo=c1850000 task=dffcc550)
Stack: c1016442 c1850ebc 00000003 c1e25888 6b6b6b6b c1e25888 c1850ebc 00000001
       c1850eb0 c1018091 00000000 c1850ebc 00000003 00000296 c1e25888 00001000
       c1056603 00000000 c102ba10 c1850ebc c9f53b4c 00000002 ca583f90 c1056631
Call Trace:
 <c1016442> __wake_up_common+0x2f/0x53
 <c1018091> __wake_up+0x2a/0x3d
 <c1056603> end_bio_bh_io_sync+0x0/0x39
 <c102ba10> __wake_up_bit+0x29/0x2e
 <c1056631> end_bio_bh_io_sync+0x2e/0x39
 <c1058218> bio_endio+0x50/0x55
 <c122938d> __end_that_request_first+0x184/0x478
 <c104fdbd> cache_free_debugcheck+0x1f7/0x1ff
 <c13746d4> scsi_end_request+0x1e/0xad
 <c137497e> scsi_io_completion+0x21b/0x3f1
 <c1402077> sd_rw_intr+0x27e/0x2a0
 <c1370592> scsi_finish_command+0xb8/0xbd
 <c122abed> blk_done_softirq+0x5d/0x69
 <c1020887> __do_softirq+0x58/0xc2
 <c10056f2> do_softirq+0x46/0x50
 =======================
 <c10056a3> do_IRQ+0x72/0x7b
 <c1003c3a> common_interrupt+0x1a/0x20
 <c10024a7> default_idle+0x0/0x55
 <c10024d3> default_idle+0x2c/0x55
 <c1002555> cpu_idle+0x59/0x6e
Code: Bad EIP value.
EIP: [<6b6b6b6b>] 0x6b6b6b6b SS:ESP 0068:c1850e6c
<0>Kernel panic - not syncing: Fatal exception in interrupt
BUG: warning at arch/i386/kernel/smp.c:537/smp_call_function()
 <c100d4a2> smp_call_function+0x52/0xc0
 <c101ccda> printk+0x14/0x18
 <c100d523> smp_send_stop+0x13/0x1c
 <c101c388> panic+0x45/0xdd
 <c10045e2> die+0x242/0x276
 <c101263b> do_page_fault+0x512/0x60a
 <c1017096> activate_task+0x5a/0xa0
 <c1012129> do_page_fault+0x0/0x60a
 <c1003d93> error_code+0x4f/0x54
 <c1016442> __wake_up_common+0x2f/0x53
 <c1018091> __wake_up+0x2a/0x3d
 <c1056603> end_bio_bh_io_sync+0x0/0x39
 <c102ba10> __wake_up_bit+0x29/0x2e
 <c1056631> end_bio_bh_io_sync+0x2e/0x39
 <c1058218> bio_endio+0x50/0x55
 <c122938d> __end_that_request_first+0x184/0x478
 <c104fdbd> cache_free_debugcheck+0x1f7/0x1ff
 <c13746d4> scsi_end_request+0x1e/0xad
 <c137497e> scsi_io_completion+0x21b/0x3f1
 <c1402077> sd_rw_intr+0x27e/0x2a0
 <c1370592> scsi_finish_command+0xb8/0xbd
 <c122abed> blk_done_softirq+0x5d/0x69
 <c1020887> __do_softirq+0x58/0xc2
 <c10056f2> do_softirq+0x46/0x50
 =======================
 <c10056a3> do_IRQ+0x72/0x7b
 <c1003c3a> common_interrupt+0x1a/0x20
 <c10024a7> default_idle+0x0/0x55
 <c10024d3> default_idle+0x2c/0x55
 <c1002555> cpu_idle+0x59/0x6e

******************************************************************************************

 printing eip:
c10c7c6a
*pde = 6b6b6b6b
Oops: 0000 [#1]
SMP
CPU: 0
EIP: 0060:[<c10c7c6a>] Not tainted VLI
EFLAGS: 00010a93 (2.6.17-CITI_NFS4_ALL-1 #2)
EIP is at journal_invalidatepage+0x55/0x338
eax: 47006c63 ebx: 62755374 ecx: 00000002 edx: 00000002
esi: 00000001 edi: c19d6b30 ebp: 00000414 esp: db593b4c
ds: 007b es: 007b ss: 0068
Process nfsd (pid: 2875, threadinfo=db592000 task=db58ea70)
Stack: 00000000 c19d6b30 de760454 62755374 00000001 47006c63 c61a01cc c10bb3c3
       00000008 c19d6b30 00000414 c1054b5d c19d6b30 c103e567 00000414 c103e62d
       00000000 00000000 00000000 c5c1c8d4 00000000 ffffffff 0000000e 00000000
Call Trace:
 <c10bb3c3> ext3_invalidatepage+0x0/0x2d
 <c1054b5d> do_invalidatepage+0x16/0x18
 <c103e567> truncate_complete_page+0x18/0x3a
 <c103e62d> truncate_inode_pages_range+0xa4/0x266
 <c103e7f8> truncate_inode_pages+0x9/0xd
 <c10bb85d> ext3_delete_inode+0x13/0xba
 <c10bb84a> ext3_delete_inode+0x0/0xba
 <c1068dd9> generic_delete_inode+0x90/0xfc
 <c106889f> iput+0x64/0x66
 <c1068397> d_delete+0x3c/0xcb
 <c105fd12> vfs_unlink+0x96/0xb5
 <c11178b7> nfsd_unlink+0x1a2/0x1fa
 <c1122612> nfsd4_proc_compound+0xe83/0x15ad
 <c14d4a53> ipt_do_table+0x2b7/0x2e0
 <c1238050> copy_to_user+0x4a/0x5e
 <c1477c8c> memcpy_toiovec+0x27/0x4a
 <c104fcf9> cache_free_debugcheck+0x1f7/0x1ff
 <c1473c36> release_sock+0x10/0x9b
 <c14a9312> tcp_recvmsg+0x622/0x72b
 <c147397e> sock_common_recvmsg+0x2f/0x45
 <c1471eeb> sock_recvmsg+0xc9/0xe4
 <c102b967> autoremove_wake_function+0x0/0x2d
 <c1016f9a> activate_task+0x5a/0xa0
 <c10172f6> try_to_wake_up+0x316/0x320
 <c1016346> __wake_up_common+0x2f/0x53
 <c1017f95> __wake_up+0x2a/0x3d
 <c1508ce9> svc_sock_enqueue+0x1db/0x219
 <c150a185> svc_tcp_recvfrom+0x672/0x6dd
 <c152b250> _spin_unlock_irq+0x5/0x7
 <c152987f> schedule+0xa1b/0xa7c
 <c150d685> sunrpc_cache_lookup+0x4b/0xf9
 <c112498d> nfsd4_decode_compound+0x2fe/0xce1
 <c1125370> nfs4svc_decode_compoundargs+0x0/0x50
 <c1114dcf> nfsd_dispatch+0xbb/0x170
 <c15081f9> svc_process+0x3b2/0x60d
 <c11151d2> nfsd+0x190/0x2ea
 <c1115042> nfsd+0x0/0x2ea
 <c10016c5> kernel_thread_helper+0x5/0xb
Code: 84 01 03 00 00 8b 02 f6 c4 08 75 08 0f 0b 6d 07 6e 35 5a c1 8b 44 24 04 8b 40 0c c7 44 24 10 01 00 00 00 89 44 24 18 89 c3 31 c0 <8b> 53 14 01 c2 89 54 24 14 8b 53 04 39 04 24 89 54 24 0c 0f 87
EIP: [<c10c7c6a>] journal_invalidatepage+0x55/0x338 SS:ESP 0068:db593b4c
BUG: nfsd/2875, lock held at task exit time!
 [dc6aa710] {inode_init_once}
.. held by: nfsd: 2875 [db58ea70, 115]
... acquired at: nfsd_unlink+0xd0/0x1fa

----- End forwarded message -----