SATA failure, 2.6.16.16

Linux e7 2.6.16.16 #1 PREEMPT Thu May 11 23:15:49 EST 2006 i686 GNU/Linux

Had a massive SATA failure: a 5-disk raid5 failed and went into an endless
loop of errors. When I remounted the fs read-only the raid finally shut down.
I then managed to reboot cleanly. No problem bringing up the raid (the
kicked-out disk was rebuilt). Data looks OK so far.

I must say that this setup had been running fine for a few months, and I really
hoped that these SATA problems were behind me (it only happens to others...).

I should mention that I noticed many stuck processes (naturally) and some were
'smartctl', which may be relevant?
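
For anyone trying to reproduce this, here is a rough sketch of one way to list stuck processes. It assumes that "stuck" means uninterruptible sleep (state 'D' in /proc/<pid>/stat), which is my reading and not something verified on this machine. The function names parse_stat and blocked_processes are made up for illustration.

```python
import os


def parse_stat(text):
    """Split the start of a /proc/<pid>/stat record into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the last ')' rather than splitting on whitespace.
    open_paren = text.index("(")
    close_paren = text.rindex(")")
    pid = int(text[:open_paren])
    comm = text[open_paren + 1:close_paren]
    state = text[close_paren + 1:].split()[0]
    return pid, comm, state


def blocked_processes(proc_root="/proc"):
    """Return a sorted list of (pid, comm) for processes in state 'D'.

    Assumption: state 'D' (uninterruptible sleep) is what "stuck" means here.
    """
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the record was unreadable
        if state == "D":
            blocked.append((pid, comm))
    return sorted(blocked)
```

Calling blocked_processes() while the errors are looping should show whether smartctl and the ata/N worker threads are among the blocked tasks.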

Saw this on all consoles/xterms. The first line points to what probably
started it all.

eyal kernel: [6016706.665000] Disabling IRQ #21
eyal kernel: [6016857.113000] Oops: 0002 [#1]
eyal kernel: [6016857.113000] PREEMPT
eyal kernel: [6016857.113000] CPU:    0
eyal kernel: [6016857.113000] EIP is at ata_pio_poll+0x8c/0xfa [libata]
eyal kernel: [6016857.113000] eax: 66a1f3e7   ebx: f7484280   ecx: f7d5e570   edx: f74a1000
eyal kernel: [6016857.113000] esi: 00000000   edi: 00000004   ebp: 00000002   esp: f74a1ee4
eyal kernel: [6016857.113000] ds: 007b   es: 007b   ss: 0068
eyal kernel: [6016857.113000] Process ata/0 (pid: 2551, threadinfo=f74a1000 task=f7d5e570)
eyal kernel: [6016857.113000] Stack: <0>f7484280 f89cafb1 f89cad09 f89c9dd7 00000b51 d5a7b030 f7484280 00000000
eyal kernel: [6016857.113000]        f74a1000 f7434240 f89c58f0 f7484280 f7484830 f748482c c0128a89 f7484280
eyal kernel: [6016857.113000]        00000000 d5a7b030 f7434258 f7434248 f7484280 f89c58ab 00000282 f7434250
eyal kernel: [6016857.113000] Call Trace:
eyal kernel: [6016857.113000]  [pg0+945510640/1069474816] ata_pio_task+0x45/0x6e [libata]
eyal kernel: [6016857.113000]  [run_workqueue+138/288] run_workqueue+0x8a/0x120
eyal kernel: [6016857.113000]  [pg0+945510571/1069474816] ata_pio_task+0x0/0x6e [libata]
eyal kernel: [6016857.113000]  [worker_thread+320/354] worker_thread+0x140/0x162
eyal kernel: [6016857.113000]  [default_wake_function+0/18] default_wake_function+0x0/0x12
eyal kernel: [6016857.113000]  [default_wake_function+0/18] default_wake_function+0x0/0x12
eyal kernel: [6016857.113000]  [worker_thread+0/354] worker_thread+0x0/0x162
eyal kernel: [6016857.113000]  [kthread+177/183] kthread+0xb1/0xb7
eyal kernel: [6016857.113000]  [kthread+0/183] kthread+0x0/0xb7
eyal kernel: [6016857.113000]  [kernel_thread_helper+5/11] kernel_thread_helper+0x5/0xb
eyal kernel: [6016857.113000] Code: 78 1c 89 bb dc 05 00 00 31 c0 8b 5c 24 18 8b 74 24 1c 8b 7c 24 20 8b 6c 24 24 83 c4 28 c3 a1 58 91 30 c0 39 83 e0 05 00 00 79 13 <83> 8e 8c 00 00 00 04 c7 83 dc 05 00 00 03 00 00 00 eb ca b8 10


This is in 'messages':

May 15 21:46:14 eyal kernel: [6016706.665000]  [__report_bad_irq+42/143] __report_bad_irq+0x2a/0x8f
May 15 21:46:14 eyal kernel: [6016706.665000]  [handle_IRQ_event+46/100] handle_IRQ_event+0x2e/0x64
May 15 21:46:14 eyal kernel: [6016706.665000]  [note_interrupt+124/223] note_interrupt+0x7c/0xdf
May 15 21:46:14 eyal kernel: [6016706.665000]  [__do_IRQ+218/241] __do_IRQ+0xda/0xf1
May 15 21:46:14 eyal kernel: [6016706.665000]  [do_IRQ+62/94] do_IRQ+0x3e/0x5e
May 15 21:46:14 eyal kernel: [6016706.665000]  =======================
May 15 21:46:14 eyal kernel: [6016706.665000]  [common_interrupt+26/32] common_interrupt+0x1a/0x20
May 15 21:46:14 eyal kernel: [6016706.665000]  [get_offset_pmtmr+16/83] get_offset_pmtmr+0x10/0x53
May 15 21:46:14 eyal kernel: [6016706.665000]  [do_gettimeofday+24/174] do_gettimeofday+0x18/0xae
May 15 21:46:14 eyal kernel: [6016706.665000]  [getnstimeofday+23/50] getnstimeofday+0x17/0x32
May 15 21:46:14 eyal kernel: [6016706.665000]  [do_gettimeofday+24/174] do_gettimeofday+0x18/0xae
May 15 21:46:14 eyal kernel: [6016706.665000]  [ktime_get_ts+27/90] ktime_get_ts+0x1b/0x5a
May 15 21:46:14 eyal kernel: [6016706.665000]  [getnstimeofday+23/50] getnstimeofday+0x17/0x32
May 15 21:46:14 eyal kernel: [6016706.665000]  [ktime_get+27/75] ktime_get+0x1b/0x4b
May 15 21:46:14 eyal kernel: [6016706.665000]  [hrtimer_run_queues+40/289] hrtimer_run_queues+0x28/0x121
May 15 21:46:14 eyal kernel: [6016706.665000]  [run_timer_softirq+12/507] run_timer_softirq+0xc/0x1fb
May 15 21:46:14 eyal kernel: [6016706.665000]  [__do_softirq+126/138] __do_softirq+0x7e/0x8a
May 15 21:46:14 eyal kernel: [6016706.665000]  [do_softirq+65/80] do_softirq+0x41/0x50
May 15 21:46:14 eyal kernel: [6016706.665000]  =======================
May 15 21:46:14 eyal kernel: [6016706.665000]  [irq_exit+54/56] irq_exit+0x36/0x38
May 15 21:46:14 eyal kernel: [6016706.665000]  [do_IRQ+69/94] do_IRQ+0x45/0x5e
May 15 21:46:14 eyal kernel: [6016706.665000]  [common_interrupt+26/32] common_interrupt+0x1a/0x20

May 15 21:48:44 eyal kernel: [6016857.113000] Modules linked in: isofs zlib_inflate nls_iso8859_1 cifs nvidia tsdev loop psmouse v4l1_compat dvb_bt8xx nxt6000 mt352 dvb_pll sp887x dst_ca dst dvb_core cx24110 or51211 lgdt330x i810_audio ac97_codec rtc it87 hwmon_vid hwmon eeprom i2c_isa i2c_i801 raid5 xor eth1394 ide_cd cdrom ns558 gameport snd_mpu401 snd_mpu401_uart snd_rawmidi snd_seq_device parport_pc parport ohci_hcd ohci1394 ieee1394 dc395x bt878 bttv tuner video_buf firmware_class compat_ioctl32 i2c_algo_bit v4l2_common btcx_risc ir_common tveeprom videodev sata_promise e1000 snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_pcm snd_timer snd snd_page_alloc soundcore i2c_core ata_piix libata ehci_hcd uhci_hcd usbcore shpchp pci_hotplug intel_agp agpgart ext3 jbd nls_cp437 msdos fat sd_mod scsi_mod md_mod dm_mod unix
May 15 21:48:44 eyal kernel: [6016857.113000] EIP:    0060:[pg0+945508257/1069474816]    Tainted: P      VLI
	>>> tainted by the binary nvidia driver
May 15 21:48:44 eyal kernel: [6016857.113000] EFLAGS: 00010283   (2.6.16.11 #1)

Followed by endless repeats like this:

May 15 21:50:33 eyal kernel: [6016966.144000] ata3: status=0xff { Busy }
May 15 21:50:33 eyal kernel: [6016966.207000] ATA: abnormal status 0xFF on port 0xF899A21C
May 15 21:50:33 eyal last message repeated 2 times
May 15 21:50:33 eyal kernel: [6016966.732000] ATA: abnormal status 0xFF on port 0xF899A29C
May 15 21:50:33 eyal kernel: [6016966.732000] ata4: status=0xff { Busy }
May 15 21:50:33 eyal kernel: [6016966.732000] sd 3:0:0:0: SCSI error: return code = 0x8000002
May 15 21:50:33 eyal kernel: [6016966.732000] sdd: Current: sense key: Aborted Command
May 15 21:50:33 eyal kernel: [6016966.732000]     Additional sense: Scsi parity error
May 15 21:50:33 eyal kernel: [6016966.732000] end_request: I/O error, dev sdd, sector 428951623
May 15 21:50:33 eyal kernel: [6016966.871000] ATA: abnormal status 0xFF on port 0xF899A31C
May 15 21:50:33 eyal kernel: [6016966.871000] ata5: status=0xff { Busy }
May 15 21:50:33 eyal kernel: [6016966.871000] sd 4:0:0:0: SCSI error: return code = 0x8000002
May 15 21:50:33 eyal kernel: [6016966.871000] sde: Current: sense key: Aborted Command
May 15 21:50:33 eyal kernel: [6016966.871000]     Additional sense: Scsi parity error
May 15 21:50:33 eyal kernel: [6016966.871000] end_request: I/O error, dev sde, sector 428954767
May 15 21:50:42 eyal kernel: [6016975.874000] ATA: abnormal status 0x58 on port 0xC007
May 15 21:50:42 eyal last message repeated 2 times
May 15 21:50:43 eyal kernel: [6016976.207000] ata3: status=0xff { Busy }
May 15 21:51:02 eyal kernel: [6016995.798000] ATA: abnormal status 0x58 on port 0xC007
May 15 21:51:02 eyal last message repeated 3 times
May 15 21:51:03 eyal kernel: [6016996.871000] ATA: abnormal status 0xFF on port 0xF899A29C
May 15 21:51:03 eyal kernel: [6016996.871000] ata4: status=0xff { Busy }
May 15 21:51:03 eyal kernel: [6016996.871000] sd 3:0:0:0: SCSI error: return code = 0x8000002
May 15 21:51:03 eyal kernel: [6016996.871000] sdd: Current: sense key: Aborted Command
May 15 21:51:03 eyal kernel: [6016996.871000]     Additional sense: Scsi parity error
May 15 21:51:03 eyal kernel: [6016996.871000] end_request: I/O error, dev sdd, sector 428951631
May 15 21:51:04 eyal kernel: [6016997.036000] ATA: abnormal status 0xFF on port 0xF899A31C
May 15 21:51:04 eyal kernel: [6016997.036000] ata5: status=0xff { Busy }
May 15 21:51:04 eyal kernel: [6016997.036000] sd 4:0:0:0: SCSI error: return code = 0x8000002
May 15 21:51:04 eyal kernel: [6016997.036000] sde: Current: sense key: Aborted Command
May 15 21:51:04 eyal kernel: [6016997.036000]     Additional sense: Scsi parity error
May 15 21:51:04 eyal kernel: [6016997.036000] end_request: I/O error, dev sde, sector 428954775
May 15 21:51:12 eyal kernel: [6017005.878000] ATA: abnormal status 0x58 on port 0xC807
May 15 21:51:12 eyal last message repeated 2 times
May 15 21:51:22 eyal kernel: [6017015.975000] ATA: abnormal status 0xFF on port 0xF899A21C
May 15 21:51:22 eyal last message repeated 2 times
May 15 21:51:32 eyal kernel: [6017025.975000] ata3: status=0xff { Busy }
May 15 21:51:32 eyal kernel: [6017026.038000] ATA: abnormal status 0xFF on port 0xF899A21C
May 15 21:51:32 eyal last message repeated 2 times
May 15 21:51:33 eyal kernel: [6017027.010000] ATA: abnormal status 0xFF on port 0xF899A29C
May 15 21:51:33 eyal kernel: [6017027.010000] ata4: status=0xff { Busy }
May 15 21:51:33 eyal kernel: [6017027.010000] sd 3:0:0:0: SCSI error: return code = 0x8000002
May 15 21:51:33 eyal kernel: [6017027.010000] sdd: Current: sense key: Aborted Command
May 15 21:51:33 eyal kernel: [6017027.010000]     Additional sense: Scsi parity error
May 15 21:51:33 eyal kernel: [6017027.010000] end_request: I/O error, dev sdd, sector 428951639
May 15 21:51:34 eyal kernel: [6017027.175000] ATA: abnormal status 0xFF on port 0xF899A31C
May 15 21:51:34 eyal kernel: [6017027.175000] ata5: status=0xff { Busy }
May 15 21:51:34 eyal kernel: [6017027.175000] sd 4:0:0:0: SCSI error: return code = 0x8000002
May 15 21:51:34 eyal kernel: [6017027.175000] sde: Current: sense key: Aborted Command
May 15 21:51:34 eyal kernel: [6017027.175000]     Additional sense: Scsi parity error
May 15 21:51:34 eyal kernel: [6017027.175000] end_request: I/O error, dev sde, sector 428954783
May 15 21:51:43 eyal kernel: [6017036.038000] ata3: status=0xff { Busy }
May 15 21:52:02 eyal kernel: [6017055.815000] ATA: abnormal status 0x58 on port 0xC007
May 15 21:52:02 eyal last message repeated 2 times
May 15 21:52:04 eyal kernel: [6017057.149000] ATA: abnormal status 0xFF on port 0xF899A29C
May 15 21:52:04 eyal kernel: [6017057.149000] ata4: status=0xff { Busy }
May 15 21:52:04 eyal kernel: [6017057.149000] sd 3:0:0:0: SCSI error: return code = 0x8000002
May 15 21:52:04 eyal kernel: [6017057.149000] sdd: Current: sense key: Aborted Command
May 15 21:52:04 eyal kernel: [6017057.149000]     Additional sense: Scsi parity error
May 15 21:52:04 eyal kernel: [6017057.149000] end_request: I/O error, dev sdd, sector 428951647
May 15 21:52:04 eyal kernel: [6017057.314000] ATA: abnormal status 0xFF on port 0xF899A31C

All the errors are on ata[345] which are on a Promise SATA II 150 TX4.
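
A quick way to check a claim like this against a full log is to count the error lines per port and per disk. The sketch below is only illustrative: the function name tally_errors is made up, the two patterns cover just the "ataN: status=" and "end_request: I/O error" lines quoted above, and the sample lines are copied from this report.

```python
import re
from collections import Counter

# libata status lines, e.g. "ata3: status=0xff { Busy }"
ATA_STATUS = re.compile(r"\b(ata\d+): status=0x[0-9a-fA-F]+")
# block-layer failures, e.g. "end_request: I/O error, dev sdd, sector 428951623"
IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector \d+")


def tally_errors(lines):
    """Return (status errors per ata port, I/O errors per block device)."""
    ports, devices = Counter(), Counter()
    for line in lines:
        m = ATA_STATUS.search(line)
        if m:
            ports[m.group(1)] += 1
        m = IO_ERROR.search(line)
        if m:
            devices[m.group(1)] += 1
    return ports, devices


# A few lines taken from the log above, used here as sample input.
sample = [
    "May 15 21:50:33 eyal kernel: [6016966.144000] ata3: status=0xff { Busy }",
    "May 15 21:50:33 eyal kernel: [6016966.732000] ata4: status=0xff { Busy }",
    "May 15 21:50:33 eyal kernel: [6016966.732000] end_request: I/O error, dev sdd, sector 428951623",
    "May 15 21:50:33 eyal kernel: [6016966.871000] ata5: status=0xff { Busy }",
    "May 15 21:50:33 eyal kernel: [6016966.871000] end_request: I/O error, dev sde, sector 428954767",
]
ports, devices = tally_errors(sample)
print(sorted(ports.items()))
print(sorted(devices.items()))
```

To run it over the real log, pass an open file object instead of the sample list, e.g. tally_errors(open("/var/log/messages")); the path is an assumption and depends on the syslog setup.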

$ lspci
0000:00:00.0 Host bridge: Intel Corp. 82865G/PE/P DRAM Controller/Host-Hub Interface (rev 02)
0000:00:01.0 PCI bridge: Intel Corp. 82865G/PE/P PCI to AGP Controller (rev 02)
0000:00:03.0 PCI bridge: Intel Corp. 82865G/PE/P PCI to CSA Bridge (rev 02)
0000:00:1d.0 USB Controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) USB UHCI #1 (rev 02)
0000:00:1d.1 USB Controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) USB UHCI #2 (rev 02)
0000:00:1d.2 USB Controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) USB UHCI #3 (rev 02)
0000:00:1d.3 USB Controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) USB UHCI #4 (rev 02)
0000:00:1d.7 USB Controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) USB2 EHCI Controller (rev 02)
0000:00:1e.0 PCI bridge: Intel Corp. 82801 PCI Bridge (rev c2)
0000:00:1f.0 ISA bridge: Intel Corp. 82801EB/ER (ICH5/ICH5R) LPC Bridge (rev 02)
0000:00:1f.1 IDE interface: Intel Corp. 82801EB/ER (ICH5/ICH5R) Ultra ATA 100 Storage Controller (rev 02)
0000:00:1f.2 IDE interface: Intel Corp. 82801EB (ICH5) Serial ATA 150 Storage Controller (rev 02)
0000:00:1f.3 SMBus: Intel Corp. 82801EB/ER (ICH5/ICH5R) SMBus Controller (rev 02)
0000:00:1f.5 Multimedia audio controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) AC'97 Audio Controller (rev 02)
0000:01:00.0 VGA compatible controller: nVidia Corporation NV31 [GeForce FX 5600XT] (rev a1)
0000:02:01.0 Ethernet controller: Intel Corp. 82547EI Gigabit Ethernet Controller (LOM)
0000:03:01.0 Unknown mass storage controller: Promise Technology, Inc.: Unknown device 3d18 (rev 02)
0000:03:02.0 Multimedia video controller: Brooktree Corporation Bt878 Video Capture (rev 11)
0000:03:02.1 Multimedia controller: Brooktree Corporation Bt878 Audio Capture (rev 11)
0000:03:03.0 Multimedia video controller: Brooktree Corporation Bt878 Video Capture (rev 11)
0000:03:03.1 Multimedia controller: Brooktree Corporation Bt878 Audio Capture (rev 11)
0000:03:04.0 SCSI storage controller: Tekram Technology Co.,Ltd. TRM-S1040 (rev 01)
0000:03:05.0 FireWire (IEEE 1394): Texas Instruments TSB43AB23 IEEE-1394a-2000 Controller (PHY/Link)

-- 
Eyal Lebedinsky (eyal@xxxxxxxxxxxxxx) <http://samba.org/eyal/>
	attach .zip as .dat