I've had some disk problems on a server that is generally very stable. I thought all the lockups and data corruption I was seeing were the drive's fault, but after restoring corrupted file systems from backup twice this week, it dawned on me that there appears to be a correlation between "smartctl -t long /dev/hdX" and the kernel freezing. The latest crash made it really clear that there's some problem on the kernel side. Hand-transcribed from a photo I took of the console after the crash:

Call Trace:
 [<b0248228>] ata_output_data+0x4d/0x64
 [<b024a60f>] ide_pio_sector+0xcd/0x102
 [<b024af35>] ide_pio_datablock+0x46/0x5c
 [<b024b160>] pre_task_out_intr+0x9a/0xa5
 [<b0246812>] ide_do_request+0x52b/0x6e0
 [<b01cb5c3>] __generic_unplug_device+0x1d/0x1f
 [<b01cbe7b>] generic_unplug_device+0x6/0x8
 [<b0263252>] unplug_slaves+0x4b/0x7a
 [<b02650c8>] raid1d+0xa51/0xac0
 [<b026f123>] md_thread+0xd6/0xef
 [<b0120db7>] kthread+0x1d/0xda
 [<b0100b3d>] kernel_thread_helper+0x5/0xb
DWARF2 unwinder stuck at kernel_thread_helper+0x5/0xb
Leftover inexact backtrace:
Code: c3 89 c2 ed c3 57 89 d7 89 c2 f3 6d 5f c3 89 d0 89 ca ee c3 0f b7 c0 66 ef c3 56 89 d6 89 c2 f3 66 6f 5e c3 ef c3 56 89 d6 89 c2 <f3> 6f 5e c3 c7 80 00 05 00 00 45 82 24 b0 c7 80 04 05 00 00 52
EIP: [<b024726f>] ide_outsl+0x5/0x9 SS:ESP 0068:eff4ddf0
<1>BUG: unable to handle kernel paging request at virtual address 30000200
 printing eip:
b024726f
*pde = 00000000
Oops: 0000 [#2]
CPU:    0
EIP:    0060:[<b024726f>]    Not tainted VLI
EFLAGS: 00010246   (2.6.18 #4)
EIP is at ide_outsl+0x5/0x9
eax: 0000b000   ebx: b0444d18   ecx: 00000080   edx: 0000b000
esi: 30000200   edi: b0444d18   ebp: 00000080   esp: b0421f3c
ds: 007b   es: 007b   ss: 0068
Process klogd (pid: 1146, ti=b0421000 task=eff08ad0 task.ti=b1a2b000)
Stack: b0444dac b0248228 30000200 b0444d18 30000200 b0444dac b0444d18 b024a60f
       00000020 00000200 00000001 0000000f b0444dac 00000001 b0444d18 b024af35
       ffffffff b0444dac c2baef38 b024afc5 b190ef04 b0444dac 00000286 b190eee0
Call Trace:
 [<b0248228>] ata_output_data+0x4d/0x64
 [<b024a60f>] ide_pio_sector+0xcd/0x102
 [<b024af35>] ide_pio_datablock+0x46/0x5c
 [<b024afc5>] task_out_intr+0x7a/0x9c
 [<b02471e1>] ide_intr+0x13d/0x188
 [<b012795e>] handle_IRQ_event+0x23/0x49
 [<b01279e2>] __do_IRQ+0x5e/0xa4
 [<b0142ca2>] do_IRQ+0x91/0xaf
Code: c3 89 c2 ed c3 57 89 d7 89 c2 f3 6d 5f c3 89 d0 89 ca ee c3 0f b7 c0 66 ef c3 56 89 d6 89 c2 f3 66 6f 5e c3 ef c3 56 89 d6 89 c2 <f3> 6f 5e c3 c7 80 00 05 00 00 45 82 24 b0 c7 80 04 05 00 00 52
EIP: [<b024726f>] ide_outsl+0x5/0x9 SS:ESP 0068:b0421f3c
<0>Kernel panic - not syncing: Fatal exception in interrupt

This is an old 440BX motherboard that has been in continuous, reliable service: 1 GB of ECC RAM, all partitions mirrored, very well cooled, a good quality power supply and UPS, no recent hardware changes of any sort, etc. The active drives are on PDC20268 PCI controllers, one per channel. But it appears that if I try to run SMART self-tests while the system is up (which I have distant memories of being able to do with impunity), the system quickly locks up with disk corruption. It usually just reports lost interrupts, which I thought were the drive's fault and which the IDE driver wasn't coping with too gracefully, but then I got the above panic, and that goes beyond "ungraceful". One of the drives *did* have a couple of bad blocks at the time; it's possible that the code path through RAID-1 recovery is somehow involved.
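For reference, the self-test sequence is nothing exotic; it looks roughly like the following, with /dev/hda just standing in for whichever /dev/hdX is being tested:

    # start the long offline self-test (it runs inside the drive's
    # firmware, but still puts SMART command traffic on the channel)
    smartctl -t long /dev/hda

    # check the drive's estimated completion time and poll the results
    smartctl -c /dev/hda
    smartctl -l selftest /dev/hda

    # abort the self-test once the kernel starts logging lost interrupts
    smartctl -X /dev/hda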
Since this is actually an important server, I have to schedule time to reproduce it, and I'm not very eager to try unless I can manage it in a read-only mode. Recovering corrupted file systems twice in one week is Not Fun, especially when the first was so bad it uncovered a bug in e2fsck. Well... I did just install a couple of big new drives (ordered when I thought this was purely a drive problem), so I can play with them. Perhaps I can image off the file systems that are at risk and then reproduce it. Does anyone have any particular ideas to investigate other than "git bisect drivers/ide drivers/md" (sketched below)? Thanks!
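P.S. For concreteness, the bisect I have in mind would look roughly like this; the "good" kernel and the device/mount names are placeholders, since I haven't yet confirmed a last-known-good kernel and the scratch space would live on the new drives:

    # image the at-risk file systems onto the new drives first
    dd if=/dev/md0 of=/mnt/scratch/md0.img bs=1M conv=noerror

    # then bisect, limiting the suspect commits to the IDE and MD drivers
    git bisect start -- drivers/ide drivers/md
    git bisect bad v2.6.18        # the kernel that panicked
    git bisect good v2.6.16       # placeholder: not yet verified good
    # build and boot each candidate, run "smartctl -t long" against a
    # scratch RAID-1 array, then mark the result:
    git bisect good               # or: git bisect bad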