The system is a dual-Opteron AMD64 machine running a Debian base with a
self-compiled kernel built from stock sources.
The SATA card is a Promise SATAII300 TX4; here is the lspci entry:
0000:01:03.0 Unknown mass storage controller: Promise Technology, Inc.: Unknown device 3d17 (rev 02)
I had been tracking 2.6.16 all the way up to 2.6.16.27, which showed this
behavior. I then tried 2.6.17.13 and got the same result. The logs
below are from the attempt with 2.6.17.13.
During normal system operation, with no load, I pulled a SATA drive
from the Promise controller (in this case "ata1", which is sda) to
test whether the system can handle a drive failure.
I immediately got this in the syslog:
Oct 2 16:14:43 stock kernel: irq 16: nobody cared (try booting with the "irqpoll" option)
Oct 2 16:14:43 stock kernel:
Oct 2 16:14:43 stock kernel: Call Trace: <IRQ> <ffffffff80299dd0>{__report_bad_irq+48}
Oct 2 16:14:43 stock kernel: <ffffffff80299eec>{note_interrupt+140} <ffffffff80299584>{__do_IRQ+196}
Oct 2 16:14:43 stock kernel: <ffffffff8025f352>{do_IRQ+66} <ffffffff8025c540>{default_idle+0}
Oct 2 16:14:43 stock kernel: <ffffffff802545b8>{ret_from_intr+0} <EOI> <ffffffff80256ce0>{thread_return+0}
Oct 2 16:14:43 stock kernel: <ffffffff8025c56f>{default_idle+47} <ffffffff8024092b>{cpu_idle+107}
Oct 2 16:14:43 stock kernel: handlers:
Oct 2 16:14:43 stock kernel: [pdc_interrupt+0/512] (pdc_interrupt+0x0/0x200)
Oct 2 16:14:43 stock kernel: Disabling IRQ #16
After this, all four of the disks on the Promise controller stopped working,
presumably because the interrupt was disabled, even though the other
three should have been unaffected.
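For anyone triaging a similar syslog, here is a minimal sketch of how the
"nobody cared" report above could be picked out automatically. It assumes the
message layout shown in my log (an "irq N: nobody cared" line, a "handlers:"
line followed by the handler names, then "Disabling IRQ #N"). The function name
find_disabled_irqs is my own invention, not part of any tool.

```python
import re

# Patterns modelled on the 2.6.17 messages quoted above (an assumption:
# other kernel versions may format these lines differently).
NOBODY_CARED = re.compile(r"irq (\d+): nobody cared")
HANDLER = re.compile(r"\((\w+)\+0x[0-9a-f]+/0x[0-9a-f]+\)")
DISABLED = re.compile(r"Disabling IRQ #(\d+)")


def find_disabled_irqs(lines):
    """Return {irq: [handler names]} for each IRQ the kernel disabled."""
    pending = {}         # irq -> handlers collected from its report so far
    current = None       # IRQ whose report is currently being read
    in_handlers = False  # True once the "handlers:" line has been seen
    disabled = {}
    for line in lines:
        m = NOBODY_CARED.search(line)
        if m:
            current = int(m.group(1))
            pending[current] = []
            in_handlers = False
            continue
        m = DISABLED.search(line)
        if m:
            irq = int(m.group(1))
            disabled[irq] = pending.pop(irq, [])
            current, in_handlers = None, False
            continue
        if current is None:
            continue
        if "handlers:" in line:
            in_handlers = True
        elif in_handlers:
            m = HANDLER.search(line)
            if m:
                pending[current].append(m.group(1))
    return disabled
```

Fed the syslog lines above, this should report IRQ 16 with pdc_interrupt as its
only registered handler, which is why I suspect every port behind that handler
went quiet at once.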
This is a partial log of console errors after the IRQ was disabled (all that
I was able to grab since I couldn't access any disks anymore).
ata1: command timeout
ata1: translated ATA stat/err 0xff/00 to SCSI SK/ASC/ASCQ 0xb/47/00
ata1: status=0xff { Busy }
sd 0:0:0:0: SCSI error: return code = 0x8000002
sda: Current: sense key: Aborted Command
Additional sense: Scsi parity error
end_request: I/O error, dev sda, sector 976767923
ata4: command timeout
ata4: status=0x50 { DriveReady SeekComplete }
sdd: Current<3>ata2: command timeout
ata2: status=0x50 { DriveReady SeekComplete }
sdb: Current: sense key: No Sense
Additional sense: No additional sense information
Info fld=0x301036
: sense key: No Sense
Additional sense: No additional sense information
Info fld=0x301036
ata3: command timeout
ata3: status=0x50 { DriveReady SeekComplete }
sdc: Current: sense key: No Sense
Additional sense: No additional sense information
Info fld=0x301036
sdc: Current: sense key: No Sense
Additional sense: No additional sense information
ata4: command timeout
ata4: status=0x50 { DriveReady SeekComplete }
ata2: command timeout
ata2: status=0x50 { DriveReady SeekComplete }
ata3: command timeout
ata3: status=0x50 { DriveReady SeekComplete }
ata4: command timeout
ata4: status=0x50 { DriveReady SeekComplete }
sdd: Current<3>ata2: command timeout
ata2: status=0x50 { DriveReady SeekComplete }
sdb: Current: sense key: No Sense
Additional sense: No additional sense information
Info fld=0x917e
: sense key: No Sense
Additional sense: No additional sense information
Info fld=0x917e
ata3: command timeout
ata3: status=0x50 { DriveReady SeekComplete }
sdc: Current: sense key: No Sense
Additional sense: No additional sense information
Info fld=0x917e
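To make the status values in these messages easier to read, here is a small
sketch that decodes the ATA status byte into the flag names the kernel prints.
The bit names for 0x80, 0x40 and 0x10 come straight from the log above; the
remaining names are my assumption based on the standard ATA status register
layout. decode_ata_status is a hypothetical helper, not a kernel or library
function.

```python
# ATA status register bits, highest bit first. "Busy", "DriveReady" and
# "SeekComplete" appear in the log; the other names are assumed.
ATA_STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]


def decode_ata_status(status):
    """Return the flag names set in an 8-bit ATA status value."""
    # When Busy is set the other bits are not meaningful, so report
    # only Busy, as the log does for 0xff.
    if status & 0x80:
        return ["Busy"]
    return [name for mask, name in ATA_STATUS_BITS if status & mask]


print(decode_ata_status(0x50))
print(decode_ata_status(0xFF))
```

Read this way, 0x50 on ata2 through ata4 says those drives were ready and idle
while their commands timed out, and 0xff on ata1 is all bits set, which would be
consistent with the pulled drive no longer answering.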
Similar messages continued to be printed to the console, and any disk
accesses continued to fail. I waited about a half hour to see if the system
would ever recover, and it never did.
I'd be happy to provide any additional information that may be helpful.
Evan