Hello,

I'm trying to get a fairly standard SUSE Linux 9.2 setup working on a brand-new Dell PowerEdge 1850 with two SCSI disks in a RAID 1 setup. Installation went completely fine and everything works. But every time, after 2-3 hours of uptime and some heavy disk I/O load (an rsync of a few GB of data), it crashes badly with the following messages:

-------------------------------------------------------------------
megaraid: aborting-1164069 cmd=2a <c=1 t=0 l=0>
megaraid abort: 1164069:48[255:0], fw owner
megaraid: aborting-1164070 cmd=2a <c=1 t=0 l=0>
megaraid abort: 1164070:59[255:0], fw owner
megaraid: aborting-1164071 cmd=2a <c=1 t=0 l=0>
megaraid abort: 1164071:19[255:0], fw owner
megaraid: aborting-1164072 cmd=2a <c=1 t=0 l=0>
megaraid abort: 1164072:18[255:0], fw owner
megaraid: aborting-1164073 cmd=2a <c=1 t=0 l=0>
megaraid abort: 1164073:20[255:0], fw owner
megaraid: aborting-1164074 cmd=2a <c=1 t=0 l=0>
megaraid abort: 1164074:32[255:0], fw owner
megaraid: aborting-1164075 cmd=2a <c=1 t=0 l=0>
megaraid abort: 1164075:13[255:0], fw owner
megaraid: aborting-1164076 cmd=2a <c=1 t=0 l=0>
megaraid abort: 1164076:8[255:0], fw owner
megaraid: aborting-1164077 cmd=2a <c=1 t=0 l=0>
megaraid abort: 1164076:8[255:0], fw owner
megaraid: aborting-1164077 cmd=2a <c=1 t=0 l=0>
megaraid abort: 1164077:33[255:0], fw owner
megaraid: aborting-1164078 cmd=2a <c=1 t=0 l=0>
megaraid abort: 1164078:60[255:0], fw owner
megaraid: aborting-1164079 cmd=2a <c=1 t=0 l=0>
megaraid abort: 1164079:0[255:0], fw owner
megaraid: aborting-1164080 cmd=2a <c=1 t=0 l=0>
megaraid abort: 1164080:63[255:0], fw owner
megaraid: aborting-1164081 cmd=2a <c=1 t=0 l=0>
megaraid abort: 1164081:44[255:0], fw owner
megaraid: aborting-1164082 cmd=2a <c=1 t=0 l=0>
megaraid abort: 1164082:53[255:0], fw owner
megaraid: reseting the host...
megaraid: 14 outstanding commands. Max wait 180 sec
megaraid mbox: Wait for 14 commands to complete:180
megaraid mbox: Wait for 14 commands to complete:175
megaraid mbox: Wait for 14 commands to complete:170
megaraid mbox: Wait for 14 commands to complete:165
megaraid mbox: Wait for 14 commands to complete:160
megaraid mbox: Wait for 14 commands to complete:155
megaraid mbox: Wait for 14 commands to complete:150
megaraid mbox: Wait for 14 commands to complete:145
megaraid mbox: Wait for 14 commands to complete:140
megaraid mbox: Wait for 14 commands to complete:135
megaraid mbox: Wait for 14 commands to complete:130
megaraid mbox: Wait for 14 commands to complete:125
megaraid mbox: Wait for 14 commands to complete:120
megaraid mbox: Wait for 14 commands to complete:115
megaraid mbox: Wait for 14 commands to complete:110
megaraid mbox: Wait for 14 commands to complete:105
megaraid mbox: Wait for 14 commands to complete:100
megaraid mbox: Wait for 14 commands to complete:95
megaraid mbox: Wait for 14 commands to complete:90
megaraid mbox: Wait for 14 commands to complete:85
megaraid mbox: Wait for 14 commands to complete:80
megaraid mbox: Wait for 14 commands to complete:75
megaraid mbox: Wait for 14 commands to complete:70
megaraid mbox: Wait for 14 commands to complete:65
megaraid mbox: Wait for 14 commands to complete:60
megaraid mbox: Wait for 14 commands to complete:55
megaraid mbox: Wait for 14 commands to complete:50
megaraid mbox: Wait for 14 commands to complete:45
megaraid mbox: Wait for 14 commands to complete:40
megaraid mbox: Wait for 14 commands to complete:35
megaraid mbox: Wait for 14 commands to complete:30
megaraid mbox: Wait for 14 commands to complete:25
megaraid mbox: Wait for 14 commands to complete:20
megaraid mbox: Wait for 14 commands to complete:15
megaraid mbox: Wait for 14 commands to complete:10
megaraid mbox: Wait for 14 commands to complete:5
megaraid mbox: Wait for 14 commands to complete:0
megaraid mbox: critical hardware error!
megaraid: reseting the host...
megaraid: hw error, cannot reset
megaraid: reseting the host...
megaraid: hw error, cannot reset
scsi: Device offlined - not ready after error recovery: host 0 channel 1 id 0 lun 0
scsi: Device offlined - not ready after error recovery: host 0 channel 1 id 0 lun 0
scsi: Device offlined - not ready after error recovery: host 0 channel 1 id 0 lun 0
scsi: Device offlined - not ready after error recovery: host 0 channel 1 id 0 lun 0
scsi: Device offlined - not ready after error recovery: host 0 channel 1 id 0 lun 0
scsi: Device offlined - not ready after error recovery: host 0 channel 1 id 0 lun 0
scsi: Device offlined - not ready after error recovery: host 0 channel 1 id 0 lun 0
[...]
scsi: Device offlined - not ready after error recovery: host 0 channel 1 id 0 lun 0
SCSI error : <0 1 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 105704481
Buffer I/O error on device sda8, logical block 855051
lost page write due to I/O error on sda8
scsi0 (0:0): rejecting I/O to offline device
Buffer I/O error on device sda8, logical block 855052
lost page write due to I/O error on sda8
Buffer I/O error on device sda8, logical block 855053
lost page write due to I/O error on sda8
Buffer I/O error on device sda8, logical block 855054
lost page write due to I/O error on sda8
Buffer I/O error on device sda8, logical block 855060
lost page write due to I/O error on sda8
SCSI error : <0 1 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 105704609
scsi0 (0:0): rejecting I/O to offline device
SCSI error : <0 1 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 105704737
scsi0 (0:0): rejecting I/O to offline device
SCSI error : <0 1 0 0> return code = 0x6000000
[...]
scsi0 (0:0): rejecting I/O to offline device
SCSI error : <0 1 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 105705889
scsi0 (0:0): rejecting I/O to offline device
scsi0 (0:0): rejecting I/O to offline device
scsi0 (0:0): rejecting I/O to offline device
scsi0 (0:0): rejecting I/O to offline device
EXT3-fs error (device sda5) in ext3_reserve_inode_write: IO failure
scsi0 (0:0): rejecting I/O to offline device
scsi0 (0:0): rejecting I/O to offline device
EXT3-fs error (device sda5) in ext3_dirty_inode: IO failure
scsi0 (0:0): rejecting I/O to offline device
ext3_abort called.
EXT3-fs error (device sda5): ext3_journal_start: Detected aborted journal
Remounting filesystem read-only
[...]
-------------------------------------------------------------------

And then a complete crash: the system no longer responds. Not really nice, is it? :)

Now I'm trying to find a solution. In the meantime, if you have already seen something like this, feedback and pointers would be very welcome. Thanks!

I will try Knoppix and some *BSD, but the chances that the hardware is actually faulty seem low: after a reboot everything runs completely fine again for a few hours. A consistency check of the RAID array took about an hour and reported no problems.
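
In case it helps anyone comparing logs, here is a minimal Python sketch for tallying the abort messages from a saved console capture. It is only an illustration and not part of any megaraid tooling: the helper name summarize_aborts and the embedded SAMPLE excerpt are hypothetical, and the regular expressions assume exactly the message formats shown above. The one outside fact it relies on is that SCSI opcode 0x2a is WRITE(10), which would suggest that the aborted commands were all writes.

```python
import re

# Patterns match only the message formats seen in the log above (an assumption;
# other driver versions may print these lines differently).
ABORT_RE = re.compile(r"megaraid: aborting-(\d+) cmd=([0-9a-fA-F]+)")
OUTSTANDING_RE = re.compile(r"megaraid: (\d+) outstanding commands")

# Names for the SCSI opcodes of interest: 0x2a is WRITE(10), 0x28 is READ(10).
OPCODE_NAMES = {"2a": "WRITE(10)", "28": "READ(10)"}


def summarize_aborts(log_text):
    """Hypothetical helper: count distinct aborted commands and their opcodes."""
    serials = set()
    opcodes = set()
    for match in ABORT_RE.finditer(log_text):
        serials.add(int(match.group(1)))      # command serial number
        opcodes.add(match.group(2).lower())   # SCSI opcode in hex
    outstanding = OUTSTANDING_RE.search(log_text)
    return {
        "aborted": len(serials),
        "opcodes": sorted(OPCODE_NAMES.get(op, "0x" + op) for op in opcodes),
        "outstanding": int(outstanding.group(1)) if outstanding else None,
    }


# A short excerpt from the log above; a real run would read the full capture
# from a file instead.
SAMPLE = """\
megaraid: aborting-1164069 cmd=2a <c=1 t=0 l=0>
megaraid abort: 1164069:48[255:0], fw owner
megaraid: aborting-1164070 cmd=2a <c=1 t=0 l=0>
megaraid abort: 1164070:59[255:0], fw owner
megaraid: reseting the host...
megaraid: 14 outstanding commands. Max wait 180 sec
"""

if __name__ == "__main__":
    print(summarize_aborts(SAMPLE))
```

In the full log above, the abort serial numbers run from 1164069 to 1164082, which is 14 distinct commands and matches the "14 outstanding commands" line. Serials are collected in a set so that a line repeated in the paste is not counted twice.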

Some more info:

Loaded modules:

ext3 128744 5
jbd 76964 1 ext3
megaraid_mbox 35216 6
megaraid_mm 14752 1 megaraid_mbox
sd_mod 22144 7
scsi_mod 121412 5 sg,st,sr_mod,megaraid_mbox,sd_mod

# uname -a
Linux pe1850 2.6.8-24.10-smp #1 SMP Wed Dec 22 11:54:27 UTC 2004 i686 i686 i386 GNU/Linux

dmesg messages about the SCSI subsystem:

SCSI subsystem initialized
megaraid cmm: 2.20.2.0 (Release Date: Thu Aug 19 09:58:33 EDT 2004)
megaraid: 2.20.4.0 (Release Date: Mon Sep 27 22:15:07 EDT 2004)
megaraid: probe new device 0x1028:0x0013:0x1028:0x016c: bus 2:slot 14:func 0
ACPI: PCI interrupt 0000:02:0e.0[A] -> GSI 46 (level, low) -> IRQ 201
megaraid: fw version:[513O] bios version:[H418]
scsi0 : LSI Logic MegaRAID driver
scsi[0]: scanning scsi channel 0 [Phy 0] for non-raid devices
  Vendor: PE/PV Model: 1x2 SCSI BP Rev: 1.0
  Type: Processor ANSI SCSI revision: 02
scsi[0]: scanning scsi channel 1 [virtual] for logical drives
  Vendor: MegaRAID Model: LD 0 RAID1 69G Rev: 513O
  Type: Direct-Access ANSI SCSI revision: 02
SCSI device sda: 143114240 512-byte hdwr sectors (73274 MB)
sda: asking for cache data failed
sda: assuming drive cache: write through
 sda: sda1 sda2 sda3 sda4 < sda5 sda6 sda7 sda8 >
Attached scsi disk sda at scsi0, channel 1, id 0, lun 0

regards,
Olivier

-- 
_______________________________________________________
Olivier Müller - PGP key ID: 0x0E84D2EA - Switzerland
E-Mail: http://omx.ch/mail/ - AIM/iChat: swix3k