Hi,

> Besides the external storage (powered by megaraid, the PERC4
> and the PV220s) the machine has two internal SATA drives.
> These internal drives house the OS, the web server and the
> mail queue. The only I/O running through megaraid at the
> time of the failures has been the creation of the tar files.

If the failure is happening during disk I/O, I would like to
investigate further. It would be very helpful if you could provide
detailed steps to reproduce the issue, including how you create that
large file.

I'll also check with the F/W team to see whether an updated firmware
version is available, and will get back to you if so.

Thank you again.

> -----Original Message-----
> From: Collins, Kevin [mailto:kCollins@xxxxxxxxxxxxxxxxxxxxxx]
> Sent: Friday, January 13, 2006 11:00 AM
> To: linux-scsi@xxxxxxxxxxxxxxx; Ju, Seokmann
> Subject: RE: Megaraid problems.
>
> On Friday, January 13, 2006 9:39 AM, Seokmann Ju wrote:
>
> > Hi,
> Hey, glad to get a response! :-)
>
> > Thank you for posting details regarding megaraid.
> > From the log, the messages are OK except for the following.
> > ---
> > > 1 Time(s): [5535381.561000] megaraid: reseting the host...
> > > 1 Time(s): [5535386.566000] megaraid mbox: Wait for 2 commands to complete:175
> > >
> > > [... The above line repeats every 5 seconds, counting down to 0 ...]
> > >
> > > 1 Time(s): [5535556.736000] megaraid mbox: Wait for 2 commands to complete:5
> > > 2 Time(s): [5535562.611000] scsi2 (0:0): rejecting I/O to offline device
> > ---
> > > a). Knows about the problem and is working on it.
> > >
> > > - and, more importantly -
> > It seems that, for some reason, the controller couldn't
> > return commands (2 commands in this case) within the given
> > timeout period.
> > Because of that, the driver decided to reset the controller, and
> > as part of the reset it triggers the F/W to take the device offline.
>
> And I'm assuming that this is why my data isn't damaged or
> otherwise corrupted - which is a good thing!
> ;-)
>
> > > b). Can lead me to a fix.
> > Can you clarify what the F/W version on the controller is?
> Firmware on the controller (from /proc/scsi/scsi): 351S
> =================================================================
> Host: scsi2 Channel: 00 Id: 06 Lun: 00
>   Vendor: DELL     Model: PV22XS           Rev: E.18
>   Type:   Processor                        ANSI SCSI revision: 03
> Host: scsi2 Channel: 01 Id: 00 Lun: 00
>   Vendor: MegaRAID Model: LD 0 RAID5  858G Rev: 351S
>   Type:   Direct-Access                    ANSI SCSI revision: 02
> Host: scsi2 Channel: 01 Id: 01 Lun: 00
>   Vendor: MegaRAID Model: LD 1 RAID5  858G Rev: 351S
>   Type:   Direct-Access                    ANSI SCSI revision: 02
> =================================================================
>
> I have seen reports on Dell's mailing list that allude to the
> fact that the E18 and 351S firmwares are supposed to help this
> situation, but not in my case. My system shipped with these
> firmwares in place. Dell, to my knowledge, does not offer
> any newer versions of either firmware.
>
> > Besides disk I/O, are there other operations involved, like
> > tape R/W?
> No tape R/W, but...
>
> > How about applications? Is any application communicating
> > with MegaRAID through IOCTL at that time?
> As for other tasks, the machine also serves as a web server
> (Apache, MySQL and PHP) and e-mail relay (Postfix). The mail
> relay does more work than the web server, but even that is light.
>
> Besides the external storage (powered by megaraid, the PERC4
> and the PV220s) the machine has two internal SATA drives.
> These internal drives house the OS, the web server and the
> mail queue. The only I/O running through megaraid at the
> time of the failures has been the creation of the tar files.
>
> >
> > Thank you,
>
> You're welcome. I hope I have helped with the information
> and not hindered.
> ;-)
>
> Kevin
>
> >
> > > -----Original Message-----
> > > From: linux-scsi-owner@xxxxxxxxxxxxxxx
> > > [mailto:linux-scsi-owner@xxxxxxxxxxxxxxx] On Behalf Of Collins, Kevin
> > > Sent: Friday, January 13, 2006 9:05 AM
> > > To: linux-scsi@xxxxxxxxxxxxxxx
> > > Subject: Megaraid problems.
> > >
> > > Hi list,
> > >
> > > I have a Dell PowerEdge 850 with their PERC4sc RAID card driving a
> > > Dell PowerVault 220s external drive enclosure running Ubuntu 5.10.
> > > This machine and all the parts that make it up are less than 2 months
> > > old. In that time, I have had both logical drives supplied by the
> > > PV220s taken offline by the megaraid driver twice. The only cure for
> > > this has been a reboot of the machine. Luckily, with the exception of
> > > the process that was running at the time of the problem, nothing else
> > > was damaged or hurt; no loss of data has been experienced (yet).
> > >
> > > Both times the failure has occurred, it happened while creating a
> > > gzipped tarball of some backup data. The final tarball created is
> > > averaging about 92+ GB in size and the machine is under heavy disk I/O
> > > for more than 7 hours. I have been able to grab this information from
> > > the syslog after the failure (gathered with LogWatch):
> > >
> > > 1 Time(s): [5535381.561000] megaraid abort: 55592075:43[255:128], fw owner
> > > 1 Time(s): [5535381.561000] megaraid abort: 55592077:62[255:128], fw owner
> > > 1 Time(s): [5535381.561000] megaraid abort: 55592078[255:128], driver owner
> > > 1 Time(s): [5535381.561000] megaraid mbox: Wait for 2 commands to complete:180
> > > 1 Time(s): [5535381.561000] megaraid: 2 outstanding commands. Max wait 180 sec
> > > 1 Time(s): [5535381.561000] megaraid: aborting-55592075 cmd=28 <c=1 t=0 l=0>
> > > 1 Time(s): [5535381.561000] megaraid: aborting-55592077 cmd=28 <c=1 t=0 l=0>
> > > 1 Time(s): [5535381.561000] megaraid: aborting-55592078 cmd=28 <c=1 t=0 l=0>
> > > 1 Time(s): [5535381.561000] megaraid: reseting the host...
> > > 1 Time(s): [5535386.566000] megaraid mbox: Wait for 2 commands to complete:175
> > >
> > > [... The above line repeats every 5 seconds, counting down to 0 ...]
> > >
> > > 1 Time(s): [5535556.736000] megaraid mbox: Wait for 2 commands to complete:5
> > > 2 Time(s): [5535562.611000] scsi2 (0:0): rejecting I/O to offline device
> > >
> > > The only difference in the two instances is the number of "commands"
> > > that are waiting to complete. The snippet above is from the first
> > > instance; the second instance had 10 commands waiting.
> > >
> > > The machine is running the default Ubuntu kernel, which is their
> > > patched version of 2.6.12. In addition, both the megaraid_mbox and
> > > megaraid_mm modules are loaded.
> > > Here is the output of 'modinfo' for both of those modules:
> > >
> > > ========================================================================================
> > > megaraid_mbox
> > > ----------------------------------------------------------------------------------------
> > > filename:       /lib/modules/2.6.12-10-386/kernel/drivers/scsi/megaraid/megaraid_mbox.ko
> > > author:         LSI Logic Corporation
> > > description:    LSI Logic MegaRAID Mailbox Driver
> > > license:        GPL
> > > version:        2.20.4.5
> > > vermagic:       2.6.12-10-386 386 gcc-3.4
> > > depends:        megaraid_mm,scsi_mod
> > > alias:          pci:v00001028d0000000Esv00001028sd00000123bc*sc*i*
> > > alias:          pci:v00001000d00001960sv00001028sd00000520bc*sc*i*
> > > alias:          pci:v00001000d00001960sv00001028sd00000518bc*sc*i*
> > > alias:          pci:v00001000d00000407sv00001028sd00000531bc*sc*i*
> > > alias:          pci:v00001028d0000000Fsv00001028sd0000014Abc*sc*i*
> > > alias:          pci:v00001028d00000013sv00001028sd0000016Cbc*sc*i*
> > > alias:          pci:v00001028d00000013sv00001028sd0000016Dbc*sc*i*
> > > alias:          pci:v00001028d00000013sv00001028sd0000016Ebc*sc*i*
> > > alias:          pci:v00001028d00000013sv00001028sd0000016Fbc*sc*i*
> > > alias:          pci:v00001028d00000013sv00001028sd00000170bc*sc*i*
> > > alias:          pci:v00001000d00000408sv00001028sd00000002bc*sc*i*
> > > alias:          pci:v00001000d00000408sv00001028sd00000001bc*sc*i*
> > > alias:          pci:v0000101Ed00001960sv00001028sd00000471bc*sc*i*
> > > alias:          pci:v0000101Ed00001960sv00001028sd00000493bc*sc*i*
> > > alias:          pci:v0000101Ed00001960sv00001028sd00000475bc*sc*i*
> > > alias:          pci:v0000101Ed00001960sv0000101Esd00000475bc*sc*i*
> > > alias:          pci:v0000101Ed00001960sv0000101Esd00000493bc*sc*i*
> > > alias:          pci:v00001000d00001960sv00001000sd0000A520bc*sc*i*
> > > alias:          pci:v00001000d00001960sv00001000sd00000520bc*sc*i*
> > > alias:          pci:v00001000d00001960sv00001000sd00000518bc*sc*i*
> > > alias:          pci:v00001000d00000407sv00001000sd00000530bc*sc*i*
> > > alias:          pci:v00001000d00000407sv00001000sd00000532bc*sc*i*
> > > alias:          pci:v00001000d00000407sv00001000sd00000531bc*sc*i*
> > > alias:          pci:v00001000d00000408sv00001000sd00000001bc*sc*i*
> > > alias:          pci:v00001000d00000408sv00001000sd00000002bc*sc*i*
> > > alias:          pci:v00001000d00001960sv00001000sd00000522bc*sc*i*
> > > alias:          pci:v00001000d00001960sv00001000sd00004523bc*sc*i*
> > > alias:          pci:v00001000d00001960sv00001000sd00000523bc*sc*i*
> > > alias:          pci:v00001000d00000409sv00001000sd00003004bc*sc*i*
> > > alias:          pci:v00001000d00000409sv00001000sd00003008bc*sc*i*
> > > alias:          pci:v00001000d00000407sv00008086sd00000532bc*sc*i*
> > > alias:          pci:v00001000d00001960sv00008086sd00000523bc*sc*i*
> > > alias:          pci:v00001000d00000408sv00008086sd00000002bc*sc*i*
> > > alias:          pci:v00001000d00000407sv00008086sd00000530bc*sc*i*
> > > alias:          pci:v00001000d00000409sv00008086sd00003008bc*sc*i*
> > > alias:          pci:v00001000d00000408sv00008086sd00003431bc*sc*i*
> > > alias:          pci:v00001000d00000408sv00008086sd00003499bc*sc*i*
> > > alias:          pci:v00001000d00001960sv00008086sd00000520bc*sc*i*
> > > alias:          pci:v00001000d00000408sv00001734sd00001065bc*sc*i*
> > > alias:          pci:v00001000d00000408sv00001025sd0000004Dbc*sc*i*
> > > alias:          pci:v00001000d00000408sv00001033sd00008287bc*sc*i*
> > > srcversion:     042A4371A952248BEF860F4
> > > parm:           debug_level:Debug level for driver (default=0) (int)
> > > parm:           fast_load:Faster loading of the driver, skips physical devices! (default=0) (int)
> > > parm:           cmd_per_lun:Maximum number of commands per logical unit (default=64) (int)
> > > parm:           max_sectors:Maximum number of sectors per IO command (default=128) (int)
> > > parm:           busy_wait:Max wait for mailbox in microseconds if busy (default=10) (int)
> > > parm:           unconf_disks:Set to expose unconfigured disks to kernel (default=0) (int)
> > >
> > > ----------------------------------------------------------------------------------------
> > > megaraid_mm:
> > > ----------------------------------------------------------------------------------------
> > > filename:       /lib/modules/2.6.12-10-386/kernel/drivers/scsi/megaraid/megaraid_mm.ko
> > > author:         LSI Logic Corporation
> > > description:    LSI Logic Management Module
> > > license:        GPL
> > > version:        2.20.2.5
> > > vermagic:       2.6.12-10-386 386 gcc-3.4
> > > depends:
> > > srcversion:     D2DA33EA7F3FEA9EBE4A603
> > > parm:           dlevel:Debug level (default=0) (int)
> > > ========================================================================================
> > >
> > > I have contacted Dell - via their linux-poweredge mailing list - and
> > > have discovered that I am not the only one experiencing these
> > > problems. What bothers me is that this problem has apparently been
> > > around a while and no fix has yet been discovered by Dell or
> > > anyone else.
> > >
> > > My research also leads me to believe that this is not just an Ubuntu
> > > thing either. I have reports that this happens under Redhat, Debian
> > > and SuSE. It also appears as though the problem started happening
> > > around kernel version 2.6.9.
> > >
> > > So, I'm hoping that someone here:
> > >
> > > a). Knows about the problem and is working on it.
> > >
> > > - and, more importantly -
> > >
> > > b). Can lead me to a fix.
> > >
> > > My machine is in production and I do not have any additional hardware
> > > to test with, but I can do limited testing with it as long as it is
> > > completely functional by 8:00 pm Eastern time. I'm using it as an
> > > offsite backup machine and that's when my backup processes kick in.
> > > If more information is needed, let me know how to get it, and I'll
> > > supply it.
> > >
> > > I need to get this solved ASAP.
> > >
> > > Thanks in advance,
> > >
> > > --
> > > Kevin L. Collins, MCSE
> > > Systems Manager
> > > Nesbitt Engineering, Inc.
> > > -
> > > To unsubscribe from this list: send the line "unsubscribe linux-scsi"
> > > in the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > More majordomo info at http://vger.kernel.org/majordomo-info.html