Hi, Tejun,

I have put some notes on the issue and on how our application uses the disks below:

1. One blade server contains 4 SATA disks (all Hitachi, 250G).
2. The application accesses the disk partitions in RAW mode under 24x7 load.
3. The application mostly reads and writes in 512 Kbyte units.
4. Temporary kernel hangs occur frequently, 1~3 times per day, and on the worst blades more than 5 times a day. A hang sometimes lasts several seconds; the longest we have observed is 5 minutes, after which the kernel resumed as if by magic.

We have a further guess about why the hang duration varies, explained below:

*** In a single disk read operation (as said above, 512 Kbytes in one read(...) syscall), if this 512 Kbyte range contains two bad sectors,
*** the farther apart these two bad sectors are, the longer the kernel hang lasts, and vice versa.

The following logs illustrate the guess.

Blade 1:
May 25 15:55:06 shctc-xq-ems22-me18 kernel: ata3: translated ATA stat/err 0x25/00 to SCSI SK/ASC/ASCQ 0x4/00/00
May 25 15:55:06 shctc-xq-ems22-me18 kernel: ata3: status=0x25 { DeviceFault CorrectedError Error }
May 25 15:55:06 shctc-xq-ems22-me18 kernel: SCSI error : <2 0 0 0> return code = 0x8000002
May 25 15:55:06 shctc-xq-ems22-me18 kernel: sdc: Current: sense key: Hardware Error
May 25 15:55:06 shctc-xq-ems22-me18 kernel: Additional sense: No additional sense information
May 25 15:55:06 shctc-xq-ems22-me18 kernel: end_request: I/O error, dev sdc, sector 276495150 <<--- First bad sector encountered
---------- Kernel hang period ---------
May 25 15:59:59 shctc-xq-ems22-me18 kernel: end_request: I/O error, dev sdc, sector 276496053
May 25 15:59:59 shctc-xq-ems22-me18 kernel: ata3: translated ATA stat/err 0x25/00 to SCSI SK/ASC/ASCQ 0x4/00/00
May 25 15:59:59 shctc-xq-ems22-me18 kernel: ata3: status=0x25 { DeviceFault CorrectedError Error }
May 25 15:59:59 shctc-xq-ems22-me18 kernel: SCSI error : <2 0 0 0> return code = 0x8000002
May 25 15:59:59 shctc-xq-ems22-me18 kernel: sdc: Current: sense key: Hardware Error
May 25 15:59:59 shctc-xq-ems22-me18 kernel: Additional sense: No additional sense information
May 25 15:59:59 shctc-xq-ems22-me18 kernel: end_request: I/O error, dev sdc, sector 276496061 <<--- Second bad sector encountered

The two bad sectors were within one single read(...512Kbytes) syscall. They are 911 sectors apart (276495150 to 276496061), and the kernel hung for nearly 5 minutes.

Blade 2:
Jun 2 10:54:06 shctc-xm-ems21-me18 kernel: ata2: translated ATA stat/err 0x25/00 to SCSI SK/ASC/ASCQ 0x4/00/00
Jun 2 10:54:06 shctc-xm-ems21-me18 kernel: ata2: status=0x25 { DeviceFault CorrectedError Error }
Jun 2 10:54:06 shctc-xm-ems21-me18 kernel: SCSI error : <1 0 0 0> return code = 0x8000002
Jun 2 10:54:06 shctc-xm-ems21-me18 kernel: sdb: Current: sense key: Hardware Error
Jun 2 10:54:06 shctc-xm-ems21-me18 kernel: Additional sense: No additional sense information
Jun 2 10:54:06 shctc-xm-ems21-me18 kernel: end_request: I/O error, dev sdb, sector 183410550
Jun 2 10:54:08 shctc-xm-ems21-me18 kernel: ata2: translated ATA stat/err 0x25/00 to SCSI SK/ASC/ASCQ 0x4/00/00
Jun 2 10:54:08 shctc-xm-ems21-me18 kernel: ata2: status=0x25 { DeviceFault CorrectedError Error }
Jun 2 10:54:08 shctc-xm-ems21-me18 kernel: SCSI error : <1 0 0 0> return code = 0x8000002
Jun 2 10:54:08 shctc-xm-ems21-me18 kernel: sdb: Current: sense key: Hardware Error
Jun 2 10:54:08 shctc-xm-ems21-me18 kernel: Additional sense: No additional sense information
Jun 2 10:54:08 shctc-xm-ems21-me18 kernel: end_request: I/O error, dev sdb, sector 183410557 <<--- First bad sector encountered
Jun 2 10:54:10 shctc-xm-ems21-me18 kernel: ata2: translated ATA stat/err 0x25/00 to SCSI SK/ASC/ASCQ 0x4/00/00
Jun 2 10:54:18 shctc-xm-ems21-me18 kernel: ata2: status=0x25 { DeviceFault CorrectedError Error }
Jun 2 10:54:18 shctc-xm-ems21-me18 kernel: SCSI error : <1 0 0 0> return code = 0x8000002
Jun 2 10:54:18 shctc-xm-ems21-me18 kernel: sdb: Current: sense key: Hardware Error
Jun 2 10:54:18 shctc-xm-ems21-me18 kernel: Additional sense: No additional sense information
Jun 2 10:54:18 shctc-xm-ems21-me18 kernel: end_request: I/O error, dev sdb, sector 183410565 <<--- Second bad sector encountered
Jun 2 10:54:18 shctc-xm-ems21-me18 kernel: [RSS-RAW] Failed to read data to buf 0xa036a200 at 0x1383335400 of size 524288 from raw 2 in 14398

These two sectors were also within one single read(...512Kbytes) syscall, but with only 8 sectors in between (183410557 to 183410565), the hang lasted about 10 seconds.

5. Attached are the smartctl and hdparm output for sda and sdc, plus the syslog. Both drives are from the same blade. Regretfully, the attached syslog was mistakenly overwritten, so you can't find the messages shown above in it. But both drives did hit the issue and printed exactly the same messages to syslog as above (except that the sector numbers differ). Please refer to the syslog excerpts above.
6. Kernel version: 2.6.13.2 (gcc version 4.0.0 20050519 (Red Hat 4.0.0-8)).

Thanks in advance for your interest, we really look forward to fixing this issue.

Regards
Simon

On Mon, Jun 28, 2010 at 12:09 AM, Tejun Heo <htejun@xxxxxxxxx> wrote:
>
> Hello,
>
> Please cc linux-ide@xxxxxxxxxxxxxxx when you reply.
>
> On 06/27/2010 05:58 PM, Simon Li wrote:
> > Hi, Jun,
>
> My first name happens to be Tejun. :-)
>
> > ===== First time we observed kernel hang =======
> > May 25 15:55:06 shctc-xq-ems22-me18 kernel: ata3: translated ATA
> > stat/err 0x25/00 to SCSI SK/ASC/ASCQ 0x4/00/00
> > May 25 15:55:06 shctc-xq-ems22-me18 kernel: ata3: status=0x25 {
> > DeviceFault CorrectedError Error }
> > May 25 15:55:06 shctc-xq-ems22-me18 kernel: SCSI error : <2 0 0 0>
> > return code = 0x8000002
> > May 25 15:55:06 shctc-xq-ems22-me18 kernel: sdc: Current: sense key:
> > Hardware Error
> > May 25 15:55:06 shctc-xq-ems22-me18 kernel: Additional sense: No
> > additional sense information
> > May 25 15:55:06 shctc-xq-ems22-me18 kernel: end_request: I/O error, dev
> > sdc, sector 276495150
>
> Looks like failing hard disk to me. Can you please do the followings
> when you reply?
>
> * Please compose in plain text. No html.
>
> * Attach full kernel log including the boot and error messages.
>   Capturing output of dmesg should do it.
>
> * Attach the output of hdparm -I and smartctl -a on the drive.
>
> --
> tejun
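
The sector-gap and hang-duration figures in the notes at the top of this message can be re-derived from the `end_request` lines themselves. The following is only a minimal sketch for checking that arithmetic, not a diagnostic tool: the helper names (`parse`, `gap`) are made up for illustration, the 512-byte sector size is an assumption, and the year 2010 is assumed from the mail date because syslog timestamps carry no year.

```python
import re
from datetime import datetime

SECTOR_BYTES = 512    # assumption: 512-byte logical sectors
READ_BYTES = 524288   # read size reported in the [RSS-RAW] log line

# Matches only the "end_request: I/O error" syslog lines quoted in this mail.
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: "
    r"end_request: I/O error, dev (\w+), sector (\d+)"
)


def parse(line, year=2010):
    """Return (timestamp, device, sector) for an end_request line, else None."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    stamp = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
    return stamp, m.group(2), int(m.group(3))


def gap(first_line, second_line):
    """Return (sector distance, elapsed seconds) between two error lines."""
    t1, dev1, s1 = parse(first_line)
    t2, dev2, s2 = parse(second_line)
    if dev1 != dev2:
        raise ValueError("lines refer to different devices")
    return s2 - s1, int((t2 - t1).total_seconds())


# The lines marked "First/Second bad sector encountered" for each blade.
B1_FIRST = "May 25 15:55:06 shctc-xq-ems22-me18 kernel: end_request: I/O error, dev sdc, sector 276495150"
B1_SECOND = "May 25 15:59:59 shctc-xq-ems22-me18 kernel: end_request: I/O error, dev sdc, sector 276496061"
B2_FIRST = "Jun 2 10:54:08 shctc-xm-ems21-me18 kernel: end_request: I/O error, dev sdb, sector 183410557"
B2_SECOND = "Jun 2 10:54:18 shctc-xm-ems21-me18 kernel: end_request: I/O error, dev sdb, sector 183410565"

if __name__ == "__main__":
    sectors_per_read = READ_BYTES // SECTOR_BYTES
    for name, a, b in (("Blade 1", B1_FIRST, B1_SECOND),
                       ("Blade 2", B2_FIRST, B2_SECOND)):
        sectors, seconds = gap(a, b)
        print(f"{name}: {sectors} sectors apart, {seconds} s between errors, "
              f"fits in one {sectors_per_read}-sector read: "
              f"{sectors < sectors_per_read}")
```

Under the 512-byte assumption a 524288-byte read covers 1024 sectors, so both gaps are small enough to fall inside one read. That is consistent with the guess above but does not by itself prove that the two errors came from the same syscall.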
Attachment:
smartctl_sda.dump
Description: Binary data
Attachment:
smartctl_sdc.dump
Description: Binary data
Attachment:
hdparm_sda.dump
Description: Binary data
Attachment:
hdparm_sdc.dump
Description: Binary data
Attachment:
messages
Description: Binary data