Re: Pls help for LibATA caused kernel temporary hung.

Simon Li <simon.jiyou@xxxxxxxxx> · Mon, 28 Jun 2010 15:38:48 +0800

Hi, Tejun,

Here are some notes on the issue and on how our application uses the
disks:

1. Each blade server contains 4 SATA disks (all Hitachi, 250 GB).
2. The application accesses the disk partitions in raw mode, under
24x7 load.
3. The application mostly reads and writes in units of 512 KB.
4. The temporary kernel hang occurs frequently, 1 to 3 times per day,
and more than 5 times a day on the worst-performing blades. A hang
sometimes lasts only a few seconds; the longest we have observed is
about 5 minutes, after which the kernel resumes on its own.
We have a further guess about why the hang duration varies,
explained below:

*** In a single disk read operation (as said above, 512 KB in one
read(...) syscall), if the 512 KB range contains two bad sectors, then
the farther apart these two bad sectors are, the longer the kernel
hang lasts, and vice versa. The following logs illustrate the guess.

Blade 1:
May 25 15:55:06 shctc-xq-ems22-me18 kernel: ata3: translated ATA
stat/err 0x25/00 to SCSI SK/ASC/ASCQ 0x4/00/00
May 25 15:55:06 shctc-xq-ems22-me18 kernel: ata3: status=0x25 {
DeviceFault CorrectedError Error }
May 25 15:55:06 shctc-xq-ems22-me18 kernel: SCSI error : <2 0 0 0>
return code = 0x8000002
May 25 15:55:06 shctc-xq-ems22-me18 kernel: sdc: Current: sense key:
Hardware Error
May 25 15:55:06 shctc-xq-ems22-me18 kernel:     Additional sense: No
additional sense information
May 25 15:55:06 shctc-xq-ems22-me18 kernel: end_request: I/O error,
dev sdc, sector 276495150             <<--- First bad sector
encountered
---------- Kernel hung period ---------
May 25 15:59:59 shctc-xq-ems22-me18 kernel: end_request: I/O error,
dev sdc, sector 276496053
May 25 15:59:59 shctc-xq-ems22-me18 kernel: ata3: translated ATA
stat/err 0x25/00 to SCSI SK/ASC/ASCQ 0x4/00/00
May 25 15:59:59 shctc-xq-ems22-me18 kernel: ata3: status=0x25 {
DeviceFault CorrectedError Error }
May 25 15:59:59 shctc-xq-ems22-me18 kernel: SCSI error : <2 0 0 0>
return code = 0x8000002
May 25 15:59:59 shctc-xq-ems22-me18 kernel: sdc: Current: sense key:
Hardware Error
May 25 15:59:59 shctc-xq-ems22-me18 kernel:     Additional sense: No
additional sense information
May 25 15:59:59 shctc-xq-ems22-me18 kernel: end_request: I/O error,
dev sdc, sector 276496061              <<--- Second bad sector
encountered

The two bad sectors were within a single 512 KB read(...) syscall.
There are 911 sectors between them (276496061 - 276495150), and the
kernel hung for nearly 5 minutes.

Blade 2:
Jun  2 10:54:06 shctc-xm-ems21-me18 kernel: ata2: translated ATA
stat/err 0x25/00 to SCSI SK/ASC/ASCQ 0x4/00/00
Jun  2 10:54:06 shctc-xm-ems21-me18 kernel: ata2: status=0x25 {
DeviceFault CorrectedError Error }
Jun  2 10:54:06 shctc-xm-ems21-me18 kernel: SCSI error : <1 0 0 0>
return code = 0x8000002
Jun  2 10:54:06 shctc-xm-ems21-me18 kernel: sdb: Current: sense key:
Hardware Error
Jun  2 10:54:06 shctc-xm-ems21-me18 kernel:     Additional sense: No
additional sense information
Jun  2 10:54:06 shctc-xm-ems21-me18 kernel: end_request: I/O error,
dev sdb, sector 183410550
Jun  2 10:54:08 shctc-xm-ems21-me18 kernel: ata2: translated ATA
stat/err 0x25/00 to SCSI SK/ASC/ASCQ 0x4/00/00
Jun  2 10:54:08 shctc-xm-ems21-me18 kernel: ata2: status=0x25 {
DeviceFault CorrectedError Error }
Jun  2 10:54:08 shctc-xm-ems21-me18 kernel: SCSI error : <1 0 0 0>
return code = 0x8000002
Jun  2 10:54:08 shctc-xm-ems21-me18 kernel: sdb: Current: sense key:
Hardware Error
Jun  2 10:54:08 shctc-xm-ems21-me18 kernel:     Additional sense: No
additional sense information
Jun  2 10:54:08 shctc-xm-ems21-me18 kernel: end_request: I/O error,
dev sdb, sector 183410557                <<--- First bad sector
encountered
Jun  2 10:54:10 shctc-xm-ems21-me18 kernel: ata2: translated ATA
stat/err 0x25/00 to SCSI SK/ASC/ASCQ 0x4/00/00
Jun  2 10:54:18 shctc-xm-ems21-me18 kernel: ata2: status=0x25 {
DeviceFault CorrectedError Error }
Jun  2 10:54:18 shctc-xm-ems21-me18 kernel: SCSI error : <1 0 0 0>
return code = 0x8000002
Jun  2 10:54:18 shctc-xm-ems21-me18 kernel: sdb: Current: sense key:
Hardware Error
Jun  2 10:54:18 shctc-xm-ems21-me18 kernel:     Additional sense: No
additional sense information
Jun  2 10:54:18 shctc-xm-ems21-me18 kernel: end_request: I/O error,
dev sdb, sector 183410565                <<--- Second bad sector
encountered
Jun  2 10:54:18 shctc-xm-ems21-me18 kernel: [RSS-RAW] Failed to read
data to buf 0xa036a200 at 0x1383335400 of size 524288 from raw 2 in
14398

The two sectors were also within a single 512 KB read(...) syscall,
but with only 8 sectors between them the hang lasted about 10 seconds.
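
To make the comparison between the two blades easier to repeat on
other logs, here is a minimal Python sketch that pulls the timestamp
and sector number out of the end_request lines and computes the sector
gap and the elapsed time. It assumes each syslog entry is on one
unwrapped line (the excerpts above were wrapped by the mail client),
and the helper names parse_errors and gap are only for illustration.

```python
import re
from datetime import datetime

# Matches one unwrapped syslog line of the form shown in the excerpts.
END_REQUEST = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: end_request: I/O error, "
    r"dev (\w+), sector (\d+)"
)

def parse_errors(lines):
    """Return (timestamp, device, sector) for each end_request error line."""
    events = []
    for line in lines:
        m = END_REQUEST.match(line)
        if m:
            # syslog pads single-digit days with an extra space; collapse it.
            stamp = " ".join(m.group(1).split())
            # syslog has no year, so only differences between stamps are meaningful.
            when = datetime.strptime(stamp, "%b %d %H:%M:%S")
            events.append((when, m.group(2), int(m.group(3))))
    return events

def gap(first, second):
    """Sector distance and elapsed seconds between two error events."""
    return second[2] - first[2], int((second[0] - first[0]).total_seconds())

blade1 = parse_errors([
    "May 25 15:55:06 shctc-xq-ems22-me18 kernel: end_request: I/O error, dev sdc, sector 276495150",
    "May 25 15:59:59 shctc-xq-ems22-me18 kernel: end_request: I/O error, dev sdc, sector 276496061",
])
blade2 = parse_errors([
    "Jun  2 10:54:08 shctc-xm-ems21-me18 kernel: end_request: I/O error, dev sdb, sector 183410557",
    "Jun  2 10:54:18 shctc-xm-ems21-me18 kernel: end_request: I/O error, dev sdb, sector 183410565",
])

print(gap(blade1[0], blade1[1]))  # (911, 293): 911 sectors, 4 min 53 s
print(gap(blade2[0], blade2[1]))  # (8, 10): 8 sectors, 10 s
```

Both gaps are below the 1024 sectors that one 524288-byte read would
cover if the sector size is 512 bytes (an assumption on my side),
which fits the guess that each pair of errors came from one read.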

5. Attached are the smartctl, hdparm and syslog output for sda and
sdc, both drives from the same blade. Regrettably, the syslog on that
blade was overwritten by mistake, so the attached file does not
contain the log lines shown above. These two drives did hit the issue,
though, and printed the same messages in syslog (only the sector
numbers differ), so please refer to the syslog excerpts above.

6. Kernel version: 2.6.13.2, (gcc version 4.0.0 20050519 (Red Hat 4.0.0-8)).

Thanks in advance for your interest, we really look forward to fixing
this issue.

Regards
Simon

On Mon, Jun 28, 2010 at 12:09 AM, Tejun Heo <htejun@xxxxxxxxx> wrote:
>
> Hello,
>
> Please cc linux-ide@xxxxxxxxxxxxxxx when you reply.
>
> On 06/27/2010 05:58 PM, Simon Li wrote:
> > Hi, Jun,
>
> My first name happens to be Tejun.  :-)
>
> > ===== First time we observed kernel hang =======
> > May 25 15:55:06 shctc-xq-ems22-me18 kernel: ata3: translated ATA
> > stat/err 0x25/00 to SCSI SK/ASC/ASCQ 0x4/00/00
> > May 25 15:55:06 shctc-xq-ems22-me18 kernel: ata3: status=0x25 {
> > DeviceFault CorrectedError Error }
> > May 25 15:55:06 shctc-xq-ems22-me18 kernel: SCSI error : <2 0 0 0>
> > return code = 0x8000002
> > May 25 15:55:06 shctc-xq-ems22-me18 kernel: sdc: Current: sense key:
> > Hardware Error
> > May 25 15:55:06 shctc-xq-ems22-me18 kernel:     Additional sense: No
> > additional sense information
> > May 25 15:55:06 shctc-xq-ems22-me18 kernel: end_request: I/O error, dev
> > sdc, sector 276495150
>
> Looks like failing hard disk to me.  Can you please do the followings
> when you reply?
>
> * Please compose in plain text.  No html.
>
> * Attach full kernel log including the boot and error messages.
>  Capturing output of dmesg should do it.
>
> * Attach the output of hdparm -I and smartctl -a on the drive.
>
> --
> tejun

Attachments (binary data):
smartctl_sda.dump
smartctl_sdc.dump
hdparm_sda.dump
hdparm_sdc.dump
messages