Re: Western Digital Scorpio and ICH10R on Debian - NCQ issue?

On Tue, Jul 19, 2011 at 7:20 AM, Sandra Escandor <sescandor@xxxxxxxxxx> wrote:
> I was just reading over the kernel logs that I sent again, and I am
> wondering if this might be a software issue instead, since the kernel
> log shows that the drive that seems to time out is supposedly disabled
> after disk failure (sdc was disabled by raid10 module, I think):
>
> Jul  8 14:57:19 ecs-1u kernel: [ 8753.699104] sd 2:0:0:0: [sdc]
> Unhandled error code
> Jul  8 14:57:19 ecs-1u kernel: [ 8753.699107] sd 2:0:0:0: [sdc] Result:
> hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
> Jul  8 14:57:19 ecs-1u kernel: [ 8753.699110] sd 2:0:0:0: [sdc] CDB:
> Write(10): 2a 00 3e cf 18 00 00 04 00 00
> Jul  8 14:57:19 ecs-1u kernel: [ 8753.699117] end_request: I/O error,
> dev sdc, sector 1053759488
> Jul  8 14:57:19 ecs-1u kernel: [ 8753.699144] raid10: Disk failure on
> sdc, disabling device.
> Jul  8 14:57:19 ecs-1u kernel: [ 8753.699144] raid10: Operation
> continuing on 3 devices.
>
> But then, a whole while later, there is an unhandled error code coming
> from sdc - shouldn't we no longer get this now, since it was supposedly
> disabled?

The RAID layer "disables" the device as soon as it gets an I/O request
failure. Error handling in the SCSI or libata layers may still be
running in the background at that point, and the RAID layer doesn't
wait for it to finish, so errors for sdc can still show up in the log
afterwards.
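
As a side check, the CDB bytes in your log can be decoded by hand to see
which write each error refers to. A minimal Python sketch, assuming the
standard WRITE(10) layout (opcode 0x2a, big-endian LBA in bytes 2-5,
big-endian transfer length in bytes 7-8); the helper name decode_write10
is only for illustration:

```python
def decode_write10(cdb_hex):
    """Return (lba, blocks) for a WRITE(10) CDB given as a hex string."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between bytes is accepted
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # transfer length in blocks
    return lba, blocks


# The two CDBs copied from the kernel log quoted in this thread.
for cdb in ("2a 00 3e cf 18 00 00 04 00 00",
            "2a 00 3e cf 63 00 00 04 00 00"):
    lba, blocks = decode_write10(cdb)
    print(f"LBA {lba}, {blocks} blocks")
```

The decoded LBAs are 1053759488 and 1053778688, which match the sectors
reported by end_request, so the two errors are for two different
1024-block writes rather than the same request being reported twice.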

>
> Jul  8 14:58:17 ecs-1u kernel: [ 8812.088705] sd 2:0:0:0: [sdc]
> Unhandled error code
> Jul  8 14:58:17 ecs-1u kernel: [ 8812.088710] sd 2:0:0:0: [sdc] Result:
> hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
> Jul  8 14:58:17 ecs-1u kernel: [ 8812.088714] sd 2:0:0:0: [sdc] CDB:
> Write(10): 2a 00 3e cf 63 00 00 04 00 00
> Jul  8 14:58:17 ecs-1u kernel: [ 8812.088723] end_request: I/O error,
> dev sdc, sector 1053778688
>
> Is the [sdc] output coming from libata still?
>
> Thanks for your help on this, I feel like I've been stuck for a bit :)
>
> -----Original Message-----
> From: Robert Hancock [mailto:hancockrwd@xxxxxxxxx]
> Sent: Monday, July 18, 2011 12:41 PM
> To: Sandra Escandor
> Cc: linux-ide@xxxxxxxxxxxxxxx
> Subject: Re: Western Digital Scorpio and ICH10R on Debian - NCQ issue?
>
> On Mon, Jul 18, 2011 at 6:42 AM, Sandra Escandor <sescandor@xxxxxxxxxx>
> wrote:
>> Thanks for the insight Robert. Do you (or anyone else on the list) know
>> if there are any utilities that exist that would be able to allow me to
>> observe (and log) the power consumption of the drives during high I/O?
>
> I don't think there's anything that you could do to measure this in
> software. A clamp-on ammeter on one of the power supply wires would
> give you a measurement, but it might not catch brief current spikes
> that could be causing problems.
>
> Usually these kinds of problems get fixed by trial and error (swapping
> drives between cables, a different PSU).
>
>>
>> -----Original Message-----
>> From: Robert Hancock [mailto:hancockrwd@xxxxxxxxx]
>> Sent: Friday, July 15, 2011 9:17 PM
>> To: Sandra Escandor
>> Cc: linux-ide@xxxxxxxxxxxxxxx
>> Subject: Re: Western Digital Scorpio and ICH10R on Debian - NCQ issue?
>>
>> On 07/12/2011 10:21 AM, Sandra Escandor wrote:
>>> The Situation:
>>> It appears that a WRITE FPDMA QUEUED failed command causes driver
>>> timeouts - this in turn locks up the RAID (which once worked pretty
>>> well). This occurred during high I/O.
>>>
>>> The question:
>>> 1. Is it a good idea to turn off NCQ? I've read in different posts that
>>> it helps some, but not others - I'm currently on the way to getting an
>>> experimental box setup, but I wanted to confirm if this was a good idea.
>>
>> Not really a solution to anything, at least not likely in this case.
>> More of a workaround that might happen to work by chance.
>>
>>> 2. Are there known issues with the ICH10R + WD7500BPKT-00PK4T0 and the
>>> libata driver?
>>
>> Nothing known, no.
>>
>>>
>>> The System:
>>> Four WDC WD7500BPKT-00PK4T0 drives (Western Digital Scorpio) - in RAID10
>>> array created using mdadm 3.1.4
>>> ICH10R sata controller.
>>> Kernel 2.6.32-5-amd64
>>
>> The fact that you have multiple drives and the problem tends to occur
>> during heavy I/O may point to a power issue. This has been known to
>> happen when some of the drives aren't getting enough power when there
>> are spikes in power draw during I/O access. In this case, using a
>> beefier power supply or spreading the drives out across different cables
>> from the PSU may help.
>>
>>
>