Re: Question about raid robustness when disk fails

Asdo <asdo@xxxxxxxxxxxxx> writes:

> Goswin von Brederlow wrote:
>>> Is it possible to cancel a SATA/SCSI command that is being executed by
>>> the drive?
>>> (it's probably feasible only with NCQ disabled anyway, but it's easy
>>> to disable NCQ)
>>>
>>
>> Do you want to do that? I would rather have the drive keep trying and
>> return an error if it can't read so the raid layer rewrites the blocks
>> causing it to be remapped. I do not want to wait for that but I want it
>> to happen.
>>
> So you want that to happen in the background?
> Not that much benefit for that to happen in the background, imho.
> Why not just having an error returned after a timeout, and normal MD
> read-error-recovery procedure kicking in? (recomputation from parity
> and rewrite of the damaged block)

Because the drive might just have had a seek error and need to reposition
its head. It might have been accessed on another partition and hit a
read error there that takes time to resolve. Or there may simply be
multiple reads queued on the partition. The drive taking long doesn't
mean THIS read is broken.

If you kick off a read-error recovery and get another error on another
drive, then your raid will be down as well. Better not to risk that.
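
To illustrate why a second error during recovery is fatal, here is a
minimal sketch of single-parity (RAID-5 style) reconstruction. It is not
the MD code; the names xor_blocks, reconstruct and ReadError are made up
for this example, and a block that could not be read is simply modelled
as None.

```python
from functools import reduce


class ReadError(Exception):
    """Raised when a (simulated) stripe cannot be reconstructed."""


def xor_blocks(blocks):
    """XOR equally sized byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(stripe, failed):
    """Rebuild the block at index `failed` from all other blocks.

    `stripe` holds the data blocks plus one parity block. None marks a
    block that could not be read (hypothetical model, not MD's own).
    """
    others = [blk for i, blk in enumerate(stripe) if i != failed]
    if any(blk is None for blk in others):
        # With single parity, one missing block per stripe is the limit.
        raise ReadError("second unreadable block in stripe; cannot reconstruct")
    return xor_blocks(others)


# Three data blocks plus their parity block form one stripe.
data = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
parity = xor_blocks(data)
stripe = data + [parity]
```

With one unreadable block, reconstruct() returns the lost data, which
could then be rewritten to the drive. With a second unreadable block in
the same stripe it raises ReadError, which is the case I would rather
not provoke by giving up on a slow read too early.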

>>> It's a pity we have to rely on TLER, this narrows the choice of drives
>>> a lot...
>>>
>>
>> I don't. I just acknowledge the limitation and accept the downtime
> The time might be so long that MD or the controller can drop the
> entire drive.
> It didn't happen to me but I think I read something like this on this ML...

Downtime as in: I had to shut down the system hard and remove one drive
at a time until it would boot again when I came home in the evening.

If it just hangs for 5 minutes until it kicks a drive but then continues
running, I still call that a success.

>> to
>> find and remove a broken but not properly failed disk. I use raid so I
>> don't lose my data when a disk fails, not primarily for availability.
>> So far I had one case in 10 years where a failing disk took down my
>> system.
>>

MfG
        Goswin