Carlos Knowlton wrote:
> I want to understand exactly what is going on in the Software RAID 5
> code when a drive is marked "dirty", and booted from the array. Based
> on what I've read so far, it seems that this happens any time the RAID
> software runs into a read or write error that might have been corrected
> by fsck (if it had been there first). Is this true?

You're mixing up two very different things here.

Fsck has nothing to do with raid, per se. Fsck checks the filesystem, which sits on top of a block device (be it a raid array, a disk, a loopback device, whatever). It does not understand or know about "raid" at all. From raid's point of view, the filesystem is upper-level stuff; the raid code knows nothing about filesystems or any other data it stores. Likewise, the filesystem knows nothing about the underlying component devices of the raid array it resides on -- so fsck can NOT "fix" an error that happened two layers down the stack (filesystem, raid, underlying devices).

From the other side, the raid code ensures (or tries to, anyway) that errors on the underlying component devices do not propagate to the upper level (be it a filesystem, a database or anything else -- raid does not care what data it stores). It is there to "hide" whatever errors may happen on the physical device (disk drive). Currently, if enough drives fail, the raid array is "shut down" so that the upper level (e.g. the filesystem) can't access the array at all. Until that happens, no errors should propagate to the filesystem layer: all such errors are corrected by the raid code, ensuring the filesystem reads back the same data that was written to it.

> Is there a "retry" parameter that can be set in the kernel parameters,
> or else in the code itself to prolong the existence of a drive in an
> array before it is considered dirty?

There's no such parameter currently.
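To make the "raid hides device errors" point concrete, here is a toy illustration (nothing like the actual md driver code, just the underlying XOR arithmetic) of why a single failed read on a RAID5 member never reaches the filesystem: the missing chunk is rebuilt from the surviving chunks plus parity.

```python
def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def raid5_parity(chunks):
    """Compute the parity chunk for one stripe of data chunks."""
    parity = bytes(len(chunks[0]))
    for c in chunks:
        parity = xor_bytes(parity, c)
    return parity

def reconstruct(surviving_chunks, parity):
    """Rebuild a single lost chunk from the other chunks plus parity."""
    missing = parity
    for c in surviving_chunks:
        missing = xor_bytes(missing, c)
    return missing

data = [b"AAAA", b"BBBB", b"CCCC"]   # three data chunks in one stripe
parity = raid5_parity(data)

# Simulate a read error on the second member: its chunk is unavailable,
# yet the upper layer (filesystem) still gets the original data back.
rebuilt = reconstruct([data[0], data[2]], parity)
assert rebuilt == b"BBBB"
```

This is the same reason a RAID5 array survives one failed member but not two: with two chunks missing from a stripe, the XOR equation can no longer be solved.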
But there have been several discussions about how to make the raid code more robust -- in particular, in case of a read error the raid code could keep the erroring drive in the array and mark it faulty only in case of a write error.

> If so, I would like to increase it in my environment, because it seems
> like I'm losing drives in my array that are often still quite stable.

I think you have to provide some more information. The kernel log gives a lot of detail about what exactly is happening and what the raid code is doing as a result. The raid code is quite stable and is used on a lot of machines all over the world. If you're experiencing such weird behaviour, I think it's due to some other problem on your side, and the best thing would be to find and fix the real error, not the symptom.

/mjt
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
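The policy discussed above -- tolerate read errors, evict only on write errors -- can be sketched as a tiny state machine. This is purely hypothetical pseudologic for the proposal (the member name and function names are made up), not current md behaviour:

```python
# Hypothetical sketch of the proposed error policy, NOT current md code:
# a read error leaves the member in the array (the block is served by
# parity reconstruction instead); only a write error marks it faulty.

class Member:
    def __init__(self, name):
        self.name = name
        self.faulty = False
        self.read_errors = 0

def on_read_error(member):
    # Tolerated: count it and serve the block from parity.
    member.read_errors += 1
    return "reconstruct-from-parity"

def on_write_error(member):
    # Fatal for this member: the drive can no longer hold data.
    member.faulty = True
    return "mark-faulty"

d = Member("sdb1")                 # hypothetical member device
assert on_read_error(d) == "reconstruct-from-parity"
assert d.faulty is False           # still in the array after a read error
assert on_write_error(d) == "mark-faulty"
assert d.faulty is True            # evicted only after a write error
```

The attraction of this policy is that a drive with a few remapped or unreadable sectors often still accepts writes fine, so keeping it around preserves redundancy instead of degrading the array immediately.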