Re: md road-map: 2011

On 02/17/2011 11:58 AM, Keld Jørn Simonsen wrote:
> On Thu, Feb 17, 2011 at 11:45:35AM +0100, David Brown wrote:
>> On 17/02/2011 02:04, Keld Jørn Simonsen wrote:
>>> On Thu, Feb 17, 2011 at 01:30:49AM +0100, David Brown wrote:
>>>> On 17/02/11 00:01, NeilBrown wrote:
>>>>> On Wed, 16 Feb 2011 23:34:43 +0100 David Brown <david.brown@xxxxxxxxxxxx>
>>>>> wrote:

>>>>>> I thought there was some mechanism for block devices to report bad
>>>>>> blocks back to the file system, and that file systems tracked bad block
>>>>>> lists.  Modern drives automatically relocate bad blocks (at least, they
>>>>>> do if they can), but there was a time when they did not, and it was up
>>>>>> to the file system to track these.  Whether that still applies to modern
>>>>>> file systems, I do not know - the only file system I have studied in
>>>>>> low-level detail is FAT16.
>>>>> When the block device reports an error the filesystem can certainly
>>>>> record that information in a bad-block list, and possibly does.

>>>>> However I thought you were suggesting a situation where the block device
>>>>> could succeed with the request, but knew that area of the device was of
>>>>> low quality.
>>>> I guess that is what I was trying to suggest, though not very clearly.

>>>>> e.g. IO to a block on a stripe which had one 'bad block'.  The IO should
>>>>> succeed, but the data isn't as safe as elsewhere.  It would be nice if we
>>>>> could tell the filesystem that fact, and if it could make use of it.  But
>>>>> we currently cannot.  We can say "success" or "failure", but we cannot
>>>>> say "success, but you might not be so lucky next time".

>>>> Do filesystems re-try reads when there is a failure?  Could you return
>>>> fail on one read, then success on a re-read, which could be interpreted
>>>> as "dying, but not yet dead" by the file system?
>>> This should not be a file system feature. The file system is built upon
>>> the raid, and in mirrored raid types like raid1 and raid10, and also in
>>> other raid types, you cannot be sure which specific drive and sector the
>>> data was read from - it could be one out of many (typically two) places.
>>> So the bad blocks of a raid are a feature of the raid and its individual
>>> drives, not of the file system. If it were a property of the file system,
>>> then the fs would have to be aware of the underlying raid topology, and
>>> know whether this was a parity block or a data block of raid5 or raid6,
>>> or which mirror instance of a raid1/10 type was involved.

>> Thanks for the explanation.

>> I guess my worry is that if the md layer has tracked a bad block on a disk,
>> then that stripe will be in a degraded mode.  It's great that it will
>> still work, and it's great that the bad block list means that it is
>> /only/ that stripe that is degraded - not the whole raid.
> I am proposing that the stripe not be degraded, by using a recovery area for
> bad blocks on the disk that goes together with the metadata area.

>> But I'm hoping there can be some sort of relocation somewhere
>> (ultimately it doesn't matter if it is handled by the file system, or by
>> md for the whole stripe, or by md for just that disk block, or by the
>> disk itself), so that you can get raid protection again for that stripe.
> I think we agree in hoping :-)

IMHO the point is that this feature (the Bad Block Log) is a GREAT feature, as it helps keep track of the health status of the underlying disks, and helps A LOT in recovering data from the array when an unrecoverable read error occurs (currently the full array goes offline). Then something must be done proactively to repair the situation, as it means that a disk of the array has problems and should be replaced. So, first it's worth making a backup of the still-alive array (accepting some read errors when the bad blocks/stripes are encountered [maybe using ddrescue or similar]), then replacing the disk and reconstructing the array; after that an fsck on the filesystem may repair the situation.
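
To make that concrete, here is a rough sketch of that sequence, with purely
illustrative names: /dev/md0 is the array, /dev/sdb1 the failing member,
/dev/sdc1 the replacement, and /mnt/backup a location with enough space.

  # 1. Image whatever is still readable; ddrescue records unreadable areas
  #    in the mapfile and keeps going instead of aborting on the first error.
  ddrescue /dev/md0 /mnt/backup/md0.img /mnt/backup/md0.map

  # 2. Swap out the suspect member and let md rebuild onto the new disk.
  mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
  mdadm /dev/md0 --add /dev/sdc1
  cat /proc/mdstat          # watch the recovery progress

  # 3. Once the array is clean again, check the filesystem.
  fsck -f /dev/md0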

You may argue that the unrecoverable read errors come from just a very few sectors of the disk, and that it's not worth replacing it (personally I would replace it even for a very few), as there are still many reserved sectors for relocation on the disk. Then a simple solution would be to zero-write the bad blocks listed in the Bad Block Log (the data is gone already): if the write succeeds (the disk uses reserved sectors for relocation), the blocks are removed from the log (now they are ok); then fsck (hopefully) may repair the filesystem. At this point there are no more md read errors, maybe just filesystem errors (the array is clean, the filesystem may not be, but notice that nothing can be done to avoid filesystem problems, as there has been data loss; only fsck may help).
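
A minimal sketch of the zero-write idea at the member-disk level, assuming the
bad LBA has already been identified (e.g. from the kernel log or a SMART
self-test); the device /dev/sdb, sector 123456789, and a 512-byte logical
sector size are all just illustrative assumptions, and how the md bad block
list itself gets cleared depends on how the feature ends up implemented:

  # Overwrite the bad sector; the drive either rewrites it in place or
  # remaps it to a spare sector (its old contents are lost anyway).
  dd if=/dev/zero of=/dev/sdb bs=512 seek=123456789 count=1 oflag=direct

  # Verify the sector now reads back, then check the reallocation counters.
  dd if=/dev/sdb of=/dev/null bs=512 skip=123456789 count=1 iflag=direct
  smartctl -A /dev/sdb | grep -i -e reallocated -e pending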

Regards

--
Yours faithfully.

Giovanni Tessore



