Another question, off the bad-block topic but related to write-behind: in MySQL NDB I can run many machines in the cluster, and on a write, once 2 machines return "commit OK", NDB puts all the other machines into async write. That's nice, because speed is improved and I still keep one machine of redundancy. Could we implement a different write-behind method? I was talking about it in another email thread, something like:

- select which disks must be write-mostly (only read from them if all other mirrors have failed)
- select which disks MUST be committed (sync)
- select which disks MUST be write-behind (async)
- select which disks can be automatic (sync/async: if X disks have committed, these disks become write-behind for that write, and after the write they go back to non-write-behind). I don't see a solution for this in userspace, only in kernel space.

It would help a RAID1 mix of slow and fast disks; maybe the access-time problem of hard disks could be reduced, and RAID1 write speed would no longer be bound by the slowest disk. Note that write-mostly applies to read_balance, while write-behind applies to the write path. (A rough sketch of the "automatic" mode is at the end of this mail, below the quoted thread.)

2011/2/17 Keld Jørn Simonsen <keld@xxxxxxxxxx>:
> On Thu, Feb 17, 2011 at 12:45:42PM +0100, Giovanni Tessore wrote:
>> On 02/17/2011 11:58 AM, Keld Jørn Simonsen wrote:
>> >On Thu, Feb 17, 2011 at 11:45:35AM +0100, David Brown wrote:
>> >>On 17/02/2011 02:04, Keld Jørn Simonsen wrote:
>> >>>On Thu, Feb 17, 2011 at 01:30:49AM +0100, David Brown wrote:
>> >>>>On 17/02/11 00:01, NeilBrown wrote:
>> >>>>>On Wed, 16 Feb 2011 23:34:43 +0100 David Brown <david.brown@xxxxxxxxxxxx> wrote:
>> >>>>>
>> >>>>>>I thought there was some mechanism for block devices to report bad
>> >>>>>>blocks back to the file system, and that file systems tracked bad-block
>> >>>>>>lists. Modern drives automatically relocate bad blocks (at least, they
>> >>>>>>do if they can), but there was a time when they did not and it was up
>> >>>>>>to the file system to track these. Whether that still applies to modern
>> >>>>>>file systems, I do not know - the only file system I have studied in
>> >>>>>>low-level detail is FAT16.
>> >>>>>When the block device reports an error the filesystem can certainly
>> >>>>>record that information in a bad-block list, and possibly does.
>> >>>>>
>> >>>>>However I thought you were suggesting a situation where the block
>> >>>>>device could succeed with the request, but knew that area of the device
>> >>>>>was of low quality.
>> >>>>I guess that is what I was trying to suggest, though not very clearly.
>> >>>>
>> >>>>>e.g. IO to a block on a stripe which had one 'bad block'. The IO should
>> >>>>>succeed, but the data isn't as safe as elsewhere. It would be nice if we
>> >>>>>could tell the filesystem that fact, and if it could make use of it. But
>> >>>>>we currently cannot. We can say "success" or "failure", but we cannot
>> >>>>>say "success, but you might not be so lucky next time".
>> >>>>>
>> >>>>Do filesystems re-try reads when there is a failure? Could you return
>> >>>>fail on one read, then success on a re-read, which could be interpreted
>> >>>>as "dying, but not yet dead" by the file system?
>> >>>This should not be a file system feature. The file system is built upon
>> >>>the raid, and in mirrored raid types like raid1 and raid10, and also
>> >>>other raid types, you cannot be sure which specific drive and sector the
>> >>>data was read from - it could be one out of many (typically two) places.
>> >>>So the bad blocks of a raid are a feature of the raid and its individual
>> >>>drives, not the file system. If it were a property of the file system,
>> >>>then the fs would have to be aware of the underlying raid topology, and
>> >>>know whether this was a parity block or a data block of raid5 or raid6,
>> >>>or which mirror instance of a raid1/10 type was involved.
>> >>>
>> >>Thanks for the explanation.
>> >>
>> >>I guess my worry is that if the md layer has tracked a bad block on a disk,
>> >>then that stripe will be in a degraded mode. It's great that it will
>> >>still work, and it's great that the bad block list means that it is
>> >>/only/ that stripe that is degraded - not the whole raid.
>> >I am proposing that the stripe not be degraded, using a recovery area for
>> >bad blocks on the disk, that goes together with the metadata area.
>> >
>> >>But I'm hoping there can be some sort of relocation somewhere
>> >>(ultimately it doesn't matter if it is handled by the file system, or by
>> >>md for the whole stripe, or by md for just that disk block, or by the
>> >>disk itself), so that you can get raid protection again for that stripe.
>> >I think we agree in hoping :-)
>>
>> IMHO the point is that this feature (Bad Block Log) is a GREAT feature,
>> as it helps in keeping track of the health status of the underlying
>> disks, and helps A LOT in recovering data from the array when an
>> unrecoverable read error occurs (currently the full array goes offline).
>> Then something must be done proactively to repair the situation, as it
>> means that a disk of the array has problems and should be replaced. So,
>> first it's worth making a backup of the still-alive array (getting some
>> read errors when the bad blocks/stripes are encountered [maybe using
>> ddrescue or similar]), then replacing the disk and reconstructing the
>> array; after that, an fsck on the filesystem may repair the situation.
>>
>> You may argue that the unrecoverable read errors come from just a very
>> few sectors of the disk, and that it's not worth replacing it (personally
>> I would replace it even for very few), as there are still many reserved
>> sectors for relocation on the disk. Then a simple solution would just be
>> to zero-write the bad blocks in the Bad Block Log (the data is gone
>> already): if the write succeeds (the disk uses reserved sectors for
>> relocation), the blocks are removed from the log (now they are ok); then
>> fsck (hopefully) may repair the filesystem. At this point there are no
>> more md read errors, maybe just filesystem errors (the array is clean,
>> the filesystem may not be, but notice that nothing can be done to avoid
>> filesystem problems, as there has been a data loss; only fsck may help).
>
> Another way around, if the bad-blocks recovery area does not fly with
> Neil or other implementors:
>
> It should be possible to run a periodic check of whether any bad sectors
> have occurred in an array. Then the half-damaged file should be moved away
> from the area with the bad block by copying it and relinking it, and
> before relinking it to the proper place, the good block corresponding to
> the bad block should be marked on the healthy disk drive, so that it is
> not allocated again. This action could even be triggered by the event of
> the detection of the bad block. This would probably mean that there needs
> to be a system call to mark a corresponding good block.
> The whole thing should be able to run in userland and be somewhat
> independent of the file system type, except for the lookup of the
> corresponding file from a damaged block.
>
> best regards
> Keld

--
Roberto Spadim
Spadim Technology / SPAEmpresarial
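PS: to make the "automatic" mode above clearer, here is a rough userspace sketch in Python of the behaviour I have in mind. Everything in it (the Policy values, min_commits, the Mirror class) is invented just for illustration - it is not existing md or mdadm code, and the real thing would have to live in the kernel raid1 write path:

    from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
    from enum import Enum, auto

    class Policy(Enum):
        SYNC = auto()          # must be committed before the write returns
        WRITE_BEHIND = auto()  # always async (like current write-behind)
        AUTO = auto()          # sync until enough disks committed, then write-behind

    class Mirror:
        def __init__(self, name, policy):
            self.name = name
            self.policy = policy

        def write(self, data):
            # stand-in for the real device write
            return f"{self.name}: {len(data)} bytes"

    def raid1_write(mirrors, data, min_commits=2):
        """Return once every SYNC mirror and at least `min_commits` mirrors
        in total have committed; AUTO/WRITE_BEHIND mirrors that have not
        committed yet keep writing in the background (write-behind)."""
        pool = ThreadPoolExecutor(max_workers=len(mirrors))
        futures = {pool.submit(m.write, data): m for m in mirrors}

        pending = set(futures)
        committed = 0
        sync_left = sum(1 for m in mirrors if m.policy is Policy.SYNC)
        while pending and (sync_left > 0 or committed < min_commits):
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for f in done:
                committed += 1
                if futures[f].policy is Policy.SYNC:
                    sync_left -= 1
        pool.shutdown(wait=False)   # leftover writes finish asynchronously
        return committed

    mirrors = [
        Mirror("fast-ssd-a", Policy.SYNC),
        Mirror("fast-ssd-b", Policy.AUTO),
        Mirror("slow-hdd-c", Policy.AUTO),
        Mirror("slow-hdd-d", Policy.WRITE_BEHIND),
    ]
    raid1_write(mirrors, b"some data", min_commits=2)

With a mix like the one above, the write returns as soon as the SYNC disk plus one AUTO disk have committed (min_commits=2), while the slow disks keep writing in the background, so the RAID1 write is no longer bound by the slowest member.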