Re: md road-map: 2011

Roberto Spadim <roberto@xxxxxxxxxxxxx> · Fri, 18 Feb 2011 17:00:27 -0200

Again... for reallocation we need either the TRIM command or reserved
sectors set aside just for bad block reallocation. The TRIM command tells
MD which sectors aren't in use; on a WRITE command MD marks the sector as
in use, and at array creation MD marks every sector as in use too. This
will only work with ext4 and swap; other filesystems don't have TRIM. The
solutions in other filesystems are based on unused blocks, but that's
internal logic of each filesystem. I don't know which is best. The TRIM
command is nice (we can pass TRIM down to the disks, which helps extend
their life). A bad block means a disk getting smaller and smaller; the
disk can reallocate the bad block itself. If it can't, the filesystem
should reallocate it (it has more information about the logical device.
It shouldn't have to: the TRIM command is the information the disk needs
to discard blocks, not filesystem logic. But it's an option; the
filesystem can reallocate).
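
To make the in-use tracking above concrete, here is a minimal sketch in
Python. It is not md code: the names InUseMap, trim, write and find_spare
are made up for illustration, and a real implementation would need a
persistent on-disk bitmap rather than a list in memory.

```python
class InUseMap:
    """Hypothetical map of which sectors of an array hold live data."""

    def __init__(self, num_sectors):
        # At array creation every sector is treated as in use.
        self.in_use = [True] * num_sectors

    def trim(self, start, count):
        # TRIM from the filesystem: these sectors no longer hold data.
        for s in range(start, start + count):
            self.in_use[s] = False

    def write(self, start, count):
        # Any WRITE marks the sectors as in use again.
        for s in range(start, start + count):
            self.in_use[s] = True

    def find_spare(self, bad_sector):
        # Return a sector that is not in use (and is not the bad one)
        # as a possible reallocation target, or None if there is none.
        for s, used in enumerate(self.in_use):
            if not used and s != bad_sector:
                return s
        return None
```

With this bookkeeping, a freshly created array has no spare sectors at
all until the filesystem sends TRIM, which is why the approach depends on
filesystems that actually issue it.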

2011/2/18 Keld Jørn Simonsen <keld@xxxxxxxxxx>:
> On Fri, Feb 18, 2011 at 10:47:28AM +0100, Giovanni Tessore wrote:
>> On 02/18/2011 03:56 AM, Keld Jørn Simonsen wrote:
>> >On Fri, Feb 18, 2011 at 01:13:32AM +0100, Giovanni Tessore wrote:
>> >>On 02/17/2011 04:44 PM, Keld Jørn Simonsen wrote:
>> >>>It should be possible to run a periodic check of if any bad sectors have
>> >>>occurred in an array. Then the half-damaged file should be moved away
>> >>>from
>> >>>this area with the bad block by copying it and relinking it, and before
>> >>>relinking it to the proper place the good block corresponding to the bad
>> >>>block should be marked as a corresponding good block on the healthy disk
>> >>>drive, so that it not be allocated again. This action could even be
>> >>>triggered by the event of the detection of the bad block. This would
>> >>>probably mean that there needs to be a system call to mark a
>> >>>corresponding good block. The whole thing should be able to run in
>> >>>userland and somewhat independent of the file system type, except for
>> >>>the lookup of the corresponding file from a damaged block.
>> >>I don't follow this.. if a file has some damaged blocks, they are gone,
>> >>moving it elsewhere does not help.
>> >Remember the file is in a RAID. So you can lose one disk drive and your
>> >data is still intact.
>> >
>> >>And however, this is a task of the filesystem.
>> >No, it is the task of the raid, as it is the raid that gives the
>> >functionality that you can lose a drive and still have your data intact.
>> >the raid level knows what is lost, and  what is still good, and where
>> >this stuff is.
>> >
>> >If we are then operating on the file level, then doing something clever
>> >could
>> >be a cooperation between the raid level and the filesystem level, as
>> >described above.
>>
>> Raid of course has this functionality, but at block level; it's agnostic
>> of the filesystem on it (there may be no filesystem at all actually, as
>> for raid over raid); it does not know the word 'file'.
>
> true
>
>> Raid adds SOME level of redundancy, not infinite. If the underlying
>> hardware has damaged sectors over the redundancy level of the raid
>> configuration, data in the stripe is lost; and the hardware probably
>> should be replaced.
>>
>> Unrecoverable read errors FROM MD (those addressed by Bad Block Log
>> feature) only appear when this redundancy level is not enough; for example:
>> - raid 1 in degraded mode with only 1 disk active, read error on the
>> remaining disk
>> - raid 5 in degraded mode, read error on one of the active disks
>> - raid 6 in degraded mode missing 2 disks, read error on one of the
>> active disks
>> - raid 5, read error on the same sector on more than 1 disk
>> - raid 6, read error on the same sector on more than 2 disks
>> - etc ...
>>
>> in this situation nothing can be done, either at md level or at
>> filesystem level: data on the block/stripe is lost.
>
> true too.
>
> My idea was to do something when the MD RAID shifts into the degraded
> states listed above. Not when the MD RAID is in the states listed above,
> and getting yet another error.
>
>>
>> Remember that the Bad Block Log keeps track of the blocks/stripes that gave
>> this unrecoverable read error at md level. It has nothing to do with the
>> unreadable sector list of the underlying disks: if raid gets a read
>> error from a disk, it tries to reconstruct data from the other disks,
>> and to rewrite the sector; if it succeeds, all is ok for md (it just
>> increments the counter of corrected read errors, which is persistent for
>> superblock > 1.x); otherwise there is a write error, and the disk is
>> marked as failed.
>
> Yes, this is current behaviour.
>
> I propose that this be changed, in conjunction with a badblock raid
> feature. Supposedly the write (or read) error will become registered with
> a new badblock log. And there will be generated a report email to the
> administrator or some such with notification of the event, reporting the
> error on the disk as a read or write error, at a specific disk drive and
> a specific block.
>
> I would then like a program in userland that from the specified
> information looks up the semi-damaged file in the file system,
> tries to copy the file, and then sets a flag on other healthy blocks
> related to the newly identified badblock in the related badblock logs
> for the healthy drives, so that it would generate an error if the block
> is attempted to be used again.
>
> Or alternatively, I would like realloc of the badblock in the damaged
> drive, given that there be set aside an area of the RAID metadata
> for badblock realloc (in a manner similar to what is done for many disk
> drive HW). I think I prefer the latter solution.
>
>
>
>>
>> >
>> >>md is just a block device (more reliable than a single disk due to some
>> >>level of redundancy), and it should be independent from the kind of file
>> >>system on it (as the file system should be independent from the kind of
>> >>block device it resides on [md, hd, flash, iscsi, ...]).
>> >true
>> >
>> >>Then what you suggest should be done for every block device that can
>> >>have bad blocks (that is, every block device). Again, this is a
>> >>filesystem issue. And of which file system type, as there are many?
>> >yes, it is a cooperation between the file system layer, and the raid
>> >layer, I propose this be done in userland.
>> >
>> >>The Bad Block Log allows md to behave 'like' a real hard disk would do
>> >>with smart data:
>> >>- unreadable blocks/stripes are recorded into the log, as unreadable
>> >>sectors are recorded into smart data
>> >>- unrecoverable read errors are reported to the caller for both
>> >>- the device still works if it has unrecoverable read errors for both
>> >>(now the whole md device fails, this is the problem)
>> >>- if a block/stripe is rewritten with success, the block/stripe is
>> >>removed from Bad Block Log (and the counter of relocated blocks/stripes
>> >>is incremented); as if a sector is rewritten with success on a disk the
>> >>sector is removed from list of unreadable sector, and the counter of
>> >>relocated sector is incremented (smart data)
>> >Smart drives also reallocate bad blocks, hiding the errors from the SW
>> >level.
>>
>> And that is the only natural place where this operation should be done.
>> Suppose you got a unrecoverable read error from md on a block. It means
>> that some sector on one (or more) of the underlying disks gave a read
>> error. If you try to rewrite the md block, the sectors are rewritten to
>> the underlying disk, so either:
>> - all disks write correctly because they could solve the problem (it's a
>> matter of their firmware, maybe relocating the sector on reserved area):
>> block relocated, all OK.
>> - some disks give an error on write (no more space for relocatable
>> errors, or other hw problems): then the disk(s) is(are) marked failed,
>> and must be replaced.
>> There is no need for reserved blocks anywhere else than those of the
>> underlying disks.
>>
>> Having reserved relocatable blocks at raid level would be useful to
>> address another situation: uncorrectable errors on write. But this is
>> another story.
>
> I agree.
>
>> >>A filesystem on a disk does not know what the firmware of the disk does
>> >>about sectors relocation.
>> >>The same applies for a hardware (not fake) raid controller firmware.
>> >>The same should apply for md. It is transparent to the filesystem.
>> >Yes, normally the raid layer and the fs layer are independent.
>> >
>> >But you can add better recovery with what I suggest.
>> >
>> >>IMHO a more interesting issue would be: a write error occurs on a disk
>> >>participating to an already degraded array; failing the disk would fail
>> >>the whole array. What to do? Put the array into read only mode, still
>> >>allowing read access to data on it for easy backup? In such situation,
>> >>what would do a hardware raid controller?
>> >>
>> >>Hm, yes.... how do behave hardware raid controllers with uncorrectable
>> >>read errors?
>> >>And how they behave with write error on a disk of an already degraded
>> >>array?
>> >>I guess md should replicate these behaviours.
>> >I think we should be more intelligent than ordinary HW RAID:-)
>>
>> I think it is a good point if the software raid had the same features
>> and reliability of those mission critical hw controllers ;-)
>
> yes we can hope for such implementation.
>
> Best regards
> keld
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial