Again: for reallocation we need either the TRIM command or reserved
sectors set aside just for bad block reallocation. A TRIM command tells
MD which sectors are not in use; on a WRITE command MD marks the sector
as in use, and at array creation MD marks every sector as in use too.

This will only work with ext4 and swap; other filesystems don't have
TRIM. The solutions in other filesystems are based on unused blocks,
but that is internal logic of each filesystem.

I don't know which is best. TRIM is nice (we can pass TRIM down to the
disks, which helps extend their life). A bad block means a disk getting
smaller and smaller. The disk can reallocate the bad block. If it
can't, the filesystem should reallocate it (it has more information
about the logical device). It shouldn't have to: TRIM is the
information the disk needs in order to discard blocks, not filesystem
logic. But it's an option; the filesystem can reallocate.

2011/2/18 Keld Jørn Simonsen <keld@xxxxxxxxxx>:
> On Fri, Feb 18, 2011 at 10:47:28AM +0100, Giovanni Tessore wrote:
>> On 02/18/2011 03:56 AM, Keld Jørn Simonsen wrote:
>> >On Fri, Feb 18, 2011 at 01:13:32AM +0100, Giovanni Tessore wrote:
>> >>On 02/17/2011 04:44 PM, Keld Jørn Simonsen wrote:
>> >>>It should be possible to run a periodic check of whether any bad sectors
>> >>>have occurred in an array. Then the half-damaged file should be moved away
>> >>>from this area with the bad block by copying it and relinking it, and before
>> >>>relinking it to the proper place the good block corresponding to the bad
>> >>>block should be marked as a corresponding good block on the healthy disk
>> >>>drive, so that it is not allocated again. This action could even be
>> >>>triggered by the event of the detection of the bad block. This would
>> >>>probably mean that there needs to be a system call to mark a
>> >>>corresponding good block. The whole thing should be able to run in
>> >>>userland and somewhat independent of the file system type, except for
>> >>>the lookup of the corresponding file from a damaged block.
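To make the in-use tracking from the top of this message concrete, here is a minimal sketch, assuming a simple per-sector flag. The class and method names are hypothetical and nothing here is an existing md data structure; it only shows the three rules (array creation marks everything in use, WRITE marks in use, TRIM marks not in use).

```python
class InUseMap:
    """Hypothetical per-sector in-use map; not an actual md structure."""

    def __init__(self, n_sectors):
        # At array creation every sector is treated as in use.
        self.in_use = [True] * n_sectors

    def trim(self, start, count):
        # TRIM tells the block layer these sectors no longer hold data.
        for sector in range(start, start + count):
            self.in_use[sector] = False

    def write(self, start, count):
        # Any WRITE marks the touched sectors as in use again.
        for sector in range(start, start + count):
            self.in_use[sector] = True

    def free_sectors(self):
        # Sectors that could be offered as reallocation targets.
        return [s for s, used in enumerate(self.in_use) if not used]
```

With a filesystem that never sends TRIM, `free_sectors()` stays empty forever, which is why the alternative is a reserved area.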
>> >>I don't follow this.. if a file has some damaged blocks, they are gone,
>> >>moving it elsewhere does not help.
>> >Remember the file is in a RAID. So you can lose one disk drive and your
>> >data is still intact.
>> >
>> >>And however, this is a task of the filesystem.
>> >No, it is the task of the raid, as it is the raid that gives the
>> >functionality that you can lose a drive and still have your data intact.
>> >The raid level knows what is lost, and what is still good, and where
>> >this stuff is.
>> >
>> >If we are then operating on the file level, then doing something clever
>> >could be a cooperation between the raid level and the filesystem level, as
>> >described above.
>>
>> Raid of course has this functionality, but at block level; it's agnostic
>> of the filesystem on it (there may be no filesystem at all actually, as
>> for raid over raid); it does not know the word 'file'.
>
> true
>
>> Raid adds SOME level of redundancy, not infinite. If the underlying
>> hardware has damaged sectors over the redundancy level of the raid
>> configuration, data in the stripe is lost; and the hardware probably
>> should be replaced.
>>
>> Unrecoverable read errors FROM MD (those addressed by the Bad Block Log
>> feature) only appear when this redundancy level is not enough; for example:
>> - raid 1 in degraded mode with only 1 disk active, read error on the
>>   remaining disk
>> - raid 5 in degraded mode, read error on one of the active disks
>> - raid 6 in degraded mode missing 2 disks, read error on one of the
>>   active disks
>> - raid 5, read error on the same sector on more than 1 disk
>> - raid 6, read error on the same sector on more than 2 disks
>> - etc ...
>>
>> In this situation nothing can be done either at md level or at
>> filesystem level: data on the block/stripe is lost.
>
> true too.
>
> My idea was to do something when the MD RAID shifts into the degraded
> states listed above.
> Not when the MD RAID is in the states listed above,
> and getting yet another error.
>
>>
>> Remember that the Bad Block Log keeps track of the blocks/stripes that gave
>> this unrecoverable read error at md level. It has nothing to do with the
>> unreadable sector list of the underlying disks: if raid gets a read
>> error from a disk, it tries to reconstruct data from the other disks,
>> and to rewrite the sector; if it succeeds, all is ok for md (it just
>> increments the counter of corrected read errors, which is persistent for
>> superblock > 1.x); otherwise there is a write error, and the disk is
>> marked as failed.
>
> Yes, this is current behaviour.
>
> I propose that this be changed, in conjunction with a badblock raid
> feature. Supposedly the write (or read) error will become registered with
> a new badblock log. And there will be generated a report email to the
> administrator or some such with notification of the event, reporting the
> error on the disk as a read or write error, at a specific disk drive and
> a specific block.
>
> I would then like a program in userland that from the specified
> information looks up the semi-damaged file in the file system,
> tries to copy the file, and then sets a flag on the other healthy blocks
> related to the newly identified bad block, in the related badblock logs
> for the healthy drives, so that it would generate an error if the block
> is attempted to be used again.
>
> Or alternatively, I would like realloc of the badblock in the damaged
> drive, given that there be set aside an area of the RAID metadata
> for badblock realloc (in a manner similar to what is done for many disk
> drive HW). I think I prefer the latter solution.
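The second alternative above (a reserved area in the RAID metadata for badblock realloc) could look roughly like the following. This is only a sketch under my own assumptions: `RemapTable`, the spare range and the "reserved area exhausted" error are hypothetical, and md has no such table today.

```python
class RemapTable:
    """Hypothetical remap table backed by a reserved spare area."""

    def __init__(self, spare_start, spare_count):
        # Sectors set aside for reallocation (assumed layout).
        self.free_spares = list(range(spare_start, spare_start + spare_count))
        self.remapped = {}  # bad sector -> spare sector

    def realloc(self, bad_sector):
        # A sector that is already remapped keeps its existing spare.
        if bad_sector in self.remapped:
            return self.remapped[bad_sector]
        if not self.free_spares:
            # No spares left: the device should be failed and replaced.
            raise RuntimeError("reserved area exhausted")
        spare = self.free_spares.pop(0)
        self.remapped[bad_sector] = spare
        return spare

    def resolve(self, sector):
        # Every read/write goes through the table before hitting the disk.
        return self.remapped.get(sector, sector)
```

The cost of this choice is one lookup per I/O plus a fixed amount of capacity that is never visible to the filesystem, which is the same trade-off the drive firmware already makes.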
>
>>
>> >
>> >>md is just a block device (more reliable than a single disk due to some
>> >>level of redundancy), and it should be independent from the kind of file
>> >>system on it (as the file system should be independent from the kind of
>> >>block device it resides on [md, hd, flash, iscsi, ...]).
>> >true
>> >
>> >>Then what you suggest should be done for every block device that can
>> >>have bad blocks (that is, every block device). Again, this is a
>> >>filesystem issue. And of which file system type, as there are many?
>> >yes, it is a cooperation between the file system layer and the raid
>> >layer. I propose this be done in userland.
>> >
>> >>The Bad Block Log allows md to behave 'like' a real hard disk would do
>> >>with smart data:
>> >>- unreadable blocks/stripes are recorded into the log, as unreadable
>> >>  sectors are recorded into smart data
>> >>- unrecoverable read errors are reported to the caller for both
>> >>- the device still works if it has unrecoverable read errors for both
>> >>  (now the whole md device fails, this is the problem)
>> >>- if a block/stripe is rewritten with success the block/stripe is
>> >>  removed from the Bad Block Log (and the counter of relocated blocks/stripes
>> >>  is incremented); as if a sector is rewritten with success on a disk the
>> >>  sector is removed from the list of unreadable sectors, and the counter of
>> >>  relocated sectors is incremented (smart data)
>> >Smart drives also reallocate bad blocks, hiding the errors from the SW
>> >level.
>>
>> And that is the only natural place where this operation should be done.
>> Suppose you got an unrecoverable read error from md on a block. It means
>> that some sector on one (or more) of the underlying disks gave a read
>> error.
>> If you try to rewrite the md block, the sectors are rewritten to
>> the underlying disk, so either:
>> - all disks write correctly because they could solve the problem (it's a
>>   matter of their firmware, maybe relocating the sector to a reserved area):
>>   block relocated, all OK.
>> - some disks give an error on write (no more space for relocatable
>>   errors, or other hw problems): then the disk(s) is (are) marked failed,
>>   and must be replaced.
>> There is no need for reserved blocks anywhere else than those of the
>> underlying disks.
>>
>> Having reserved relocatable blocks at raid level would be useful to
>> address another situation: uncorrectable errors on write. But this is
>> another story.
>
> I agree.
>
>> >>A filesystem on a disk does not know what the firmware of the disk does
>> >>about sector relocation.
>> >>The same applies to a hardware (not fake) raid controller firmware.
>> >>The same should apply to md. It is transparent to the filesystem.
>> >Yes, normally the raid layer and the fs layer are independent.
>> >
>> >But you can add better recovery with what I suggest.
>> >
>> >>IMHO a more interesting issue would be: a write error occurs on a disk
>> >>participating in an already degraded array; failing the disk would fail
>> >>the whole array. What to do? Put the array into read only mode, still
>> >>allowing read access to data on it for easy backup? In such a situation,
>> >>what would a hardware raid controller do?
>> >>
>> >>Hm, yes.... how do hardware raid controllers behave with uncorrectable
>> >>read errors?
>> >>And how do they behave with a write error on a disk of an already degraded
>> >>array?
>> >>I guess md should replicate these behaviours.
>> >I think we should be more intelligent than ordinary HW RAID:-)
>>
>> I think it would be a good point if the software raid had the same features
>> and reliability as those mission critical hw controllers ;-)
>
> yes we can hope for such implementation.
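To pin down the two rules discussed in the quoted thread (when md has to return an unrecoverable read error, and how the Bad Block Log entry comes and goes), here is a small sketch. All names are hypothetical, the tolerance values assume plain raid1/raid5/raid6 only, and this is not how the kernel code is organised.

```python
def fault_tolerance(level, n_disks):
    """Number of per-stripe failures the level survives (assumed values)."""
    if level == "raid1":
        return n_disks - 1  # any single surviving mirror is enough
    return {"raid5": 1, "raid6": 2}[level]


def stripe_lost(level, n_disks, missing_disks, read_errors):
    # Data is lost once missing disks plus read errors on the same
    # stripe exceed the redundancy of the raid level.
    return missing_disks + read_errors > fault_tolerance(level, n_disks)


class BadBlockLog:
    """Hypothetical log of stripes that returned an unrecoverable error."""

    def __init__(self):
        self.bad = set()
        self.relocated = 0  # counter of successfully rewritten stripes

    def record(self, stripe):
        self.bad.add(stripe)

    def is_bad(self, stripe):
        return stripe in self.bad

    def rewritten(self, stripe, ok):
        # A successful rewrite clears the entry; a failed write is a
        # disk failure and is handled outside the log.
        if ok and stripe in self.bad:
            self.bad.discard(stripe)
            self.relocated += 1
```

For example, `stripe_lost("raid5", 4, 1, 1)` is true (degraded raid 5 plus one read error), while `stripe_lost("raid5", 4, 0, 1)` is false because md can still reconstruct and rewrite the sector.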
>
> Best regards
> keld
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

--
Roberto Spadim
Spadim Technology / SPAEmpresarial