Another question, off the bad-block topic but related to write-behind: in MySQL NDB I can run many machines in the cluster, and on a write, once 2 machines return "commit OK", NDB puts all the other machines into async write. That's nice, because speed is improved and I still keep one machine of redundancy. Could we implement a different write-behind method? I was talking about it in another email thread, something like:

- select which disks must be write-mostly (only read from them if all other mirrors have failed)
- select which disks MUST be committed (sync)
- select which disks MUST be write-behind (async)
- select which disks can be automatic (sync/async: if X disks have committed, these disks become write-behind for that write, and after the write they go back to non-write-behind). I don't see a solution for this in userspace, only in kernel space.

It would help a RAID1 mix of slow and fast disks; maybe the access-time problem of hard disks could be reduced, and RAID1 write speed would no longer be bound by the slowest disk. Note that write-mostly applies to read_balance, while write-behind applies to the write path. (A rough sketch of the "automatic" mode is at the end of this mail, below the quoted thread.)

2011/2/17 Keld Jørn Simonsen <keld@xxxxxxxxxx>:
> On Thu, Feb 17, 2011 at 12:45:42PM +0100, Giovanni Tessore wrote:
>> On 02/17/2011 11:58 AM, Keld Jørn Simonsen wrote:
>> >On Thu, Feb 17, 2011 at 11:45:35AM +0100, David Brown wrote:
>> >>On 17/02/2011 02:04, Keld Jørn Simonsen wrote:
>> >>>On Thu, Feb 17, 2011 at 01:30:49AM +0100, David Brown wrote:
>> >>>>On 17/02/11 00:01, NeilBrown wrote:
>> >>>>>On Wed, 16 Feb 2011 23:34:43 +0100 David Brown <david.brown@xxxxxxxxxxxx> wrote:
>> >>>>>
>> >>>>>>I thought there was some mechanism for block devices to report bad
>> >>>>>>blocks back to the file system, and that file systems tracked bad-block
>> >>>>>>lists. Modern drives automatically relocate bad blocks (at least, they
>> >>>>>>do if they can), but there was a time when they did not and it was up
>> >>>>>>to the file system to track these. Whether that still applies to modern
>> >>>>>>file systems, I do not know - the only file system I have studied in
>> >>>>>>low-level detail is FAT16.
>> >>>>>When the block device reports an error the filesystem can certainly
>> >>>>>record that information in a bad-block list, and possibly does.
>> >>>>>
>> >>>>>However I thought you were suggesting a situation where the block
>> >>>>>device could succeed with the request, but knew that area of the device
>> >>>>>was of low quality.
>> >>>>I guess that is what I was trying to suggest, though not very clearly.
>> >>>>
>> >>>>>e.g. IO to a block on a stripe which had one 'bad block'. The IO should
>> >>>>>succeed, but the data isn't as safe as elsewhere. It would be nice if we
>> >>>>>could tell the filesystem that fact, and if it could make use of it. But
>> >>>>>we currently cannot. We can say "success" or "failure", but we cannot
>> >>>>>say "success, but you might not be so lucky next time".
>> >>>>>
>> >>>>Do filesystems re-try reads when there is a failure? Could you return
>> >>>>fail on one read, then success on a re-read, which could be interpreted
>> >>>>as "dying, but not yet dead" by the file system?
>> >>>This should not be a file system feature. The file system is built upon
>> >>>the raid, and in mirrored raid types like raid1 and raid10, and also
>> >>>other raid types, you cannot be sure which specific drive and sector the
>> >>>data was read from - it could be one out of many (typically two) places.
>> >>>So the bad blocks of a raid are a feature of the raid and its individual
>> >>>drives, not the file system. If it were a property of the file system,
>> >>>then the fs would have to be aware of the underlying raid topology, and
>> >>>know whether this was a parity block or a data block of raid5 or raid6,
>> >>>or which mirror instance of a raid1/10 type was involved.
>> >>>
>> >>Thanks for the explanation.
>> >>
>> >>I guess my worry is that if the md layer has tracked a bad block on a disk,
>> >>then that stripe will be in a degraded mode. It's great that it will
>> >>still work, and it's great that the bad block list means that it is
>> >>/only/ that stripe that is degraded - not the whole raid.
>> >I am proposing that the stripe not be degraded, using a recovery area for
>> >bad blocks on the disk, that goes together with the metadata area.
>> >
>> >>But I'm hoping there can be some sort of relocation somewhere
>> >>(ultimately it doesn't matter if it is handled by the file system, or by
>> >>md for the whole stripe, or by md for just that disk block, or by the
>> >>disk itself), so that you can get raid protection again for that stripe.
>> >I think we agree in hoping :-)
>>
>> IMHO the point is that this feature (Bad Block Log) is a GREAT feature,
>> as it helps in keeping track of the health status of the underlying
>> disks, and helps A LOT in recovering data from the array when an
>> unrecoverable read error occurs (currently the full array goes offline).
>> Then something must be done proactively to repair the situation, as it
>> means that a disk of the array has problems and should be replaced. So,
>> first it's worth making a backup of the still-alive array (getting some
>> read errors when the bad blocks/stripes are encountered [maybe using
>> ddrescue or similar]), then replacing the disk and reconstructing the
>> array; after that, an fsck on the filesystem may repair the situation.
>>
>> You may argue that the unrecoverable read errors come from just a very
>> few sectors of the disk, and that it's not worth replacing it (personally
>> I would replace it even for very few), as there are still many reserved
>> sectors for relocation on the disk. Then a simple solution would just be
>> to zero-write the bad blocks in the Bad Block Log (the data is gone
>> already): if the write succeeds (the disk uses reserved sectors for
>> relocation), the blocks are removed from the log (now they are ok); then
>> fsck (hopefully) may repair the filesystem. At this point there are no
>> more md read errors, maybe just filesystem errors (the array is clean,
>> the filesystem may not be, but notice that nothing can be done to avoid
>> filesystem problems, as there has been a data loss; only fsck may help).
>
> Another way around, if the bad-blocks recovery area does not fly with
> Neil or other implementors:
>
> It should be possible to run a periodic check of whether any bad sectors
> have occurred in an array. Then the half-damaged file should be moved away
> from the area with the bad block by copying it and relinking it, and
> before relinking it to the proper place, the good block corresponding to
> the bad block should be marked on the healthy disk drive, so that it is
> not allocated again. This action could even be triggered by the event of
> the detection of the bad block. This would probably mean that there needs
> to be a system call to mark a corresponding good block.
> The whole thing should be able to run in userland and be somewhat
> independent of the file system type, except for the lookup of the
> corresponding file from a damaged block.
>
> best regards
> Keld

--
Roberto Spadim
Spadim Technology / SPAEmpresarial
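PS: to make the "automatic" mode above clearer, here is a rough userspace sketch in Python of the behaviour I have in mind. Everything in it (the Policy values, min_commits, the Mirror class) is invented just for illustration - it is not existing md or mdadm code, and the real thing would have to live in the kernel raid1 write path:

    from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
    from enum import Enum, auto

    class Policy(Enum):
        SYNC = auto()          # must be committed before the write returns
        WRITE_BEHIND = auto()  # always async (like current write-behind)
        AUTO = auto()          # sync until enough disks committed, then write-behind

    class Mirror:
        def __init__(self, name, policy):
            self.name = name
            self.policy = policy

        def write(self, data):
            # stand-in for the real device write
            return f"{self.name}: {len(data)} bytes"

    def raid1_write(mirrors, data, min_commits=2):
        """Return once every SYNC mirror and at least `min_commits` mirrors
        in total have committed; AUTO/WRITE_BEHIND mirrors that have not
        committed yet keep writing in the background (write-behind)."""
        pool = ThreadPoolExecutor(max_workers=len(mirrors))
        futures = {pool.submit(m.write, data): m for m in mirrors}

        pending = set(futures)
        committed = 0
        sync_left = sum(1 for m in mirrors if m.policy is Policy.SYNC)
        while pending and (sync_left > 0 or committed < min_commits):
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for f in done:
                committed += 1
                if futures[f].policy is Policy.SYNC:
                    sync_left -= 1
        pool.shutdown(wait=False)   # leftover writes finish asynchronously
        return committed

    mirrors = [
        Mirror("fast-ssd-a", Policy.SYNC),
        Mirror("fast-ssd-b", Policy.AUTO),
        Mirror("slow-hdd-c", Policy.AUTO),
        Mirror("slow-hdd-d", Policy.WRITE_BEHIND),
    ]
    raid1_write(mirrors, b"some data", min_commits=2)

With a mix like the one above, the write returns as soon as the SYNC disk plus one AUTO disk have committed (min_commits=2), while the slow disks keep writing in the background, so the RAID1 write is no longer bound by the slowest member.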