I think we should be more intelligent than ordinary HW RAID :-)
that's why SW RAID is better =)

==========
some IDEAS for badblock:
==========
the point here is: MD is a virtual hard disk, and must operate like one
hard disk (or SSD, or mixed SSD+HD)

raid0 = many disks (inside a hard disk we have many platters and many
heads, right? it's something like that, without the control of head
positioning and the SATA interface)
raid1 = something that doesn't exist in hard disks: mirrors (maybe some
disks use it as a badblock solution and we don't know, but mirrors are
used today for device redundancy)
raid456 = an ECC or checksum of the 'disk'?! maybe something like it...

the badblock problem/solution:
many disks handle it internally, some disks reallocate online, some
report that the block has failed and the filesystem must work around it
(disks report that the block has failed because they couldn't reallocate
it or don't have this feature; the filesystem must stop or write to
another device, maybe an out-of-space problem at the app level... in most
cases the user must say what to do, or it is reported to the kernel log
or a user space log).

my opinion...
since a badblock is a device block problem, md must handle this problem
(today with the mirror marked as failed, in the near future with
badblocks)
the filesystem must know that if a badblock exists the device will get
smaller (with less space)
some filesystems know about the badblock problem and try to store the
information on another sector without data (the filesystem should report
unused sectors to the device with a TRIM command, but they don't (today
just ext4 with discard and swap have it))

for sw raid badblock...
maybe we need not only a badblock list, but online realloc
(for the first step a badblock list is good, for the second step an
online realloc)

types of realloc (in all cases the md array will get smaller; if not,
mark the smaller mirror as 'badblocked')
1) device realloc: reallocate the device block just on the bad mirror
2) md mirror realloc: if one mirror has a badblock, md will use another
   mirror to read that sector
(maybe the first option is better, but when the first fails we use this
option and mark the md array as 'insync with badblocks')

maybe a per-mirror flag about bad blocks is a nice feature, and maybe a
per-mirror exported badblock list is nice too; at the md level, a
badblock list of virtual badblocks (all mirrors with the same sector bad,
like a single hard disk without a mirror)

note that we must implement a layout in all raid levels (badblock
reallocation is a dynamic layout)
note that for reallocation we must use a TRIM-like command (here our
friend said that badblock is a filesystem problem; I don't see it as a
filesystem problem, but as a device and filesystem problem, since an SSD
can use non-allocated sectors to optimize the speed and lifetime of NAND
cells)

the trim command tells us whether a sector with a 00000000000...000 value
is in use or not
when is it in use? when we write to the sector
when is it not in use? at array startup, and when the filesystem sends a
trim command to an MD device and MD marks the block as not in use.
note that MD must not send TRIM commands to data blocks; it can send TRIM
to parity devices (raid456)
not in use = 0 in the TRIM bit + 0000 in the sector bytes

how to make the trim command?
internally we need a 0/1 bit value that tells us whether the block is in
use or not.
the problem: for a filesystem that needs a block size of 4096 bytes, md
will use 4096 bytes + 1 bit
the first solution (a filesystem problem): use a 4095-byte block size for
the filesystem and use 1 byte for trim information
second solution (an md problem): group many bits in one block.
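
a minimal sketch of the second solution, just to show the arithmetic. it
assumes a layout where each group of 32768 data blocks (4096 bytes * 8
bits) is followed by one block holding their in-use bits. the names
locate and InUseMap are invented for illustration (nothing like them
exists in md), and the dict only stands in for the on-disk TRIM blocks:

```python
BLOCK_SIZE = 4096                 # bytes per block, as in the text
BITS_PER_BLOCK = BLOCK_SIZE * 8   # 32768 in-use bits fit in one block


def locate(logical_block):
    """Map a logical data block to (physical block, TRIM block, bit index).

    Assumed layout: 32768 data blocks, then 1 TRIM block, repeated.
    """
    group, bit = divmod(logical_block, BITS_PER_BLOCK)
    physical = logical_block + group   # one TRIM block per earlier group
    trim_block = (group + 1) * BITS_PER_BLOCK + group
    return physical, trim_block, bit


class InUseMap:
    """Hypothetical in-memory stand-in for the on-disk TRIM blocks."""

    def __init__(self):
        self.trim_blocks = {}   # TRIM block number -> bytearray of bits

    def _bits(self, trim_block):
        # A missing TRIM block reads as all zeros (nothing in use yet).
        return self.trim_blocks.setdefault(trim_block, bytearray(BLOCK_SIZE))

    def mark_written(self, logical_block):
        # A write puts the block in use: set its bit to 1.
        _, trim_block, bit = locate(logical_block)
        self._bits(trim_block)[bit // 8] |= 1 << (bit % 8)

    def mark_trimmed(self, logical_block):
        # A TRIM from the filesystem clears the bit back to 0.
        _, trim_block, bit = locate(logical_block)
        self._bits(trim_block)[bit // 8] &= ~(1 << (bit % 8)) & 0xFF

    def in_use(self, logical_block):
        _, trim_block, bit = locate(logical_block)
        return bool(self._bits(trim_block)[bit // 8] & (1 << (bit % 8)))
```

with this layout the overhead is one extra block per 32768 data blocks,
and every write or trim has to update one bit in the TRIM block of its
group.
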
4096 bytes = 32768 bits in a block of 4096 bytes, so after every 32768
blocks we have a TRIM block
note that the TRIM block (bit) can itself be in a badblock hehehe =P
check this for more ideas:
http://en.wikipedia.org/wiki/TRIM_%28SSD_command%29
http://t13.org/Documents/UploadedDocuments/docs2008/e07154r6-Data_Set_Management_Proposal_for_ATA-ACS2.doc

when should the filesystem know that a block has a problem (a badblock)?
when all disks have the badblock and it can't be reallocated, in other
words, when the device gets smaller (with less space)

2011/2/18 Keld Jørn Simonsen <keld@xxxxxxxxxx>:
> On Fri, Feb 18, 2011 at 01:13:32AM +0100, Giovanni Tessore wrote:
>> On 02/17/2011 04:44 PM, Keld Jørn Simonsen wrote:
>> >It should be possible to run a periodic check of whether any bad sectors have
>> >occurred in an array. Then the half-damaged file should be moved away from
>> >this area with the bad block by copying it and relinking it, and before
>> >relinking it to the proper place the good block corresponding to the bad
>> >block should be marked as a corresponding good block on the healthy disk
>> >drive, so that it will not be allocated again. This action could even be
>> >triggered by the event of the detection of the bad block. This would
>> >probably mean that there needs to be a system call to mark a
>> >corresponding good block. The whole thing should be able to run in
>> >userland and somewhat independent of the file system type, except for
>> >the lookup of the corresponding file from a damaged block.
>>
>> I don't follow this.. if a file has some damaged blocks, they are gone,
>> moving it elsewhere does not help.
>
> Remember the file is in a RAID. So you can lose one disk drive and your
> data is still intact.
>
>> And however, this is a task of the filesystem.
>
> No, it is the task of the raid, as it is the raid that gives the
> functionality that you can lose a drive and still have your data intact.
> The raid level knows what is lost, and what is still good, and where
> this stuff is.
>
> If we are then operating on the file level, then doing something clever could
> be a cooperation between the raid level and the filesystem level, as
> described above.
>
>
>> md is just a block device (more reliable than a single disk due to some
>> level of redundancy), and it should be independent from the kind of file
>> system on it (as the file system should be independent from the kind of
>> block device it resides on [md, hd, flash, iscsi, ...]).
>
> true
>
>> Then what you suggest should be done for every block device that can
>> have bad blocks (that is, every block device). Again, this is a
>> filesystem issue. And of which file system type, as there are many?
>
> yes, it is a cooperation between the file system layer, and the raid
> layer, I propose this be done in userland.
>
>> The Bad Block Log allows md to behave 'like' a real hard disk would do
>> with smart data:
>> - unreadable blocks/stripes are recorded into the log, as unreadable
>> sectors are recorded into smart data
>> - unrecoverable read errors are reported to the caller for both
>> - the device still works if it has unrecoverable read errors for both
>> (now the whole md device fails, this is the problem)
>> - if a block/stripe is rewritten with success the block/stripe is
>> removed from the Bad Block Log (and the counter of relocated blocks/stripes
>> is incremented); as if a sector is rewritten with success on a disk the
>> sector is removed from the list of unreadable sectors, and the counter of
>> relocated sectors is incremented (smart data)
>
> Smart drives also reallocate bad blocks, hiding the errors from the SW
> level.
>
>> A filesystem on a disk does not know what the firmware of the disk does
>> about sector relocation.
>> The same applies for a hardware (not fake) raid controller firmware.
>> The same should apply for md. It is transparent to the filesystem.
>
> Yes, normally the raid layer and the fs layer are independent.
>
> But you can add better recovery with what I suggest.
>
>> IMHO a more interesting issue would be: a write error occurs on a disk
>> participating in an already degraded array; failing the disk would fail
>> the whole array. What to do? Put the array into read only mode, still
>> allowing read access to data on it for easy backup? In such a situation,
>> what would a hardware raid controller do?
>>
>> Hm, yes.... how do hardware raid controllers behave with uncorrectable
>> read errors?
>> And how do they behave with a write error on a disk of an already degraded array?
>> I guess md should replicate these behaviours.
>
> I think we should be more intelligent than ordinary HW RAID:-)
>
> Best regards
> keld
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

--
Roberto Spadim
Spadim Technology / SPAEmpresarial