Excerpts from Jan Schmidt's message of 2011-03-17 10:46:43 -0400:
> Hello everyone,
>
> Currently, btrfs has its own raid1 but no repair mechanism for bad
> checksums or EIOs. While trying to implement such a repair mechanism,
> several more or less general questions came up.
>
> There are two different retry paths for data and metadata. (If you know
> or don't care how btrfs handles read errors: goto questions)

You should talk with Ilya, who is working on replacing failed raid
drives as well.

> The data path: btrfs_io_failed_hook is called for each failed bio (EIO
> or checksum error). Currently, it does not know which mirror failed at
> first, because normally btrfs_map_block is called with mirror_num=0,
> leading to a path where find_live_mirror picks one of them. The error
> recovery strategy is then to explicitly read available mirrors one
> after the other until one succeeds. In case the very first read picked
> mirror 1 and failed, the retry code will most likely fail at mirror 1
> as well. It would be nice to know which mirror was picked formerly and
> directly try the other.

Agree with Josef here, change the code to record which one was used.

The current bio submission stuff only keeps the btrfs_multi_bio struct
around when a given IO spans more than one disk. But you can easily
change it to keep the struct around for all IOs.

> The metadata path: there is no failure hook, instead there is a loop in
> btree_read_extent_buffer_pages, also starting off at mirror_num=0,
> which again leaves the decision to find_live_mirror. If there is an
> error for any page to be read, the same retry strategy is used as is in
> the data path. This obviously might leave you alone with unreadable
> data (consider page x is bad on mirror 1 and page x+1 is bad on mirror
> 2, both belonging to the same extent, you lose). It would be nice to
> have a mechanism at a lower level issuing page-sized retries. Of
> course, knowing which mirror is bad before trying mirror 1 again is
> desirable as well.

Currently the block size is always smaller than the stripe size. But
you have a good point.

> questions:
> I have a raid1 repair solution in mind (partially coded) for btrfs that
> can be implemented quite easily. However, I have some misgivings. All
> of the following questions would need a "yes" for my solution to stand:
>
> - Is it acceptable to retry reading a block immediately after the disk
>   said it won't work? Or in case of a successful read followed by a
>   checksum error? (Which is already being done right now in btrfs.)

In the initial implementation sure, but long term it's not the best.

> - Is it acceptable to always write both mirrors if one is found to be
>   bad (also consider ssds)?

Sorry, I'd rather not overwrite the copy we know to be good.

> If either of the answers is "no", tracking where the initial read came
> from seems inevitable. Tracking would be very easy if bios came back
> with unmodified values in bi_bdev and bi_sector, which is not the case.

-chris
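
To make the mirror-tracking suggestion concrete, here is a minimal
userspace sketch (not btrfs code; struct read_ctx, read_mirror() and
NUM_MIRRORS are invented for illustration). The real change would carry
the mirror number in the per-bio private data, e.g. by keeping the
btrfs_multi_bio struct around for every IO as described above, so the
failure path can start its retries at a copy other than the one that
just failed:

    /* Sketch only: record which mirror a failed read used, retry the others. */
    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_MIRRORS 2

    /* Per-read context: remembers which mirror the failed attempt used. */
    struct read_ctx {
        int failed_mirror;   /* 1-based mirror that returned EIO/bad csum, 0 = none */
    };

    /* Toy stand-in for the lower layers: mirror 1 has a latent bad copy. */
    static bool read_mirror(int mirror)
    {
        return mirror != 1;
    }

    /* First read: the lower layer picks a mirror; record it on failure. */
    static bool first_read(struct read_ctx *ctx)
    {
        int picked = 1;                /* stand-in for find_live_mirror's choice */

        if (read_mirror(picked))
            return true;
        ctx->failed_mirror = picked;   /* the crucial bit: remember what failed */
        return false;
    }

    /* Retry: skip the mirror that already failed instead of re-reading it. */
    static bool retry_other_mirrors(const struct read_ctx *ctx)
    {
        for (int m = 1; m <= NUM_MIRRORS; m++) {
            if (m == ctx->failed_mirror)
                continue;              /* don't waste the retry on the known-bad copy */
            if (read_mirror(m)) {
                printf("recovered from mirror %d\n", m);
                return true;
            }
        }
        return false;
    }

    int main(void)
    {
        struct read_ctx ctx = { .failed_mirror = 0 };

        if (!first_read(&ctx) && !retry_other_mirrors(&ctx))
            printf("unrecoverable\n");
        return 0;
    }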
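
Similarly, a plain userspace sketch of the page-sized retry idea for
metadata (again invented names: read_page_from_mirror(), PAGES_PER_EB).
With page 1 bad on mirror 1 and page 2 bad on mirror 2, whole-extent
retries lose, while per-page retries still recover the buffer:

    /* Sketch only: per-page retries recover mixed damage across mirrors. */
    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_MIRRORS  2
    #define PAGES_PER_EB 4   /* pages making up one metadata extent buffer */

    /* Toy failure model: page 1 is bad on mirror 1, page 2 is bad on mirror 2. */
    static bool read_page_from_mirror(int page, int mirror)
    {
        if (page == 1 && mirror == 1)
            return false;
        if (page == 2 && mirror == 2)
            return false;
        return true;
    }

    /* Whole-extent retry: every page must come from the same mirror pick,
     * so the mixed-damage case above cannot be recovered. */
    static bool read_eb_whole(int mirror)
    {
        for (int p = 0; p < PAGES_PER_EB; p++)
            if (!read_page_from_mirror(p, mirror))
                return false;
        return true;
    }

    /* Page-sized retry: each page may be satisfied by a different mirror. */
    static bool read_eb_per_page(void)
    {
        for (int p = 0; p < PAGES_PER_EB; p++) {
            bool ok = false;
            for (int m = 1; m <= NUM_MIRRORS && !ok; m++)
                ok = read_page_from_mirror(p, m);
            if (!ok)
                return false;   /* page bad on every mirror: genuinely lost */
        }
        return true;
    }

    int main(void)
    {
        printf("whole-extent retries: %s\n",
               (read_eb_whole(1) || read_eb_whole(2)) ? "recovered" : "lost");
        printf("page-sized retries:   %s\n",
               read_eb_per_page() ? "recovered" : "lost");
        return 0;
    }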