On Mon, Feb 18, 2019 at 06:55:20PM -0800, Darrick J. Wong wrote:
> On Tue, Feb 19, 2019 at 08:31:50AM +1100, Dave Chinner wrote:
> > On Wed, Feb 13, 2019 at 05:50:35PM +0800, Bob Liu wrote:
> > > Motivation:
> > > When an fs data/metadata checksum mismatches, lower block devices
> > > may have other correct copies. e.g. if XFS successfully reads a
> > > metadata buffer off a raid1 but decides that the metadata is
> > > garbage, today it will shut down the entire filesystem without
> > > trying any of the other mirrors. This is a severe loss of service,
> > > and we propose these patches to have XFS try harder to avoid
> > > failure.
> > >
> > > This patchset prototypes the mirror retry idea by:
> > > * Adding @nr_mirrors to struct request_queue, similar to
> > >   blk_queue_nonrot(), so a filesystem can grab the device request
> > >   queue and check the maximum number of mirrors this block device
> > >   has. Helper functions were also added to get/set nr_mirrors.
> > >
> > > * Introducing bi_rd_hint, just like bi_write_hint, except that
> > >   bi_rd_hint is a long bitmap in order to support the stacked
> > >   layer case.
> > >
> > > * Modifying md/raid1 to support this retry feature.
> > >
> > > * Adapting xfs to use this feature.
> > >   If the read verify fails, we loop over the available mirrors and
> > >   retry the read.
> >
> > Why does the filesystem have to iterate every single possible
> > combination of devices that are underneath it?
> >
> > Wouldn't it be much simpler to be able to attach a verifier
> > function to the bio, and have each layer that gets called iterate
> > over all its copies internally until the verifier function passes
> > or all copies are exhausted?
> >
> > This works for stacked mirrors - it can pass the higher layer
> > verifier down as far as necessary. It can work for RAID5/6, too, by
> > having that layer supply its own verifier for reads that verifies
> > parity and can reconstruct on failure; then, once it has
> > reconstructed a valid stripe, it can run the verifier that was
> > supplied to it from above, etc.
> >
> > i.e. I don't see why only filesystems should drive retries or have
> > to be aware of the underlying storage stacking. ISTM that each
> > layer of the storage stack should be able to verify that what has
> > been returned to it is valid independently of the higher layer
> > requirements. The only difference from a caller point of view should
> > be submit_bio(bio); vs submit_bio_verify(bio, verifier_cb_func);
>
> What if, instead of constructing a giant pile of verifier call chains,
> we simply had a return value from ->bi_end_io that would then be
> returned from bio_endio()?

Conceptually it achieves the same thing - getting the high level
verifier status down to the lower layer to say "this copy is bad,
try again" - but I suspect all the bio chaining and cloning done in
the stack makes this much more difficult than it seems.

> Stacked things like dm-linear would have to know how to connect
> the upper endio to the lower endio though. And that could have
> its downsides, too.

Stacking always makes things hard :/

> How long do we tie up resources in the scsi
> layer while upper levels are busy running verification functions...?

I suspect there's a more important issue to worry about: we run the
XFS read verifiers in an async work queue context after collecting
the IO completion status from the bio, rather than running them
directly in the bio->bi_end_io() call chain.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
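
To make the submit_bio_verify(bio, verifier_cb_func) interface quoted
above more concrete, here is a minimal C sketch. It is an assumption,
not code from the patchset: neither bio_verifier_t nor
submit_bio_verify() exists in the kernel, and the xfs_buf wiring is
only illustrative. As Dave notes at the end of the mail, XFS today
runs its read verifiers from a workqueue after IO completion, not in
the bi_end_io path, which this sketch glosses over.

#include <linux/bio.h>
#include "xfs_buf.h"	/* assumes an XFS build context */

/*
 * Hypothetical only: neither bio_verifier_t nor submit_bio_verify()
 * exists in the kernel.  Assumed convention: return 0 for "this copy
 * is good", non-zero for "bad copy, try another one".
 */
typedef int (*bio_verifier_t)(struct bio *bio);
int submit_bio_verify(struct bio *bio, bio_verifier_t verify);

/* Filesystem-supplied verifier: run the buffer's read verifier. */
static int xfs_buf_bio_verify(struct bio *bio)
{
	struct xfs_buf *bp = bio->bi_private;

	bp->b_ops->verify_read(bp);
	return bp->b_error;
}

/*
 * Caller side, e.g. in xfs_buf_ioapply_map(), the only visible
 * difference would be which submission function is called:
 *
 *	- submit_bio(bio);
 *	+ submit_bio_verify(bio, xfs_buf_bio_verify);
 */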
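
On the md/raid1 side, the per-layer retry loop Dave describes could
look roughly like the sketch below. The mddev and master_bio fields of
struct r1bio match the existing raid1 code, but raid1_read_one_copy()
and raid1_complete_read() are invented helper names, and the whole
function is an assumption about control flow rather than the
patchset's actual implementation.

/* Sketch only; assumes it lives in drivers/md/raid1.c. */
static void raid1_read_and_verify(struct r1bio *r1_bio,
				  bio_verifier_t verify)
{
	struct mddev *mddev = r1_bio->mddev;
	int disk;

	for (disk = 0; disk < mddev->raid_disks; disk++) {
		/*
		 * raid1_read_one_copy() is hypothetical: synchronously
		 * read this mirror's copy into the master bio's pages.
		 */
		if (raid1_read_one_copy(r1_bio, disk))
			continue;	/* media error, try next mirror */

		if (!verify || verify(r1_bio->master_bio) == 0) {
			/* hypothetical: complete the read as good */
			raid1_complete_read(r1_bio);
			return;
		}
		/* verifier said "this copy is bad, try again" */
	}

	/* all copies exhausted: fail the read back to the caller */
	bio_io_error(r1_bio->master_bio);
}

A RAID5/6 layer could wrap the same loop around parity checking and
reconstruction, running its own parity verifier first and then the
verifier handed down from above, which is the stacking behaviour Dave
describes in the quoted text.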
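
Darrick's alternative of returning a verdict from ->bi_end_io can be
sketched too, with the caveat Dave raises that bio cloning and
chaining make the real plumbing harder than it looks. Both signature
changes below are assumptions - today ->bi_end_io and bio_endio()
return void - and the real xfs_buf_bio_end_io() in fs/xfs/xfs_buf.c
does not look like this.

/*
 * Assumed signature change: ->bi_end_io returns 0 for "data verified
 * OK" and non-zero for "bad copy, please retry", and bio_endio()
 * passes that value back to the layer that completed the bio.
 */
static int xfs_buf_bio_end_io(struct bio *bio)
{
	struct xfs_buf *bp = bio->bi_private;

	if (bio->bi_status)
		return blk_status_to_errno(bio->bi_status);

	/*
	 * See Dave's caveat above: today this verifier actually runs
	 * from a workqueue, not from the bi_end_io call chain.
	 */
	bp->b_ops->verify_read(bp);
	return bp->b_error;
}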