On Thu 26-01-12 07:17:41, Ric Wheeler wrote:
> On 01/23/2012 07:36 PM, Dave Chinner wrote:
> >On Mon, Jan 23, 2012 at 04:47:09PM -0500, Ted Ts'o wrote:
> >>>The thing is, transient write errors tend to be isolated and go away
> >>>when a retry occurs (think of IO timeouts when multipath failover
> >>>occurs). When non-isolated IO or unrecoverable problems occur (e.g.
> >>>no paths left to fail over onto), critical other metadata reads and
> >>>writes will fail and shut down the filesystem, thereby terminating
> >>>the "try forever" background writeback loop those delayed write
> >>>buffers may be in. So the truth is that "trying forever" on write
> >>>errors can handle a whole class of write IO errors very
> >>>effectively....
> >>So how does XFS decide whether a write should fail and shutdown the
> >>file system, or just "try forever"?
> >The IO dispatcher decides that. If the dispatcher has handed the IO
> >off to the delayed write queue, then failed writes will be tried
> >again. If the caller is catching the IO completion (e.g. sync
> >writes) or attaching a completion callback (journal IO), then the
> >completion context will handle the error appropriately. Journal IO
> >errors tend to shutdown the filesystem on the first error, other
> >contexts may handle the error, retry or shutdown the filesystem
> >depending on their current state when the error occurs.
> >
> >Reads are even more complex, because the dispatch context can be
> >within a transaction and the correct error handling is then
> >dependent on the current state of the transaction....
>
> I think that having retry logic at the file system layer is really
> putting the fix in the wrong place.
>
> Specifically, if we have multipath configured under a file system,
> it is up to the multipath logic to handle the failure (and use
> another path, retry, etc). If we see a failed IO further up the
> stack, it is *really* dead at that point.

  Yes, that makes sense. Only, if my memory serves me well, we do see
transient errors with e.g. iSCSI, so it's not like they don't happen.

> Transient errors on normal drives are also rarely worth re-trying
> since pretty much all modern storage devices have firmware that will
> have done exhaustive retries on a failed write. Definitely not worth
> retrying forever for a normal device.

  Agreed. But we could still be clever enough to write the data /
metadata to a different place.

> At one end of the spectrum, think of a box with dozens of storage
> devices attached (either via SAN or local S-ATA devices). If we are
> doing large, streaming writes, we could get a large amount of memory
> dirtied while writing. If that one device dies and we keep that
> memory in use for the endless retry loop, we have really crippled the
> box which still has multiple happy storage devices and file
> systems....

  I agree that if we ever decide to keep unwriteable data in memory,
the kernel has to have a way to get rid of this data if it needs to.

								Honza
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
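
To make the policy discussed in this thread a bit more concrete, here is a
toy user-space model in Python. It is only a sketch under assumptions: the
names (ToyFS, Context, submit, writeback_pass, reclaim) are hypothetical and
do not correspond to any XFS or ext4 code. It models three things mentioned
above: failed delayed writes are requeued and retried, a journal IO error
shuts the filesystem down (which also ends the retry loop), and unwriteable
buffers can be dropped when memory is needed.

```python
from collections import deque
from enum import Enum, auto


class Context(Enum):
    DELAYED_WRITE = auto()  # background writeback: failed writes are requeued
    SYNC = auto()           # caller waits for completion and sees the error
    JOURNAL = auto()        # journal IO: first error shuts the filesystem down


class ToyFS:
    """Toy model only; none of these names exist in a real filesystem."""

    def __init__(self, device_write):
        self.device_write = device_write  # callable(buf) -> True on success
        self.retry_queue = deque()        # delayed-write buffers awaiting retry
        self.discarded = []               # buffers dropped under memory pressure
        self.shutdown = False

    def submit(self, buf, ctx):
        """Write buf; on failure, handle the error according to ctx."""
        if self.shutdown:
            return False
        if self.device_write(buf):
            return True
        if ctx is Context.JOURNAL:
            self.shutdown = True
            self.retry_queue.clear()      # shutdown terminates the retry loop
        elif ctx is Context.DELAYED_WRITE:
            self.retry_queue.append(buf)  # keep it in memory and try again later
        return False                      # SYNC: the caller just gets the error

    def writeback_pass(self):
        """Retry each queued buffer once; requeue the ones that still fail."""
        for _ in range(len(self.retry_queue)):
            buf = self.retry_queue.popleft()
            if not self.device_write(buf):
                self.retry_queue.append(buf)

    def reclaim(self, max_buffers):
        """Drop the oldest unwriteable buffers until at most max_buffers remain."""
        while len(self.retry_queue) > max_buffers:
            self.discarded.append(self.retry_queue.popleft())


# Demo: buffer "a" fails once (transient error), buffer "b" always fails.
transient_failures = {"a": 1}


def flaky_device(buf):
    if buf == "b":
        return False
    if transient_failures.get(buf, 0) > 0:
        transient_failures[buf] -= 1
        return False
    return True


fs = ToyFS(flaky_device)
fs.submit("a", Context.DELAYED_WRITE)
fs.submit("b", Context.DELAYED_WRITE)
fs.writeback_pass()                 # "a" succeeds on retry, "b" stays queued
queued_after_retry = list(fs.retry_queue)
fs.reclaim(max_buffers=0)           # memory pressure: give up on "b"
journal_ok = fs.submit("b", Context.JOURNAL)
```

In this model the retry loop is unbounded on its own, so reclaim() is the
only thing that stops a dead device from pinning memory, which is the
concern raised in the last quoted paragraph.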