Hi Coly Li--

On 02/27/2018 08:55 AM, Coly Li wrote:
> When too many I/Os failed on cache device, bch_cache_set_error() is called
> in the error handling code path to retire whole problematic cache set. If
> new I/O requests continue to come and take refcount dc->count, the cache
> set won't be retired immediately, this is a problem.
>
> Further more, there are several kernel thread and self-armed kernel work
> may still running after bch_cache_set_error() is called. It needs to wait
> quite a while for them to stop, or they won't stop at all. They also
> prevent the cache set from being retired.

It's too bad this is necessary-- I wish the IO layer could latch error
for us in some kind of meaningful way instead of us having to do it
ourselves (and for filesystems, etc, having to each do similar things to
prevent just continuously hitting IO timeouts).

That said, the code looks good.

Reviewed-by: Michael Lyle <mlyle@xxxxxxxx>