Re: Dirty data loss after cache disk error recovery

Coly Li <colyli@xxxxxxx> · Fri, 7 May 2021 20:11:50 +0800

On 4/29/21 2:51 AM, Kai Krakow wrote:
>> I think this behavior was introduced by https://lwn.net/Articles/748226/
>>
>> So above is my late review. ;-)
>>
>> (around commit 7e027ca4b534b6b99a7c0471e13ba075ffa3f482 if you cannot
>> access LWN for reasons[tm])
> 
> The problem may actually come from a different code path which retires
> the cache on metadata error:
> 
> commit 804f3c6981f5e4a506a8f14dc284cb218d0659ae
> "bcache: fix cached_dev->count usage for bch_cache_set_error()"
> 
> It probably should consider if there's any dirty data. As a first
> step, it may be sufficient to run a BUG_ON(there_is_dirty_data) (this
> would kill the bcache thread, may not be a good idea) or even freeze
> the system with an unrecoverable error, or at least stop the device to
> prevent any IO with possibly stale data (because retiring throws away
> dirty data). A good solution would be if the "with dirty data" error
> path could somehow force the attached file system into read-only mode,
> maybe by just reporting IO errors when this bdev is accessed through
> bcache.

There is an option to panic the system when cache device failed. It is
in errors file with available options as "unregister" and "panic". This
option is default set to "unregister", if you set it to "panic" then
panic() will be called.

If the cache set is attached, read-only the bcache device does not
prevent the meta data I/O on cache device (when try to cache the reading
data), if the cache device is really disconnected that will be
problematic too.

The "auto" and "always" options are for "unregister" error action. When
I enhance the device failure handling, I don't add new error action, all
my work was to make the "unregister" action work better.

Adding a new "stop" error action IMHO doesn't make things better. When
the cache device is disconnected, it is always risky that some caching
data or meta data is not updated onto cache device. Permit the cache
device to be re-attached to the backing device may introduce "silent
data loss" which might be worse....  It was the reason why I didn't add
new error action for the device failure handling patch set.

Thanks.

Coly Li