Re: Dirty data loss after cache disk error recovery

Kai Krakow <kai@xxxxxxxxxxx> · Wed, 28 Apr 2021 20:51:09 +0200

> I think this behavior was introduced by https://lwn.net/Articles/748226/
>
> So above is my late review. ;-)
>
> (around commit 7e027ca4b534b6b99a7c0471e13ba075ffa3f482 if you cannot
> access LWN for reasons[tm])

The problem may actually come from a different code path which retires
the cache on metadata error:

commit 804f3c6981f5e4a506a8f14dc284cb218d0659ae
"bcache: fix cached_dev->count usage for bch_cache_set_error()"

That path should probably check whether there is any dirty data before
retiring the cache. As a first step, it may be sufficient to add a
BUG_ON(there_is_dirty_data) (this would kill the bcache thread, so it
may not be a good idea), or even to freeze the system with an
unrecoverable error, or at least to stop the device to prevent any IO
on possibly stale data (because retiring throws away dirty data). A
good solution would be for the "with dirty data" error path to somehow
force the attached file system into read-only mode, maybe by just
reporting IO errors whenever this bdev is accessed through bcache.
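
To make the idea concrete, here is a rough user-space model of the
decision I have in mind. It is not bcache code: every name in it
(model_dev, model_cache_error, model_submit_io, the state enum) is made
up for illustration, and the real error path would have to deal with
locking, refcounts and in-flight IO that this sketch ignores.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model only; none of these names exist in bcache. */
enum model_state {
	MODEL_CACHED,	/* normal operation, IO goes through the cache */
	MODEL_DETACHED,	/* cache retired, IO passes through to backing dev */
	MODEL_IO_ERROR,	/* cache lost with dirty data, all IO is refused */
};

struct model_dev {
	enum model_state state;
	bool has_dirty_data;	/* dirty data still only on the cache device */
};

/* Called when the cache device hits an unrecoverable (metadata) error. */
static void model_cache_error(struct model_dev *dev)
{
	assert(dev != NULL);

	if (dev->has_dirty_data)
		/* Backing device is stale: do not expose it. */
		dev->state = MODEL_IO_ERROR;
	else
		/* Nothing is lost, retiring the cache is safe. */
		dev->state = MODEL_DETACHED;
}

/* Returns 0 if IO may proceed, -EIO if it has to fail. */
static int model_submit_io(const struct model_dev *dev)
{
	assert(dev != NULL);

	return dev->state == MODEL_IO_ERROR ? -EIO : 0;
}
```

With a clean cache, model_cache_error() detaches and later IO still
succeeds. With dirty data, every later model_submit_io() returns -EIO,
which is the kind of error a file system would typically react to by
going read-only or shutting down.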

Thanks,
Kai