On 4/29/21 2:51 AM, Kai Krakow wrote: >> I think this behavior was introduced by https://lwn.net/Articles/748226/ >> >> So above is my late review. ;-) >> >> (around commit 7e027ca4b534b6b99a7c0471e13ba075ffa3f482 if you cannot >> access LWN for reasons[tm]) > > The problem may actually come from a different code path which retires > the cache on metadata error: > > commit 804f3c6981f5e4a506a8f14dc284cb218d0659ae > "bcache: fix cached_dev->count usage for bch_cache_set_error()" > > It probably should consider if there's any dirty data. As a first > step, it may be sufficient to run a BUG_ON(there_is_dirty_data) (this > would kill the bcache thread, may not be a good idea) or even freeze > the system with an unrecoverable error, or at least stop the device to > prevent any IO with possibly stale data (because retiring throws away > dirty data). A good solution would be if the "with dirty data" error > path could somehow force the attached file system into read-only mode, > maybe by just reporting IO errors when this bdev is accessed through > bcache. There is an option to panic the system when cache device failed. It is in errors file with available options as "unregister" and "panic". This option is default set to "unregister", if you set it to "panic" then panic() will be called. If the cache set is attached, read-only the bcache device does not prevent the meta data I/O on cache device (when try to cache the reading data), if the cache device is really disconnected that will be problematic too. The "auto" and "always" options are for "unregister" error action. When I enhance the device failure handling, I don't add new error action, all my work was to make the "unregister" action work better. Adding a new "stop" error action IMHO doesn't make things better. When the cache device is disconnected, it is always risky that some caching data or meta data is not updated onto cache device. Permit the cache device to be re-attached to the backing device may introduce "silent data loss" which might be worse.... It was the reason why I didn't add new error action for the device failure handling patch set. Thanks. Coly Li