Hi!

> There is an option to panic the system when cache device failed. It is
> in errors file with available options as "unregister" and "panic". This
> option is default set to "unregister", if you set it to "panic" then
> panic() will be called.

Hmm, okay, I didn't find "panic" documented anywhere. I'll take a look
at it again. If it's missing, I'll create a patch to improve the
documentation.

> If the cache set is attached, read-only the bcache device does not
> prevent the meta data I/O on cache device (when try to cache the reading
> data), if the cache device is really disconnected that will be
> problematic too.

I didn't completely understand the sentence, it seems to be missing a
word. But whatever it is, it's probably true. ;-)

> The "auto" and "always" options are for "unregister" error action. When
> I enhance the device failure handling, I don't add new error action, all
> my work was to make the "unregister" action work better.

But isn't the failure case here that it hits both code paths: the one
that unregisters the device, and the one that then retires the cache?

> Adding a new "stop" error action IMHO doesn't make things better. When
> the cache device is disconnected, it is always risky that some caching
> data or meta data is not updated onto cache device. Permit the cache
> device to be re-attached to the backing device may introduce "silent
> data loss" which might be worse.... It was the reason why I didn't add
> new error action for the device failure handling patch set.

But we are actually seeing silent data loss now: The system f'ed up
somehow and needed a hard reset, and after reboot the bcache device was
accessible in cache mode "none" (because it had been unregistered
before, and because udev just detected it and you can use bcache
without an attached cache in "none" mode). This completely hid the fact
that we lost dirty write-back data. It's not even obvious that
/dev/bcache0 is now detached, in cache mode "none", but accessible
nevertheless.
To me, this is quite clearly "silent data loss", especially since the
unregister action threw the dirty data away.

So this:

> Permit the cache
> device to be re-attached to the backing device may introduce "silent
> data loss" which might be worse....

is actually the situation we are facing currently: The device has been
unregistered, and after reboot udev detects it as a clean backing device
without cache association, using cache mode "none", and it is readable
and writable just fine. It essentially permitted access to the stale
backing device (though it didn't re-attach as you outlined, that's more
or less the same situation).

Maybe devices that become disassociated from a cache due to IO errors
but have dirty data should go to a caching mode "stale", and bcache
should refuse to access such devices or throw away their dirty data
until I decide to force them back online into the cache set or force
discard the dirty data. Then at least I would discover that something
went badly wrong. Otherwise, I may not detect that dirty data wasn't
written. In the best case, that makes my FS unmountable; in the worst
case, some file data is simply lost (aka silent data loss). Either way,
both outcomes are worst-case scenarios.

The whole situation probably comes from udev auto-registering bcache
backing devices again while bcache has no record of why the device was
unregistered: it looks clean after such a situation.

> Sorry I just find this thread from my INBOX. Hope it is not too late.

No worries. ;-)

It was already too late when the dirty cache was discarded, but I have
daily backups. My system is up and running again, but it's probably not
a question of IF it happens again but WHEN it does. So I'd like to
discuss how we can get a cleaner failure mode, because currently it's
just unclean: every status is lost after reboot, the devices look clean,
and the caching mode is simply "none", which is completely fine for the
boot process.

Thanks,
Kai
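
P.S. To make the "stale" idea above a bit more concrete, here is a rough
Python model of the state handling I have in mind. It is only a sketch
for discussion, not bcache code: the "stale" mode does not exist today,
and every name in it (BackingDevice, on_cache_io_error, force_reattach,
force_discard_dirty) is made up for illustration. It also assumes the
mode would be stored persistently on the backing device so that it
survives a reboot.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    NONE = "none"
    WRITEBACK = "writeback"
    STALE = "stale"  # hypothetical: not an existing bcache cache mode


class StaleDeviceError(Exception):
    """I/O was attempted on a device whose dirty data may be lost."""


@dataclass
class BackingDevice:
    # Assumption: both fields live in persistent metadata, so a reboot
    # plus udev auto-registration cannot reset them to a clean state.
    mode: Mode = Mode.WRITEBACK
    has_dirty_data: bool = False

    def on_cache_io_error(self) -> None:
        # Remember *why* the cache went away instead of silently
        # falling back to "none" when dirty data is outstanding.
        self.mode = Mode.STALE if self.has_dirty_data else Mode.NONE

    def submit_io(self) -> str:
        # Refuse all access while stale, so the problem is noticed.
        if self.mode is Mode.STALE:
            raise StaleDeviceError("dirty data outstanding, admin action required")
        return "ok"

    def force_reattach(self) -> None:
        # Explicit admin decision: bring the device back into the cache set.
        if self.mode is Mode.STALE:
            self.mode = Mode.WRITEBACK

    def force_discard_dirty(self) -> None:
        # Explicit admin decision: accept the loss and run without a cache.
        if self.mode is Mode.STALE:
            self.has_dirty_data = False
            self.mode = Mode.NONE
```

The point of the model is only that leaving "stale" always requires an
explicit decision; a device without dirty data still drops to "none" as
it does today.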