Cache Device Failure Expectations

Marc Smith <msmith626@xxxxxxxxx> · Mon, 22 Mar 2021 11:09:17 -0500

Hi,

I'm using bcache on a Linux 5.4.69 kernel, and I'm testing transient
cache device failures with a backing device using 'writeback'
mode, and with several gigabytes of dirty data (data that has not yet
reached the backing device).

In my first test, the cache devices are using the default "unregister"
value for the "errors" sysfs attribute (for bcache cache devices
in /sys/fs/bcache/...). When I induce a cache device failure, the
bcache backing devices stop, the cache device is detached from all
affected backing devices, and I/O errors are returned on subsequent
access attempts to the backing devices. This all behaves as I would
expect given how it's configured.
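
For reference, a minimal Python sketch of reading and setting the
"errors" attribute described above. It assumes the attribute lives at
/sys/fs/bcache/<cache set UUID>/errors and that reads mark the active
choice with brackets (for example "[unregister] panic"). The helper
names are hypothetical, and writing to sysfs requires root.

```python
from pathlib import Path

# Assumed location of bcache cache set directories in sysfs.
SYSFS_BCACHE = Path("/sys/fs/bcache")

# The two values discussed in this post.
VALID_POLICIES = ("unregister", "panic")


def parse_selected(raw):
    """Return the bracketed entry of a list like '[unregister] panic'.

    Falls back to the stripped text if no entry is bracketed.
    """
    for token in raw.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return raw.strip()


def get_error_policy(cset_uuid, root=SYSFS_BCACHE):
    """Read the current error policy of a cache set (hypothetical helper)."""
    return parse_selected((Path(root) / cset_uuid / "errors").read_text())


def set_error_policy(cset_uuid, policy, root=SYSFS_BCACHE):
    """Write a new error policy for a cache set (hypothetical helper)."""
    if policy not in VALID_POLICIES:
        raise ValueError(f"unknown policy: {policy!r}")
    (Path(root) / cset_uuid / "errors").write_text(policy)
```

With the cache set from the log further down, this would be called as
set_error_policy("2f255344-bb44-44b9-930d-90f23b384e9c", "panic").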

The downside to "unregister" is that when I reboot the system (with the
cache block device reinstated/working), the backing devices come up
but with no cache device attached! This certainly causes file
system corruption, since the dirty data never reached the backing
device and the backing device is started without the cache device.
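
One way to guard against this at boot is to check the backing
device's "state" attribute before mounting anything. The sketch below
assumes the attribute is at /sys/block/<bcacheN>/bcache/state and
takes one of the values "no cache", "clean", "dirty" or
"inconsistent". The helper names and the policy of refusing to mount
in the "no cache" and "inconsistent" states are my own assumptions,
not bcache behavior, and the check cannot tell whether dirty data was
already lost.

```python
from pathlib import Path

# Values assumed for a backing device's "state" attribute.
KNOWN_STATES = ("no cache", "clean", "dirty", "inconsistent")


def read_backing_state(bcache_dev, sys_block=Path("/sys/block")):
    """Read the state of a bcache device such as 'bcache0'."""
    path = Path(sys_block) / bcache_dev / "bcache" / "state"
    return path.read_text().strip()


def mount_allowed(state):
    """Hypothetical policy: only mount when a cache is attached.

    'no cache' means the backing device is running without its cache,
    and 'inconsistent' is treated as unsafe as well.
    """
    if state not in KNOWN_STATES:
        raise ValueError(f"unexpected state: {state!r}")
    return state in ("clean", "dirty")
```

A boot script could call mount_allowed(read_backing_state("bcache0"))
and skip the mount when it returns False.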

On the second test run, I set the "errors" sysfs attribute to "panic",
and this works more cleanly, most of the time. When I induce a cache
block device failure, the system panics, but the cache device
stays associated with the backing devices -- and dirty data can then
flush to the backing device. On this second test, when the system
booted back up, one cache device failed to start:
...
[ 333.116149] bcache: prio_read() bad csum reading priorities
[ 333.116151] bcache: prio_read() bad magic reading priorities
[ 333.116636] bcache: bch_cache_set_error() bcache: error on 2f255344-bb44-44b9-930d-90f23b384e9c:
[ 333.116637] corrupted btree at bucket 473, block 44, 504 keys
[ 333.116638] bcache: bch_cache_set_error() , disabling caching
[ 333.116638]
[ 333.116649] bcache: register_cache() error dm-12: failed to run cache set
[ 333.116650] bcache: register_bcache() error : failed to register device
...

This seemed to be a temporary problem -- I rebooted the system again,
and then the bcache cache device started without issue. I did not
check for data loss / corruption in this instance.

A third test run using "panic" mode resulted in everything coming back
up normally, and seemingly operating just fine (no cache/backing
device start errors). I did not check for data loss / corruption in
this instance either.

So, just a couple of questions to solidify my expectations for
this type of transient cache device failure (the cache block device
fails, but later comes back fully intact):
- It sounds like "panic" mode for the "errors" sysfs attribute is the
best way to handle this case, since it does not detach the cache device
from the backing devices. Is that correct?
- Is recovery from a transient cache device failure safe/reliable in
"panic" mode? Obviously it's not a preferred situation, but should I
expect any problems if it occurs? Is it correct that no metadata
corruption on the cache device is expected?

Thanks for your time. Appreciate the great work on bcache!

--Marc