BUG: bcache failing on top of degraded RAID-6

Thorsten Knabe <linux@xxxxxxxxxxxxxxxxx> · Tue, 26 Mar 2019 14:21:39 +0100

Hello,

there seems to be a serious problem when running bcache on top of a
degraded RAID-6 (MD) array. The bcache device /dev/bcache0 disappears
after a few I/O operations on the affected device, and the kernel log
fills up with the following message:

bcache: bch_count_backing_io_errors() md0: IO error on backing device,
unrecoverable

Setup:
Linux kernel: 5.1-rc2, 5.0.4, 4.19.0-0.bpo.2-amd64 (Debian backports)
all affected
bcache backing device: EXT4 filesystem -> /dev/bcache0 -> /dev/md0 ->
/dev/sd[bcde]1
bcache cache device: /dev/sdf1
cache mode: writethrough, none and cache device detached are all
affected; writeback and writearound have not been tested
KVM for testing, first observed on real hardware (failing RAID device)

As long as the RAID6 is healthy, bcache works as expected. Once the
RAID6 gets degraded, for example by removing a drive from the array
(mdadm --fail /dev/md0 /dev/sde1, mdadm --remove /dev/md0 /dev/sde1),
the above-mentioned log messages appear in the kernel log and the bcache
device /dev/bcache0 disappears shortly afterwards logging:

bcache: bch_cached_dev_error() stop bcache0: too many IO errors on
backing device md0

to the kernel log.
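
For anyone trying to reproduce this, the degradation step could be
scripted roughly as follows. This is only a sketch: the function name
degrade_array and the DRY_RUN switch are hypothetical helpers, the
defaults mirror the devices named above, and by default the commands
are only printed rather than executed.

```shell
#!/bin/sh
# Sketch: degrade an MD array by failing and then removing one member.
# degrade_array and DRY_RUN are illustrative names, not part of mdadm.
# DRY_RUN defaults to 1, so the mdadm commands are printed, not run.
degrade_array() {
    array=${1:-/dev/md0}      # array from the report
    member=${2:-/dev/sde1}    # member device from the report
    for action in --fail --remove; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "mdadm $action $array $member"
        else
            # Needs root; stop at the first failing step.
            mdadm "$action" "$array" "$member" || return 1
        fi
    done
}
```

Calling "degrade_array /dev/md0 /dev/sde1" prints the two mdadm
commands; with DRY_RUN=0 and root privileges it runs them. After that,
some I/O on /dev/bcache0 should be needed to trigger the messages.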

After increasing /sys/block/bcache0/bcache/io_error_limit to a very high
value (1073741824), the bcache device /dev/bcache0 remains usable without
any noticeable filesystem corruption.
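
The workaround above amounts to a single write to sysfs. A minimal
sketch, assuming the device is bcache0 and the script runs as root;
set_io_error_limit is a hypothetical helper name, and the path and
value are the ones quoted above:

```shell
#!/bin/sh
# Sketch: raise bcache's backing-device I/O error limit through sysfs.
# set_io_error_limit is an illustrative name; the default path and
# value are taken from the report.
set_io_error_limit() {
    limit_file=${1:-/sys/block/bcache0/bcache/io_error_limit}
    new_limit=${2:-1073741824}
    if [ ! -w "$limit_file" ]; then
        echo "cannot write $limit_file (not root, or no such device?)" >&2
        return 1
    fi
    echo "old limit: $(cat "$limit_file")"
    echo "$new_limit" > "$limit_file"
    echo "new limit: $(cat "$limit_file")"
}
```

Running "set_io_error_limit" with no arguments applies the value used
in this report. This only hides the symptom: the backing-device errors
are still counted, they just no longer reach the limit.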

Thanks
Thorsten

-- 
___              
 |        | /                 E-Mail: linux@xxxxxxxxxxxxxxxxx 
 |horsten |/\nabe                WWW: http://linux.thorsten-knabe.de