Re: BUG: bcache failing on top of degraded RAID-6

On 5/7/19 3:07 PM, Coly Li wrote:
> On 2019/5/7 9:01 PM, Thorsten Knabe wrote:
>> On 5/7/19 2:23 PM, Coly Li wrote:
>>> On 2019/5/7 8:19 PM, Thorsten Knabe wrote:
>>>> On 3/27/19 2:45 PM, Coly Li wrote:
>>>>> On 2019/3/27 9:42 PM, Thorsten Knabe wrote:
>>>>>> On 3/27/19 12:53 PM, Coly Li wrote:
>>>>>>> On 2019/3/27 7:00 PM, Thorsten Knabe wrote:
>>>>>>>> On 3/27/19 10:44 AM, Coly Li wrote:
>>>>>>>>> On 2019/3/26 9:21 PM, Thorsten Knabe wrote:
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> there seems to be a serious problem when running bcache on top of a
>>>>>>>>>> degraded RAID-6 (MD) array. The bcache device /dev/bcache0 disappears
>>>>>>>>>> after a few I/O operations on the affected device and the kernel log
>>>>>>>>>> gets filled with the following log message:
>>>>>>>>>>
>>>>>>>>>> bcache: bch_count_backing_io_errors() md0: IO error on backing device,
>>>>>>>>>> unrecoverable
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> It seems an I/O request to the backing device failed. If the md raid6
>>>>>>>>> device is the backing device, does it go into read-only mode after it
>>>>>>>>> degrades?
>>>>>>>>
>>>>>>>> No, the RAID6 backing device is still in read-write mode after the disk
>>>>>>>> has been removed from the RAID array. That's the way RAID6 is supposed
>>>>>>>> to work.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Setup:
>>>>>>>>>> Linux kernel: 5.1-rc2, 5.0.4, 4.19.0-0.bpo.2-amd64 (Debian backports)
>>>>>>>>>> all affected
>>>>>>>>>> bcache backing device: EXT4 filesystem -> /dev/bcache0 -> /dev/md0 ->
>>>>>>>>>> /dev/sd[bcde]1
>>>>>>>>>> bcache cache device: /dev/sdf1
>>>>>>>>>> cache mode: writethrough, none, and cache device detached are all
>>>>>>>>>> affected; writeback and writearound have not been tested
>>>>>>>>>> tested under KVM, first observed on real hardware (failing RAID device)
>>>>>>>>>>
>>>>>>>>>> As long as the RAID6 is healthy, bcache works as expected. Once the
>>>>>>>>>> RAID6 gets degraded, for example by removing a drive from the array
>>>>>>>>>> (mdadm --fail /dev/md0 /dev/sde1, mdadm --remove /dev/md0 /dev/sde1),
>>>>>>>>>> the above-mentioned log messages appear in the kernel log, and the bcache
>>>>>>>>>> device /dev/bcache0 disappears shortly afterwards, logging:
>>>>>>>>>>
>>>>>>>>>> bcache: bch_cached_dev_error() stop bcache0: too many IO errors on
>>>>>>>>>> backing device md0
>>>>>>>>>>
>>>>>>>>>> to the kernel log.
>>>>>>>>>>
>>>>>>>>>> After increasing /sys/block/bcache0/bcache/io_error_limit to a very high
>>>>>>>>>> value (1073741824), the bcache device /dev/bcache0 remains usable without
>>>>>>>>>> any noticeable filesystem corruption.
>>>>>>>>>
>>>>>>>>> If the backing device goes into read-only mode, bcache treats that
>>>>>>>>> backing device as failed. The behavior is to stop the bcache device on
>>>>>>>>> top of the failed backing device, to notify the upper layers that
>>>>>>>>> something has gone wrong.
>>>>>>>>>
>>>>>>>>> In writethrough and writeback mode, bcache requires the backing device
>>>>>>>>> to be writable.
>>>>>>>>
>>>>>>>> But the degraded RAID6 device (one disk of the array missing) is still
>>>>>>>> writable.
>>>>>>>>
>>>>>>>> Also, after raising the io_error_limit of the bcache device to a very
>>>>>>>> high value (1073741824 in my tests), I can use the bcache device on the
>>>>>>>> degraded RAID6 array for hours, reading and writing gigabytes of data,
>>>>>>>> without getting any I/O errors or observing any filesystem corruption.
>>>>>>>> I'm just getting a lot of these
>>>>>>>>
>>>>>>>> bcache: bch_count_backing_io_errors() md0: IO error on backing device,
>>>>>>>> unrecoverable
>>>>>>>>
>>>>>>>> messages in the kernel log.
>>>>>>>>
>>>>>>>> It seems that I/O requests for data that has been successfully
>>>>>>>> recovered by the RAID6 from the redundant information stored on the
>>>>>>>> remaining disks are accidentally counted as failed I/O requests, and
>>>>>>>> when the configured io_error_limit for the bcache device is reached,
>>>>>>>> the bcache device gets stopped.
>>>>>>> Oh, thanks for the information.
>>>>>>>
>>>>>>> It sounds like while the md raid6 array is degrading and recovering, some
>>>>>>> I/O from bcache might fail, and after the md raid6 array degrades and
>>>>>>> recovers, the md device continues to serve I/O requests. Am I right?
>>>>>>>
>>>>>>
>>>>>> I think the I/O errors logged by bcache are not real I/O errors,
>>>>>> because the filesystem on top of the bcache device does not report any
>>>>>> I/O errors unless the bcache device gets stopped by bcache due to too
>>>>>> many errors (io_error_limit reached).
>>>>>>
>>>>>> I performed the following test:
>>>>>>
>>>>>> Starting with bcache on a healthy RAID6 with 4 disks (all attached and
>>>>>> completely synced). cache_mode set to "none" to ensure data is read from
>>>>>> the backing device. EXT4 filesystem on top of bcache, mounted with two
>>>>>> identical directories each containing 4GB of data, on a system with 2GB
>>>>>> of RAM to ensure data is not coming from the page cache. "diff -r dir1
>>>>>> dir2" running in a loop to check for inconsistencies. Also,
>>>>>> io_error_limit was raised to 1073741824 to ensure the bcache device
>>>>>> does not get stopped due to too many I/O errors during the test.
>>>>>>
>>>>>> As long as all 4 disks are attached to the RAID6 array, no messages get
>>>>>> logged.
>>>>>>
>>>>>> Once one disk is removed from the RAID6 array using
>>>>>>   mdadm --fail /dev/md0 /dev/sde1
>>>>>> the kernel log gets filled with the
>>>>>>
>>>>>> bcache: bch_count_backing_io_errors() md0: IO error on backing device,
>>>>>> unrecoverable
>>>>>>
>>>>>> messages. However, the EXT4 filesystem does not log any corruption, nor
>>>>>> does the diff comparing the two directories report any inconsistencies.
>>>>>>
>>>>>> After adding the previously removed disk back to the RAID6 array, bcache
>>>>>> stops reporting the above-mentioned error message once the re-added disk
>>>>>> is fully synced and the RAID6 array is healthy again.
>>>>>>
>>>>>> If the I/O requests to the RAID6 device actually failed, I would expect
>>>>>> to see either EXT4 filesystem errors in the logs or at least diff
>>>>>> reporting differences, but nothing gets logged in the kernel log except
>>>>>> the above-mentioned message from bcache.
>>>>>>
>>>>>> It seems bcache mistakenly classifies, or at least counts, some I/O
>>>>>> requests as failed although they have not actually failed.
>>>>>>
>>>>>> By the way, Linux 4.9 (from Debian stable) is most probably not affected.
>>>>> Hi Thorsten,
>>>>>
>>>>> Let me try to reproduce this and look into it. I will ask you for more
>>>>> information later.
>>>>>
>>>>> Very informative, thanks.
>>>>>
>>>>
>>>> Hello Coly.
>>>>
>>>> I'm now running Linux 5.1 and still see the errors described above.
>>>>
>>>> I did some further investigation myself.
>>>>
>>>> The affected bios have the bi_status field set to 10 (=BLK_STS_IOERR)
>>>> and the bi_opf field set to 524288 (=REQ_RAHEAD).
>>>>
>>>> According to the comment in linux/blk_types.h, such requests may fail at
>>>> any time. Quote:
>>>> 	__REQ_RAHEAD,           /* read ahead, can fail anytime */
>>>>
>>>> That would explain why no filesystem errors or corruption occur,
>>>> although bcache reports I/O errors from the backing device.
>>>>
>>>> Thus I assume errors resulting from such read-ahead bio requests should
>>>> not be counted/ignored by bcache.
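>>>>
>>>> To make those numbers concrete (my own illustration, not code taken
>>>> from the bcache tree): __REQ_RAHEAD is bit 19 of bi_opf, so a plain
>>>> read with only that flag set shows up as 1 << 19 = 524288:
>>>>
>>>> 	#include <linux/blk_types.h>
>>>>
>>>> 	/* hypothetical helper: true if this bio is read-ahead and is
>>>> 	 * therefore allowed to fail at any time */
>>>> 	static inline bool bio_is_readahead(const struct bio *bio)
>>>> 	{
>>>> 		return bio->bi_opf & REQ_RAHEAD;	/* 1 << 19 == 524288 */
>>>> 	}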
>>>
>>> Hi Thorsten,
>>>
>>> For read-ahead bio failures, do you mean they should not be counted, or
>>> that they should not be ignored?
>>>
>>> Thanks.
>>>
>>>
>>
>> I'm far from being a Linux block I/O subsystem expert.
>> My assumption is that a block device has the option to fail read-ahead
>> bio requests under certain circumstances, for example if retrieving the
>> requested sectors is too expensive, and that the MD RAID6 code makes use
>> of that option when the RAID array is in a degraded state. But I'm just
>> guessing.
>>
>> I'm not sure how such errors are handled correctly. Probably they can
>> simply be ignored completely, but they should at least not contribute to
>> the bcache error counter (dc->io_errors).
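>>
>> Purely to illustrate that guess (a hypothetical sketch, not the actual
>> md code, which I have not checked), a driver's submission path could
>> drop read-ahead cheaply by completing the bio with an error right away:
>>
>> 	#include <linux/blkdev.h>
>>
>> 	/* hypothetical: whether the array is currently degraded */
>> 	static bool array_degraded;
>>
>> 	/* sketch of a make_request path failing read-ahead while degraded */
>> 	static blk_qc_t example_make_request(struct request_queue *q,
>> 					     struct bio *bio)
>> 	{
>> 		if (array_degraded && (bio->bi_opf & REQ_RAHEAD)) {
>> 			bio_io_error(bio); /* completes bio with BLK_STS_IOERR */
>> 			return BLK_QC_T_NONE;
>> 		}
>> 		/* ... otherwise perform the real (possibly expensive) read ... */
>> 		return BLK_QC_T_NONE;
>> 	}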
> 
> Hi Thorsten,
> 
> As you said, such errors "should at least not contribute to the bcache
> error counter (dc->io_errors)". The challenge is that I need a method
> to distinguish a real device I/O failure from an md raid6 degraded-mode
> failure. So far I have no idea how to do that.

Maybe:

--- linux-5.1/drivers/md/bcache/io.c-orig	2019-05-07 15:34:23.283543872 +0200
+++ linux-5.1/drivers/md/bcache/io.c	2019-05-07 15:36:11.133543872 +0200
@@ -58,6 +58,8 @@ void bch_count_backing_io_errors(struct
 
 	WARN_ONCE(!dc, "NULL pointer of struct cached_dev");
 
+	if (bio && (bio->bi_opf & REQ_RAHEAD))
+		return;
 	errors = atomic_add_return(1, &dc->io_errors);
 	if (errors < dc->error_limit)
 		pr_err("%s: IO error on backing device, unrecoverable",
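
For context, the check would sit roughly like this in the complete
function (reconstructed from my reading of the 5.1 source, so details
such as the device-name handling may differ):

	void bch_count_backing_io_errors(struct cached_dev *dc, struct bio *bio)
	{
		unsigned int errors;

		WARN_ONCE(!dc, "NULL pointer of struct cached_dev");

		/*
		 * Read-ahead bios are marked "can fail anytime" in
		 * blk_types.h; a failure is not evidence of a broken
		 * backing device, so do not count it in dc->io_errors.
		 */
		if (bio && (bio->bi_opf & REQ_RAHEAD))
			return;

		errors = atomic_add_return(1, &dc->io_errors);
		if (errors < dc->error_limit)
			pr_err("%s: IO error on backing device, unrecoverable",
			       dc->backing_dev_name);
		else
			bch_cached_dev_error(dc);
	}

Instead of returning silently, a ratelimited warning before the early
return might be worth considering, so that dropped read-ahead failures
remain visible in the log.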
-- 
___
 |        | /                 E-Mail: linux@xxxxxxxxxxxxxxxxx
 |horsten |/\nabe                WWW: http://linux.thorsten-knabe.de


