Re: BUG: bcache failing on top of degraded RAID-6

Thorsten Knabe <linux@xxxxxxxxxxxxxxxxx> · Tue, 7 May 2019 15:01:24 +0200

On 5/7/19 2:23 PM, Coly Li wrote:
> On 2019/5/7 8:19 PM, Thorsten Knabe wrote:
>> On 3/27/19 2:45 PM, Coly Li wrote:
>>> On 2019/3/27 9:42 PM, Thorsten Knabe wrote:
>>>> On 3/27/19 12:53 PM, Coly Li wrote:
>>>>> On 2019/3/27 7:00 PM, Thorsten Knabe wrote:
>>>>>> On 3/27/19 10:44 AM, Coly Li wrote:
>>>>>>> On 2019/3/26 9:21 PM, Thorsten Knabe wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> there seems to be a serious problem when running bcache on top of a
>>>>>>>> degraded RAID-6 (MD) array. The bcache device /dev/bcache0 disappears
>>>>>>>> after a few I/O operations on the affected device and the kernel log
>>>>>>>> gets filled with the following log message:
>>>>>>>>
>>>>>>>> bcache: bch_count_backing_io_errors() md0: IO error on backing device,
>>>>>>>> unrecoverable
>>>>>>>>
>>>>>>>
>>>>>>> It seems an I/O request to the backing device failed. If the md raid6
>>>>>>> device is the backing device, does it go into read-only mode after it
>>>>>>> becomes degraded?
>>>>>>
>>>>>> No, the RAID6 backing device is still in read-write mode after the disk
>>>>>> has been removed from the RAID array. That's the way RAID6 is supposed
>>>>>> to work.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Setup:
>>>>>>>> Linux kernel: 5.1-rc2, 5.0.4, 4.19.0-0.bpo.2-amd64 (Debian backports)
>>>>>>>> all affected
>>>>>>>> bcache backing device: EXT4 filesystem -> /dev/bcache0 -> /dev/md0 ->
>>>>>>>> /dev/sd[bcde]1
>>>>>>>> bcache cache device: /dev/sdf1
>>>>>>>> cache mode: writethrough, none and cache device detached are all
>>>>>>>> affected, writeback and writearound have not been tested
>>>>>>>> KVM for testing, first observed on real hardware (failing RAID device)
>>>>>>>>
>>>>>>>> As long as the RAID6 is healthy, bcache works as expected. Once the
>>>>>>>> RAID6 gets degraded, for example by removing a drive from the array
>>>>>>>> (mdadm --fail /dev/md0 /dev/sde1, mdadm --remove /dev/md0 /dev/sde1),
>>>>>>>> the above-mentioned log messages appear in the kernel log and the bcache
>>>>>>>> device /dev/bcache0 disappears shortly afterwards logging:
>>>>>>>>
>>>>>>>> bcache: bch_cached_dev_error() stop bcache0: too many IO errors on
>>>>>>>> backing device md0
>>>>>>>>
>>>>>>>> to the kernel log.
>>>>>>>>
>>>>>>>> After increasing /sys/block/bcache0/bcache/io_error_limit to a very high
>>>>>>>> value (1073741824), the bcache device /dev/bcache0 remains usable without
>>>>>>>> any noticeable filesystem corruption.
>>>>>>>
>>>>>>> If the backing device goes into read-only mode, bcache treats the
>>>>>>> backing device as failed. It then stops the bcache device that sits on
>>>>>>> the failed backing device, to notify the upper layers that something
>>>>>>> went wrong.
>>>>>>>
>>>>>>> In writethrough and writeback mode, bcache requires the backing device to
>>>>>>> be writable.
>>>>>>
>>>>>> But, the degraded (one disk of the array missing) RAID6 device is still
>>>>>> writable.
>>>>>>
>>>>>> Also after raising the io_error_limit of the bcache device to a very
>>>>>> high value (1073741824 in my tests) I can use the bcache device on the
>>>>>> degraded RAID6 array for hours reading and writing gigabytes of data,
>>>>>> without getting any I/O errors or observing any filesystem corruptions.
>>>>>> I'm just getting a lot of those
>>>>>>
>>>>>> bcache: bch_count_backing_io_errors() md0: IO error on backing device,
>>>>>> unrecoverable
>>>>>>
>>>>>> messages in the kernel log.
>>>>>>
>>>>>> It seems that I/O requests for data that have been successfully
>>>>>> recovered by the RAID6 from the redundant information stored on the
>>>>>> additional disks are accidentally counted as failed I/O requests and
>>>>>> when the configured io_error_limit for the bcache device is reached, the
>>>>>> bcache device gets stopped.
>>>>> Oh, thanks for the information.
>>>>>
>>>>> It sounds as if some I/O from bcache might fail while the md raid6 is
>>>>> degrading and recovering, and once that is done, the md device
>>>>> continues to serve I/O requests. Am I right?
>>>>>
>>>>
>>>> I think, the I/O errors logged by bcache are not real I/O errors,
>>>> because the filesystem on top of the bcache device does not report any
>>>> I/O errors unless the bcache device gets stopped by bcache due to too
>>>> many errors (io_error_limit reached).
>>>>
>>>> I performed the following test:
>>>>
>>>> Starting with bcache on a healthy RAID6 with 4 disks (all attached and
>>>> completely synced). cache_mode set to "none" to ensure data is read from
>>>> the backing device. EXT4 filesystem on top of bcache mounted with two
>>>> identical directories each containing 4GB of data on a system with 2GB
>>>> of RAM to ensure data is not coming from the page cache. "diff -r dir1
>>>> dir2" running in a loop to check for inconsistencies. Also
>>>> io_error_limit has been raised to 1073741824 to ensure the bcache device
>>>> does not get stopped due to too many io errors during the test.
>>>>
>>>> As long as all 4 disks are attached to the RAID6 array, no messages get logged.
>>>>
>>>> Once one disk is removed from the RAID6 array using
>>>>   mdadm --fail /dev/md0 /dev/sde1
>>>> the kernel log gets filled with the
>>>>
>>>> bcache: bch_count_backing_io_errors() md0: IO error on backing device,
>>>> unrecoverable
>>>>
>>>> messages. However neither the EXT4 filesystem logs any corruptions nor
>>>> does the diff comparing the two directories report any inconsistencies.
>>>>
>>>> Adding the previously removed disk back to the RAID6 array, bcache stops
>>>> reporting the above-mentioned error message once the re-added disk is
>>>> fully synced and the RAID6 array is healthy again.
>>>>
>>>> If the I/O requests to the RAID6 device actually failed, I would
>>>> expect to see either EXT4 filesystem errors in the logs or at least diff
>>>> reporting differences, but nothing gets logged in the kernel log except
>>>> the above-mentioned message from bcache.
>>>>
>>>> It seems bcache mistakenly classifies or at least counts some I/O
>>>> requests as failed although they have not actually failed.
>>>>
>>>> By the way Linux 4.9 (from Debian stable) is most probably not affected.
>>> Hi Thorsten,
>>>
>>> Let me try to reproduce this and look into it. I will ask you for more
>>> information later.
>>>
>>> Very informative, thanks.
>>>
>>
>> Hello Coly.
>>
>> I'm now running Linux 5.1 and still see the errors described above.
>>
>> I did some further investigations myself.
>>
>> The affected bios have the bi_status field set to 10 (=BLK_STS_IOERR)
>> and the bi_opf field set to 524288 (=REQ_RAHEAD).
>>
>> According to the comment in linux/blk_types.h such requests may fail.
>> Quote from linux/blk_types.h:
>> 	__REQ_RAHEAD,           /* read ahead, can fail anytime */
>>
>> That would explain why no file system errors or corruptions occur,
>> although bcache reports IO errors from the backing device.
>>
>> Thus I assume errors resulting from such read-ahead bio requests should
>> not be counted/ignored by bcache.
> 
> Hi Thorsten,
> 
> Do you mean that read-ahead bio failures should not be counted, or that
> they should not be ignored?
> 
> Thanks.
> 
> 

I'm far from being a Linux block I/O subsystem expert.
My assumption is that a block device has the option to fail read-ahead
bio requests under certain circumstances, for example if retrieving the
requested sectors is too expensive, and that the MD RAID6 code makes use
of that option when the RAID array is in a degraded state. But I'm just
guessing.

I'm not sure what the correct way to handle such errors is. Probably they
can simply be ignored completely, but at the very least they should not
contribute to the bcache error counter (dc->io_errors).
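
To make the idea concrete, here is a minimal userspace sketch of the
check I have in mind. It is not the actual bcache code and not a tested
kernel patch. The names fake_bio, fake_cached_dev, error_limit and
count_backing_io_error() are hypothetical stand-ins I made up for
illustration. The two constants are simply the values I observed on
Linux 5.1.

```c
#include <assert.h>
#include <stdbool.h>

/* Values as observed on the affected bios under Linux 5.1. */
#define BLK_STS_IOERR 10u           /* bi_status of the failed bios      */
#define REQ_RAHEAD    (1u << 19)    /* "read ahead, can fail anytime"    */

/* 1 << 19 is the 524288 seen in bi_opf. */
static_assert(REQ_RAHEAD == 524288u, "REQ_RAHEAD should be bit 19");

/* Hypothetical stand-in for the two struct bio fields that matter here. */
struct fake_bio {
    unsigned int  bi_opf;
    unsigned char bi_status;
};

/* Hypothetical stand-in for the error accounting in struct cached_dev. */
struct fake_cached_dev {
    unsigned int io_errors;
    unsigned int error_limit;
};

/*
 * Account for a completed backing-device bio.
 * Returns true if the error limit has been reached, i.e. the point at
 * which the bcache device would be stopped.
 */
bool count_backing_io_error(struct fake_cached_dev *dc,
                            const struct fake_bio *bio)
{
    if (!bio->bi_status)
        return false;               /* bio completed successfully */

    /* Read-ahead may fail at any time, so do not count it as an error. */
    if (bio->bi_opf & REQ_RAHEAD)
        return false;

    dc->io_errors++;
    return dc->io_errors >= dc->error_limit;
}
```

With a check like this, a failed bio carrying REQ_RAHEAD leaves the
counter untouched, while a failed ordinary read or write is still
counted and can still stop the device once the limit is reached.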

Thorsten

-- 
___
 |        | /                 E-Mail: linux@xxxxxxxxxxxxxxxxx
 |horsten |/\nabe                WWW: http://linux.thorsten-knabe.de