On 2019/5/8 11:58 PM, Thorsten Knabe wrote:
> On 5/7/19 5:05 PM, Coly Li wrote:
>> On 2019/5/7 9:48 PM, Thorsten Knabe wrote:
>>> On 5/7/19 3:07 PM, Coly Li wrote:
>>>> On 2019/5/7 9:01 PM, Thorsten Knabe wrote:
>>>>> On 5/7/19 2:23 PM, Coly Li wrote:
>>>>>> On 2019/5/7 8:19 PM, Thorsten Knabe wrote:
>>>>>>> On 3/27/19 2:45 PM, Coly Li wrote:
>>>>>>>> On 2019/3/27 9:42 PM, Thorsten Knabe wrote:
>>>>>>>>> On 3/27/19 12:53 PM, Coly Li wrote:
>>>>>>>>>> On 2019/3/27 7:00 PM, Thorsten Knabe wrote:
>>>>>>>>>>> On 3/27/19 10:44 AM, Coly Li wrote:
>>>>>>>>>>>> On 2019/3/26 9:21 PM, Thorsten Knabe wrote:
>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>
>>>>>>>>>>>>> there seems to be a serious problem when running bcache on top of a
>>>>>>>>>>>>> degraded RAID-6 (MD) array. The bcache device /dev/bcache0 disappears
>>>>>>>>>>>>> after a few I/O operations on the affected device, and the kernel log
>>>>>>>>>>>>> gets filled with the following message:
>>>>>>>>>>>>>
>>>>>>>>>>>>> bcache: bch_count_backing_io_errors() md0: IO error on backing device,
>>>>>>>>>>>>> unrecoverable
>>>>>>>>>>>>>
>>>>>>>>>>>> It seems the I/O requests issued to the backing device failed. If the
>>>>>>>>>>>> md raid6 device is the backing device, does it go into read-only mode
>>>>>>>>>>>> after it degrades?
>>>>>>>>>>>
>>>>>>>>>>> No, the RAID6 backing device is still in read-write mode after the disk
>>>>>>>>>>> has been removed from the RAID array. That's the way RAID6 is supposed
>>>>>>>>>>> to work.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Setup:
>>>>>>>>>>>>> Linux kernel: 5.1-rc2, 5.0.4, 4.19.0-0.bpo.2-amd64 (Debian backports),
>>>>>>>>>>>>> all affected
>>>>>>>>>>>>> bcache backing device: EXT4 filesystem -> /dev/bcache0 -> /dev/md0 ->
>>>>>>>>>>>>> /dev/sd[bcde]1
>>>>>>>>>>>>> bcache cache device: /dev/sdf1
>>>>>>>>>>>>> cache mode: writethrough, none, and cache device detached are all
>>>>>>>>>>>>> affected; writeback and writearound have not been tested
>>>>>>>>>>>>> KVM for testing, first observed on real hardware (failing RAID device)
>>>>>>>>>>>>>
>>>>>>>>>>>>> As long as the RAID6 is healthy, bcache works as expected. Once the
>>>>>>>>>>>>> RAID6 gets degraded, for example by removing a drive from the array
>>>>>>>>>>>>> (mdadm --fail /dev/md0 /dev/sde1; mdadm --remove /dev/md0 /dev/sde1),
>>>>>>>>>>>>> the above-mentioned messages appear in the kernel log and the bcache
>>>>>>>>>>>>> device /dev/bcache0 disappears shortly afterwards, logging:
>>>>>>>>>>>>>
>>>>>>>>>>>>> bcache: bch_cached_dev_error() stop bcache0: too many IO errors on
>>>>>>>>>>>>> backing device md0
>>>>>>>>>>>>>
>>>>>>>>>>>>> to the kernel log.
>>>>>>>>>>>>>
>>>>>>>>>>>>> After increasing /sys/block/bcache0/bcache/io_error_limit to a very
>>>>>>>>>>>>> high value (1073741824), the bcache device /dev/bcache0 remains usable
>>>>>>>>>>>>> without any noticeable filesystem corruption.
>>>>>>>>>>>>
>>>>>>>>>>>> If the backing device goes into read-only mode, bcache treats the
>>>>>>>>>>>> backing device as failed. The behavior is to stop the bcache device
>>>>>>>>>>>> on top of the failed backing device, to notify the upper layers that
>>>>>>>>>>>> something has gone wrong.
>>>>>>>>>>>>
>>>>>>>>>>>> In writethrough and writeback mode, bcache requires the backing device
>>>>>>>>>>>> to be writable.
>>>>>>>>>>>
>>>>>>>>>>> But the degraded (one disk of the array missing) RAID6 device is still
>>>>>>>>>>> writable.
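[Note: for reference, the accounting path named in these messages. The
sketch below is abridged and reconstructed from the Linux 5.1 hunk quoted
later in this thread; the else-branch calling bch_cached_dev_error() is
inferred from the "stop bcache0: too many IO errors" message above, so
treat it as an assumption. Every failed backing-device bio, whatever its
cause, bumps dc->io_errors toward the sysfs io_error_limit:

    void bch_count_backing_io_errors(struct cached_dev *dc, struct bio *bio)
    {
            unsigned int errors;

            WARN_ONCE(!dc, "NULL pointer of struct cached_dev");

            /* every failed bio is counted, regardless of why it failed */
            errors = atomic_add_return(1, &dc->io_errors);
            if (errors < dc->error_limit)
                    pr_err("%s: IO error on backing device, unrecoverable",
                           dc->backing_dev_name);
            else
                    bch_cached_dev_error(dc); /* stops /dev/bcache0 */
    }
]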
>>>>>>>>>>>
>>>>>>>>>>> Also, after raising the io_error_limit of the bcache device to a very
>>>>>>>>>>> high value (1073741824 in my tests), I can use the bcache device on the
>>>>>>>>>>> degraded RAID6 array for hours, reading and writing gigabytes of data,
>>>>>>>>>>> without getting any I/O errors or observing any filesystem corruption.
>>>>>>>>>>> I'm just getting a lot of those
>>>>>>>>>>>
>>>>>>>>>>> bcache: bch_count_backing_io_errors() md0: IO error on backing device,
>>>>>>>>>>> unrecoverable
>>>>>>>>>>>
>>>>>>>>>>> messages in the kernel log.
>>>>>>>>>>>
>>>>>>>>>>> It seems that I/O requests for data that has been successfully
>>>>>>>>>>> recovered by the RAID6 from the redundant information stored on the
>>>>>>>>>>> additional disks are accidentally counted as failed I/O requests, and
>>>>>>>>>>> when the configured io_error_limit for the bcache device is reached,
>>>>>>>>>>> the bcache device gets stopped.
>>>>>>>>>>
>>>>>>>>>> Oh, thanks for the information.
>>>>>>>>>>
>>>>>>>>>> It sounds like some I/O from bcache might fail while the md raid6
>>>>>>>>>> degrades and recovers, and after the md raid6 has degraded and
>>>>>>>>>> recovered, the md device continues to serve I/O requests. Am I right?
>>>>>>>>>>
>>>>>>>>> I think the I/O errors logged by bcache are not real I/O errors,
>>>>>>>>> because the filesystem on top of the bcache device does not report any
>>>>>>>>> I/O errors unless the bcache device gets stopped by bcache due to too
>>>>>>>>> many errors (io_error_limit reached).
>>>>>>>>>
>>>>>>>>> I performed the following test:
>>>>>>>>>
>>>>>>>>> Starting with bcache on a healthy RAID6 with 4 disks (all attached and
>>>>>>>>> completely synced). cache_mode set to "none" to ensure data is read
>>>>>>>>> from the backing device. EXT4 filesystem on top of bcache, mounted,
>>>>>>>>> with two identical directories each containing 4 GB of data, on a
>>>>>>>>> system with 2 GB of RAM to ensure data is not coming from the page
>>>>>>>>> cache. "diff -r dir1 dir2" running in a loop to check for
>>>>>>>>> inconsistencies. Also, io_error_limit has been raised to 1073741824
>>>>>>>>> to ensure the bcache device does not get stopped due to too many I/O
>>>>>>>>> errors during the test.
>>>>>>>>>
>>>>>>>>> As long as all 4 disks are attached to the RAID6 array, no messages
>>>>>>>>> get logged.
>>>>>>>>>
>>>>>>>>> Once one disk is removed from the RAID6 array using
>>>>>>>>> mdadm --fail /dev/md0 /dev/sde1
>>>>>>>>> the kernel log gets filled with the
>>>>>>>>>
>>>>>>>>> bcache: bch_count_backing_io_errors() md0: IO error on backing device,
>>>>>>>>> unrecoverable
>>>>>>>>>
>>>>>>>>> messages. However, neither does the EXT4 filesystem log any corruption,
>>>>>>>>> nor does the diff comparing the two directories report any
>>>>>>>>> inconsistencies.
>>>>>>>>>
>>>>>>>>> After adding the previously removed disk back to the RAID6 array,
>>>>>>>>> bcache stops reporting the above-mentioned error message once the
>>>>>>>>> re-added disk is fully synced and the RAID6 array is healthy again.
>>>>>>>>>
>>>>>>>>> If the I/O requests to the RAID6 device actually failed, I would
>>>>>>>>> expect to see either EXT4 filesystem errors in the logs or at least
>>>>>>>>> diff reporting differences, but nothing gets logged in the kernel log
>>>>>>>>> except the above-mentioned message from bcache.
>>>>>>>>>
>>>>>>>>> It seems bcache mistakenly classifies, or at least counts, some I/O
>>>>>>>>> requests as failed although they have not actually failed.
>>>>>>>>>
>>>>>>>>> By the way, Linux 4.9 (from Debian stable) is most probably not
>>>>>>>>> affected.
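[Note: a hypothetical way to confirm what is being counted (this helper
is invented for illustration and is not in the kernel tree): dump the
status and flags of each bio reaching the error-accounting path. On a
degraded raid6 it would print status=10 (BLK_STS_IOERR) with bit 19
(0x80000 = 524288, REQ_RAHEAD) set in the flags, matching the findings
reported in the next message:

    #include <linux/bio.h>
    #include <linux/blk_types.h>

    /* Hypothetical debug helper: log status/flags of a failed bio. */
    static void dump_failed_backing_bio(struct bio *bio)
    {
            pr_info("failed backing bio: status=%u opf=0x%x rahead=%u\n",
                    (unsigned int)bio->bi_status,
                    (unsigned int)bio->bi_opf,
                    (unsigned int)!!(bio->bi_opf & REQ_RAHEAD));
    }
]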
>>>>>>>> Hi Thorsten,
>>>>>>>>
>>>>>>>> Let me try to reproduce this and look into it. I will ask you for more
>>>>>>>> information later.
>>>>>>>>
>>>>>>>> Very informative, thanks.
>>>>>>>>
>>>>>>> Hello Coly.
>>>>>>>
>>>>>>> I'm now running Linux 5.1 and still see the errors described above.
>>>>>>>
>>>>>>> I did some further investigation myself.
>>>>>>>
>>>>>>> The affected bios have the bi_status field set to 10 (= BLK_STS_IOERR)
>>>>>>> and the REQ_RAHEAD flag (= 524288) set in the bi_opf field.
>>>>>>>
>>>>>>> According to the comment in linux/blk_types.h, such requests may fail.
>>>>>>> Quote from linux/blk_types.h:
>>>>>>> __REQ_RAHEAD,       /* read ahead, can fail anytime */
>>>>>>>
>>>>>>> That would explain why no filesystem errors or corruption occur,
>>>>>>> although bcache reports I/O errors from the backing device.
>>>>>>>
>>>>>>> Thus I assume errors resulting from such read-ahead bio requests should
>>>>>>> not be counted/ignored by bcache.
>>>>>>
>>>>>> Hi Thorsten,
>>>>>>
>>>>>> Do you mean they should not be counted, or should not be ignored, for a
>>>>>> read-ahead bio failure?
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>> I'm far from being a Linux block I/O subsystem expert.
>>>>> My assumption is that a block device has the option to fail read-ahead
>>>>> bio requests under certain circumstances, for example if retrieving the
>>>>> requested sectors is too expensive, and that the MD RAID6 code makes use
>>>>> of that option when the RAID array is in a degraded state. But I'm just
>>>>> guessing.
>>>>>
>>>>> I'm not sure how such errors are handled correctly; probably they can
>>>>> simply be ignored completely, but they should at least not contribute
>>>>> to the bcache error counter (dc->io_errors).
>>>>
>>>> Hi Thorsten,
>>>>
>>>> As you said, they "should at least not contribute to the bcache error
>>>> counter (dc->io_errors)". The challenge is that I need a method to
>>>> distinguish a real device I/O failure from an md raid6 degraded-mode
>>>> failure. So far I have no idea how to do that.
>>>
>>> Maybe:
>>>
>>> --- linux-5.1/drivers/md/bcache/io.c-orig    2019-05-07 15:34:23.283543872 +0200
>>> +++ linux-5.1/drivers/md/bcache/io.c 2019-05-07 15:36:11.133543872 +0200
>>> @@ -58,6 +58,8 @@ void bch_count_backing_io_errors(struct
>>>
>>>      WARN_ONCE(!dc, "NULL pointer of struct cached_dev");
>>>
>>> +    if (bio && (bio->bi_opf & REQ_RAHEAD))
>>> +        return;
>>>      errors = atomic_add_return(1, &dc->io_errors);
>>>      if (errors < dc->error_limit)
>>>          pr_err("%s: IO error on backing device, unrecoverable",
>>
>
> Hi Coly.
>
>> I cannot do this, because this is real I/O issued to the backing
>> device; if it failed, it means something is really wrong with the
>> backing device.
>
> I have not found a definitive answer or documentation on what the
> REQ_RAHEAD flag is actually used for. However, in my understanding,
> after reading a lot of kernel source, it is used as an indication that
> the bio read request is unimportant for proper operation and may be
> failed by the block device driver, returning BLK_STS_IOERR, if it is
> too expensive or requires too many additional resources.
>
> At least the BTRFS and DRBD code do not take bio I/O errors that are
> marked with the REQ_RAHEAD flag into account in their error counters.
> Thus it is probably okay if such I/O errors with the REQ_RAHEAD flag
> set are not counted as errors by bcache, too.
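[Note: to make the contract described above concrete, a driver honoring
the REQ_RAHEAD semantics may complete a speculative read with an error
instead of doing expensive recovery work. The function below is an
invented illustration of that pattern, not code quoted from
drivers/md/raid5.c:

    #include <linux/bio.h>
    #include <linux/blk_types.h>

    /* If the array is degraded, fail speculative read-ahead right away
     * rather than paying for parity reconstruction; the flag's contract
     * is "read ahead, can fail anytime". */
    static bool reject_readahead_if_degraded(struct bio *bio, bool degraded)
    {
            if (degraded && bio_op(bio) == REQ_OP_READ &&
                (bio->bi_opf & REQ_RAHEAD)) {
                    bio->bi_status = BLK_STS_IOERR;
                    bio_endio(bio);
                    return true;  /* caller skips the rebuild path */
            }
            return false;
    }
]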
>
>> Hmm, if raid6 returned a different error code in bio->bi_status, then
>> we could identify that this is a failure caused by raid degradation,
>> not a real hardware or link failure. But I am not familiar with the
>> raid456 code, so I have no idea how to change the md raid code (I
>> assume you meant md raid6)...
>
> If my assumptions above regarding the REQ_RAHEAD flag are correct, then
> the RAID code is correct, because restoring data from the parity
> information is a relatively expensive operation for read-ahead data
> that is possibly never actually needed.

Hi Thorsten,

Thank you for the informative hint. I agree with your idea; it seems
that ignoring the I/O errors of REQ_RAHEAD bios does not hurt. Let me
think about how to fix it along the lines of your suggestion.

-- 
Coly Li
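[Note: one possible shape of the fix discussed above, sketched against
the 5.1 function quoted earlier. This is an illustration of the approach,
not the final upstream patch; pr_warn_ratelimited() is used here so the
read-ahead failures stay visible without counting toward the stop limit:

    void bch_count_backing_io_errors(struct cached_dev *dc, struct bio *bio)
    {
            unsigned int errors;

            WARN_ONCE(!dc, "NULL pointer of struct cached_dev");

            /* A degraded md raid may legally fail read-ahead, so a
             * REQ_RAHEAD failure is not evidence of a broken backing
             * device; log it rate-limited, but do not count it. */
            if (bio->bi_opf & REQ_RAHEAD) {
                    pr_warn_ratelimited("%s: Read-ahead I/O failed on backing device, ignore",
                                        dc->backing_dev_name);
                    return;
            }

            errors = atomic_add_return(1, &dc->io_errors);
            if (errors < dc->error_limit)
                    pr_err("%s: IO error on backing device, unrecoverable",
                           dc->backing_dev_name);
            else
                    bch_cached_dev_error(dc);
    }
]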