> On Aug 23, 2019, at 10:03 AM, Coly Li <colyli@xxxxxxx> wrote:
>
> On 2019/8/24 12:37 AM, Song Liu wrote:
>> Thanks Coly and Neil.
>>
>>> On Aug 22, 2019, at 5:02 PM, NeilBrown <neilb@xxxxxxxx> wrote:
>>>
>>> On Thu, Aug 22 2019, Coly Li wrote:
>>>
>>>> Hi folks,
>>>>
>>>> First of all, this bug only affects md raid0 devices to which all of
>>>> the following conditions apply:
>>>> 1) Assembled from component disks with different sizes.
>>>> 2) Created and used under a Linux kernel before (and including) Linux
>>>>    v3.12, then upgraded to a kernel after (and including) Linux v3.13.
>>>> 3) New data are written to the md raid0 under the new kernel >= Linux
>>>>    v3.13.
>>>> Such an md raid0 may have an inconsistent sector mapping and
>>>> experience data corruption.
>>>>
>>>> Recently I received a bug report that a customer encountered file
>>>> system corruption after upgrading their kernel from Linux 3.12 to
>>>> 4.4. It turned out to be corruption of the underlying md raid0 after
>>>> the kernel upgrade.
>>>>
>>>> I found it is caused by a sector mapping bug in the md raid0 code in
>>>> and before Linux v3.12. Here is the buggy code piece, copied from
>>>> stable Linux v3.12.74 drivers/md/raid0.c:raid0_make_request(),
>>>>
>>>> 547 sector_offset = bio->bi_sector;
>>>> 548 zone = find_zone(mddev->private, &sector_offset);
>>>> 549 tmp_dev = map_sector(mddev, zone, bio->bi_sector,
>>>> 550                      &sector_offset);
>>>
>>> I don't think this code is buggy. The mapping may not be the mapping
>>> you would expect, but it is the mapping that md/raid0 had always used
>>> up to this time.
>>>
>>>> At line 548, after find_zone() returns, sector_offset has been
>>>> updated to be the offset inside the current zone. Then at line 549
>>>> the third parameter of the map_sector() call should be the updated
>>>> sector_offset, but bio->bi_sector (the original LBA of the md raid0
>>>> device) is used instead. If the raid0 device has *multiple zones*,
>>>> then for every zone except the first one the <dev, sector> pair
>>>> returned by map_sector() is unexpected and wrong.
>>>>
>>>> The buggy code was introduced in Linux v2.6.31 by commit fbb704efb784
>>>> ("md: raid0 :Enables chunk size other than powers of 2.").
>>>> Unfortunately the mistaken mapping calculation also produces a stable
>>>> and unique result, so it worked without obvious problems until commit
>>>> 20d0189b1012 ("block: Introduce new bio_split()") was merged into
>>>> Linux v3.13.
>>>>
>>>> That patch fixed the mistaken mapping with the following lines of
>>>> change,
>>>> 654 - sector_offset = bio->bi_iter.bi_sector;
>>>> 655 - zone = find_zone(mddev->private, &sector_offset);
>>>> 656 - tmp_dev = map_sector(mddev, zone, bio->bi_iter.bi_sector,
>>>> 657 -                      &sector_offset);
>>>>
>>>> 694 + zone = find_zone(mddev->private, &sector);
>>>> 695 + tmp_dev = map_sector(mddev, zone, sector, &sector);
>>>> At line 695 of this patch, the third parameter of the map_sector()
>>>> call is fixed to 'sector'; this is the correct value, which contains
>>>> the sector offset inside the corresponding zone.
>>>
>>> This is buggy because, as you say, the third argument to map_sector
>>> has changed.
>>> Previously it was bio->bi_iter.bi_sector. Now it is 'sector', which
>>> find_zone has just modified.
>>>
>>>> Thus this patch implicitly *changes* the md raid0 on-disk layout. If
>>>> an md raid0 has component disks with *different* sizes, then it
>>>> contains multiple zones. If such a multi-zone raid0 device was
>>>> created before Linux v3.13, all data chunks after the first zone will
>>>> be mapped to different locations in kernels after (and including)
>>>> Linux v3.13.
>>>> The result is, data written to LBAs after the first zone will appear
>>>> corrupted. A worse case is, if the md raid0 has data chunks filled in
>>>> the first zone under Linux v3.12 (or earlier kernels), then is
>>>> upgraded to Linux v3.13 (or later kernels) and more data chunks are
>>>> filled in the second and later zones. Then under neither Linux v3.12
>>>> nor Linux v3.13 is all the data readable; part of it is always
>>>> corrupted.
>>>>
>>>> Currently there is no way to tell whether an md raid0 device is
>>>> mapped with the wrong calculation of kernels before (and including)
>>>> Linux v3.12, or with the correct calculation of kernels after (and
>>>> including) Linux v3.13. If an md raid0 device (containing multiple
>>>> zones) is created and used across these kernel versions, it is
>>>> possible that the two different mapping calculations generate
>>>> different/inconsistent on-disk layouts in different md raid0 zones,
>>>> which results in data corruption.
>>>>
>>>> For our enterprise Linux products we can handle it properly for a
>>>> small number of product kernels. But for upstream and stable kernels,
>>>> I have no idea how to fix this ugly problem in a generic way.
>>>>
>>>> Neil Brown discussed this with me offline; he proposed a temporary
>>>> workaround that only permits assembling md raid0 devices with
>>>> identical component disk sizes, and rejects assembling md raid0
>>>> devices whose component disks have different sizes. We can stop this
>>>> workaround when there is a proper way to fix the problem.
>>>>
>>>> I suggest our developer community work together on a solution; this
>>>> is the motivation for posting this email for your comments.
>>>
>>> There are four separate cases that we need to consider:
>>> - v1.x metadata
>>> - v0.90 metadata
>>> - LVM metadata (via dm-raid)
>>> - no metadata (array created with "mdadm --build").
>>>
>>> For v1.x metadata, I think we can add a new "feature_map" flag.
>>> If this flag isn't set, raid0 with non-uniform device sizes will not
>>> be assembled.
>>> If it is set, then:
>>>   if 'layout' is 0, use the old mapping
>>>   if 'layout' is 1, use the new mapping
>>>
>>> For v0.90 metadata we don't have feature-flags. The gvalid_words
>>> field is unused and always set to zero, so we could start storing
>>> some feature bits there.
>>>
>>> For LVM/dm-raid, I suspect it doesn't support varying-sized devices,
>>> but we would need to check.
>>>
>>> For "no metadata" arrays ... we could possibly just stop supporting
>>> them - I doubt they are used much.
>>
>> So for an existing array, we really cannot tell whether it is broken
>> or not, right? If this is the case, we only need to worry about new
>> arrays.
>>
>> For new arrays, I guess we can only allow v1.x raid0 to have
>> non-uniform device sizes, and use the new feature_map bit.
>>
>> Would this work? If so, we only have 1 case to work on.
>
> It seems v1.2 support started since Linux v2.16, so it may also have
> this problem for multiple zones.
>

For v1.2 metadata, we still need the feature_map bit, meaning this
non-uniform array is safe to assemble. If the array doesn't have this
bit and is of non-uniform size, we refuse to assemble it.

Thanks,
Song
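
For readers following the thread, here is a minimal user-space sketch (not
the kernel code) of the divergence Coly describes above. It keeps only the
member-disk selection of map_sector(): the chunk number is derived from the
third argument, so passing the raw array LBA (pre-v3.13 behaviour) versus
the zone-relative offset rewritten by find_zone() (v3.13 and later) can pick
a different disk in every zone after the first. The two-zone geometry and
the 64 KiB chunk size are invented for illustration only.

/*
 * Minimal sketch, not the kernel code: a made-up two-zone raid0
 * (3 disks, smallest covering 9 chunks of 64 KiB) and only the
 * disk-selection step of map_sector().
 */
#include <stdio.h>

#define CHUNK_SECTS 128ULL                      /* 64 KiB chunks */

struct zone { unsigned long long zone_end; int nb_dev; };

static const struct zone zones[] = {
        { 27 * CHUNK_SECTS, 3 },        /* zone 0: 3 disks x 9 chunks */
        { 43 * CHUNK_SECTS, 2 },        /* zone 1: 2 disks x 8 more   */
};

/* like find_zone(): returns the zone index and rewrites *sector to the
 * offset inside that zone */
static int find_zone(unsigned long long *sector)
{
        for (int i = 0; i < 2; i++)
                if (*sector < zones[i].zone_end) {
                        if (i)
                                *sector -= zones[i - 1].zone_end;
                        return i;
                }
        return -1;
}

/* the disk-picking part of map_sector(): the chunk number comes from
 * the 'sector' argument, so it matters which value the caller passes */
static int map_dev(int zi, unsigned long long sector)
{
        unsigned long long chunk = sector / CHUNK_SECTS;
        return (int)(chunk % zones[zi].nb_dev); /* disk index in zone */
}

int main(void)
{
        unsigned long long lba = 28 * CHUNK_SECTS; /* 2nd chunk of zone 1 */
        unsigned long long off = lba;
        int zi = find_zone(&off);                  /* off is zone-relative now */

        printf("pre-v3.13 (raw LBA passed):    disk %d of zone %d\n",
               map_dev(zi, lba), zi);
        printf("v3.13+    (zone offset passed): disk %d of zone %d\n",
               map_dev(zi, off), zi);
        return 0;
}

On this made-up geometry the two calls land on different disks (disk 0 vs
disk 1) for the same array LBA. As far as the v3.12 code quoted above goes,
the in-chunk and per-device offsets are taken from the zone-relative value
in both old and new kernels, so it is only the choice of member disk that
changes; that alone is enough to scramble every zone after the first once
the kernel is switched.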
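To make the proposed policy concrete, here is a rough sketch of the
assembly-time check Neil and Song describe for v1.x metadata. The identifier
names (MD_FEATURE_RAID0_LAYOUT, RAID0_ORIG_LAYOUT, RAID0_ALT_MULTIZONE_LAYOUT)
and the bit position are placeholders made up for this sketch, not settled
kernel or mdadm names.

/*
 * Sketch of the discussed assembly policy; feature bit name, bit
 * position and enum names are placeholders, not final identifiers.
 */
#include <stdbool.h>
#include <stdio.h>

#define MD_FEATURE_RAID0_LAYOUT (1 << 12)       /* placeholder bit */

enum raid0_layout {
        RAID0_ORIG_LAYOUT = 0,          /* pre-v3.13 mapping */
        RAID0_ALT_MULTIZONE_LAYOUT = 1, /* v3.13+ mapping    */
};

struct sb_v1 {                  /* only the superblock fields we need */
        unsigned int feature_map;
        unsigned int layout;
};

/* do the component devices differ in data size, i.e. multiple zones? */
static bool non_uniform(const unsigned long long *sectors, int n)
{
        for (int i = 1; i < n; i++)
                if (sectors[i] != sectors[0])
                        return true;
        return false;
}

/* returns 0 and sets *layout, or -1 to refuse assembly */
static int raid0_check_layout(const struct sb_v1 *sb,
                              const unsigned long long *sectors, int n,
                              enum raid0_layout *layout)
{
        if (!non_uniform(sectors, n)) {
                /* single zone: old and new mappings are identical */
                *layout = RAID0_ORIG_LAYOUT;
                return 0;
        }
        if (!(sb->feature_map & MD_FEATURE_RAID0_LAYOUT)) {
                fprintf(stderr, "md/raid0: cannot tell which layout this "
                        "multi-zone array uses, refusing to assemble\n");
                return -1;
        }
        *layout = sb->layout ? RAID0_ALT_MULTIZONE_LAYOUT : RAID0_ORIG_LAYOUT;
        return 0;
}

Arrays with equal-size members have a single zone, where the two mappings
agree, so they can always be assembled; only multi-zone arrays without the
new bit would be refused, which matches the temporary workaround Neil
proposed and the rule Song states for v1.2 above.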