On 2019/8/24 12:37 AM, Song Liu wrote:
> Thanks Coly and Neil.
>
>> On Aug 22, 2019, at 5:02 PM, NeilBrown <neilb@xxxxxxxx> wrote:
>>
>> On Thu, Aug 22 2019, Coly Li wrote:
>>
>>> Hi folks,
>>>
>>> First of all: this bug only affects a md raid0 device to which all of
>>> the following conditions apply,
>>> 1) It is assembled from component disks with different sizes.
>>> 2) It was created and used under a Linux kernel before (and including)
>>>    Linux v3.12, then upgraded to a Linux kernel after (and including)
>>>    Linux v3.13.
>>> 3) New data are written to the md raid0 under the new kernel >= Linux
>>>    v3.13.
>>> Then the md raid0 may have inconsistent sector mapping and experience
>>> data corruption.
>>>
>>> Recently I received a bug report that a customer encountered file
>>> system corruption after upgrading their kernel from Linux 3.12 to 4.4.
>>> It turned out to be corruption of the underlying md raid0 after the
>>> kernel upgrade.
>>>
>>> I found it is caused by a sector mapping bug in the md raid0 code of
>>> kernels up to and including Linux v3.12. Here is the buggy code piece,
>>> copied from stable Linux v3.12.74
>>> drivers/md/raid0.c:raid0_make_request(),
>>>
>>> 547         sector_offset = bio->bi_sector;
>>> 548         zone = find_zone(mddev->private, &sector_offset);
>>> 549         tmp_dev = map_sector(mddev, zone, bio->bi_sector,
>>> 550                              &sector_offset);
>>
>> I don't think this code is buggy. The mapping may not be the mapping
>> you would expect, but it is the mapping that md/raid0 had always used up
>> to this time.
>>
>>>
>>> At line 548, after find_zone() returns, sector_offset has been updated
>>> to be the offset inside the current zone. Then at line 549 the third
>>> parameter of the map_sector() call should be the updated sector_offset,
>>> but bio->bi_sector (the original LBA of the md raid0 device) is used
>>> instead. If the raid0 device has *multiple zones*, then for every zone
>>> except the first one, the <dev, sector> mapping pair returned by
>>> map_sector() is unexpected and wrong.
>>>
>>> The buggy code was introduced in Linux v2.6.31 by commit fbb704efb784
>>> ("md: raid0 :Enables chunk size other than powers of 2.").
>>> Unfortunately the mistaken mapping calculation also produces a stable
>>> and unique result, so it worked without obvious problems until commit
>>> 20d0189b1012 ("block: Introduce new bio_split()") was merged into
>>> Linux v3.13.
>>>
>>> This patch fixed the mistaken mapping with the following lines of
>>> change,
>>> 654 -   sector_offset = bio->bi_iter.bi_sector;
>>> 655 -   zone = find_zone(mddev->private, &sector_offset);
>>> 656 -   tmp_dev = map_sector(mddev, zone, bio->bi_iter.bi_sector,
>>> 657 -                        &sector_offset);
>>>
>>> 694 +   zone = find_zone(mddev->private, &sector);
>>> 695 +   tmp_dev = map_sector(mddev, zone, sector, &sector);
>>> At line 695 of this patch, the third parameter of the map_sector()
>>> call is changed to 'sector', which is the correct value: the sector
>>> offset inside the corresponding zone.
>>
>> This is buggy because, as you say, the third argument to map_sector has
>> changed.
>> Previously it was bio->bi_iter.bi_sector. Now it is 'sector' which
>> find_zone has just modified.
>>
>>>
>>> Thus this patch implicitly *changes* the md raid0 on-disk layout. If a
>>> md raid0 has component disks with *different* sizes, then it will
>>> contain multiple zones. If such a multi-zone raid0 device was created
>>> before Linux v3.13, all data chunks after the first zone will be
>>> mapped to different locations by kernels after (and including) Linux
>>> v3.13. The result is that data written to LBAs after the first zone
>>> will be treated as corruption.
>>> A worse case is when the md raid0 has data chunks filled into the
>>> first raid0 zone under Linux v3.12 (or an earlier kernel), then the
>>> system is upgraded to Linux v3.13 (or a later kernel) and more data
>>> chunks are filled into the second and subsequent zones. Then under
>>> neither Linux v3.12 nor Linux v3.13 is the whole mapping consistent;
>>> part of the data is always corrupted.
>>>
>>> Currently there is no way to tell whether a md raid0 device was mapped
>>> with the wrong calculation of kernels before (and including) Linux
>>> v3.12, or with the correct calculation of kernels after (and
>>> including) Linux v3.13. If a md raid0 device containing multiple zones
>>> was created and used across these kernel versions, it is possible that
>>> the different mapping calculations generated different/inconsistent
>>> on-disk layouts in different md raid0 zones, which results in data
>>> corruption.
>>>
>>> For our enterprise Linux products we can handle this properly for a
>>> small number of product kernels. But for upstream and stable kernels,
>>> I have no idea how to fix this ugly problem in a generic way.
>>>
>>> Neil Brown discussed this with me offline; he proposed a temporary
>>> workaround that only permits assembling a md raid0 device whose
>>> component disks all have the same size, and rejects assembling a md
>>> raid0 device with component disks of different sizes. We can drop this
>>> workaround once there is a proper way to fix the problem.
>>>
>>> I suggest our developer community work together on a solution; this is
>>> my motivation for posting this email for your comments.
>>
>> There are four separate cases that we need to consider:
>>  - v1.x metadata
>>  - v0.90 metadata
>>  - LVM metadata (via dm-raid)
>>  - no metadata (array created with "mdadm --build").
>>
>> For v1.x metadata, I think we can add a new "feature_map" flag.
>> If this flag isn't set, raid0 with non-uniform device sizes will not be
>> assembled.
>> If it is set, then:
>>    if 'layout' is 0, use the old mapping
>>    if 'layout' is 1, use the new mapping
>>
>> For v0.90 metadata we don't have feature-flags. The gvalid_words field
>> is unused and always set to zero, so we could start storing some
>> feature bits there.
>>
>> For LVM/dm-raid, I suspect it doesn't support varying
>> sized devices, but we would need to check.
>>
>> For "no metadata" arrays ... we could possibly just stop supporting
>> them - I doubt they are used much.
>
> So for an existing array, we really cannot tell whether it is broken or
> not, right? If this is the case, we only need to worry about new arrays.
>
> For new arrays, I guess we can only allow v1.x raid0 to have non-uniform
> device sizes, and use the new feature_map bit.
>
> Would this work? If so, we only have 1 case to work on.

It seems v1.2 metadata support started back in the Linux v2.6 days, so
existing v1.2 arrays may also have this problem with multiple zones.

--
Coly Li
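
To make the layout change above concrete, here is a small standalone
sketch. It is not the raid0.c code: the disk sizes, the chunk size and the
names CHUNK_SECTS, ZONE0_CHUNKS, ZONE1_NB_DEV, old_dev() and new_dev() are
all invented for illustration, and only the power-of-2 chunk case is
modelled. It assumes, as in map_sector(), that the member disk within a
zone is picked by the chunk number modulo the number of disks in that
zone, and compares the pre-v3.13 choice (chunk number derived from the
whole-array LBA, i.e. bio->bi_sector) with the post-v3.13 choice (chunk
number derived from the in-zone offset that find_zone() writes back) for a
hypothetical three-disk array whose second zone spans two disks:

/*
 * Illustrative userspace model only -- NOT the kernel raid0.c code.
 *
 * Hypothetical array: 3 disks, chunk size 64KiB (128 sectors).
 *   disk0 = 3 chunks, disk1 = 5 chunks, disk2 = 5 chunks
 * Zone 0: 3 disks x 3 chunks = array chunks 0..8
 * Zone 1: 2 disks x 2 chunks = array chunks 9..12
 */
#include <stdio.h>

#define CHUNK_SECTS   128ULL   /* 64KiB chunks, power of 2     */
#define ZONE0_CHUNKS    9ULL   /* array chunks held by zone 0  */
#define ZONE1_NB_DEV    2ULL   /* member disks left in zone 1  */

/*
 * Member-disk index inside zone 1 as computed before v3.13: the chunk
 * number used for the modulo comes from the whole-array LBA, because
 * bio->bi_sector was passed as the third argument of map_sector().
 */
static unsigned long long old_dev(unsigned long long array_sector)
{
	return (array_sector / CHUNK_SECTS) % ZONE1_NB_DEV;
}

/*
 * Member-disk index inside zone 1 as computed from v3.13 on: the chunk
 * number comes from the offset inside the zone, i.e. the value that
 * find_zone() writes back and that is now passed to map_sector().
 */
static unsigned long long new_dev(unsigned long long array_sector)
{
	unsigned long long zone_offset =
		array_sector - ZONE0_CHUNKS * CHUNK_SECTS;

	return (zone_offset / CHUNK_SECTS) % ZONE1_NB_DEV;
}

int main(void)
{
	unsigned long long chunk;

	/* walk every chunk of zone 1 (array chunks 9..12) */
	for (chunk = ZONE0_CHUNKS; chunk < ZONE0_CHUNKS + 4; chunk++) {
		unsigned long long s = chunk * CHUNK_SECTS;

		printf("array chunk %llu: old kernel -> zone1 disk %llu, "
		       "new kernel -> zone1 disk %llu\n",
		       chunk, old_dev(s), new_dev(s));
	}
	return 0;
}

In this example zone 0 holds an odd number of chunks, so every chunk of
zone 1 lands on the other member disk after the upgrade: data written by
the old kernel is read back from the wrong disk by the new one, and vice
versa. In this simplified model the two calculations only diverge for a
later zone that still spans more than one disk; a single-disk tail zone
maps the same way either way.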