Re: Fwd: Re: mdadm I/O error with Ddf RAID

Thanks, Neil, for your assistance. We have root-caused the issue: there
was a problem in how PhysicalRefNo and Starting Block were being set in
the Configuration Record. With that fixed, the wrong LBA is no longer
seen. Thanks for your response.

Regards,
Arka

On Tue, Nov 22, 2016 at 3:00 PM, Arka Sharma <arka.sw1988@xxxxxxxxx> wrote:
> I have observed that the following block
> else if (!mddev->bitmap)
>                         j = mddev->recovery_cp;
> is being executed in md_do_sync(). I performed two tests. In case 1 I
> filled the entire 32 MB of the physical disks with 0xFF and then wrote
> the metadata; in case 2 we filled the 32 MB with zeros and then wrote
> the metadata. In both cases we receive the "md/raid1:md126: not clean
> -- starting background reconstruction" message from md when there is
> an access to LBA 1000182866. However, when I create the RAID 1 using
> mdadm and reboot the system, there is no access to LBA 1000182866.
> Also, when I read that sector after creating the RAID 1 with mdadm, we
> see that the block contains 0xFF, and we have confirmed that mdadm
> also writes the config data at 1000182610. Only a RAID created through
> our application results in an access at that offset.
>
> Regards,
> Arka
>
> On Tue, Nov 22, 2016 at 5:24 AM, NeilBrown <neilb@xxxxxxxx> wrote:
>> On Tue, Nov 22 2016, Arka Sharma wrote:
>>
>>> ---------- Forwarded message ----------
>>> From: "Arka Sharma" <arka.sw1988@xxxxxxxxx>
>>> Date: 21 Nov 2016 12:57 p.m.
>>> Subject: Re: mdadm I/O error with Ddf RAID
>>> To: "NeilBrown" <neilb@xxxxxxxx>
>>> Cc: <linux-raid@xxxxxxxxxxxxxxx>
>>>
>>> I have run mdadm --examine on both the component devices as well as on
>>> the container. This shows that one of the component disks is marked
>>> offline and its status is failed. When I run mdadm --detail on the RAID
>>> device, it shows the state of component disk 0 as removed. Since I am
>>> quite new to md and Linux in general, I have not been able to fully
>>> root cause this issue. I have made a couple of observations, though:
>>> before the invalid sector 18446744073709551615 is sent, sector
>>> 1000182866 is accessed, after which mdraid reports "not clean" and
>>> starts background reconstruction. I read LBA 1000182866 and the block
>>> contains 0xFF. So is md expecting something in the metadata that we are
>>> not populating? Please find attached md127.txt, the output of mdadm
>>> --examine <container>; blk-core_diff.txt, which contains the printks;
>>> dmesg.txt; and DDF_Header0.txt and DDF_Header1.txt, which are dumps of
>>> the DDF headers for both disks.
>>
>> Thanks for providing more details.
>>
>> Sector 1000182866 is 256 sectors into the config section.
>> It starts reading the config section at 1000182610 and gets 256 sectors,
>> so it reads the rest from 1000182866 and then starts the array.
>>
>> My guess is that md is getting confused about resync and recovery.
>> It tries a resync, but as the array appears degraded, this code:
>>                 if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery))
>>                         j = mddev->resync_min;
>>                 else if (!mddev->bitmap)
>>                         j = mddev->recovery_cp;
>>
>> in md_do_sync() sets 'j' to MaxSector, which is effectively "-1".  It
>> then starts resync from there and goes crazy.  You could put a printk in
>> there to confirm.
>>
>> I don't know why.  Something about the config makes mdadm think the
>> array is degraded.  I might try to find time to dig into it again later.
>>
>> NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


