Song Liu wrote on 2019-10-29 23:55:
On Tue, Oct 29, 2019 at 1:45 PM Anssi Hannula <anssi.hannula@xxxxxx> wrote:
Song Liu wrote on 2019-10-29 22:28:
> On Tue, Oct 29, 2019 at 12:05 PM Anssi Hannula <anssi.hannula@xxxxxx>
> wrote:
>>
>> Song Liu wrote on 2019-10-29 08:04:
>> > I guess we get into the "is_bad" case, but that should not happen?
>>
>> Right, is_bad is set, which causes R5_Insync and R5_ReadError to be set
>> on lines 4497-4498, and R5_Insync to be cleared on line 4554 (if
>> R5_ReadError then clear R5_Insync).
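The flag sequence described above can be sketched as a toy model. This is only an illustration of the behaviour reported in the message, not the kernel code: the function names `analyse_device` and `count_not_in_sync` are hypothetical, and the flags are modelled as plain strings rather than bits.

```python
def analyse_device(is_bad):
    """Toy model of the per-device flag handling described in the message.

    Hypothetical helper, not the kernel implementation.
    """
    flags = set()
    if is_bad:
        # Step 1 (lines 4497-4498 in the message): both flags are set.
        flags.update({"R5_Insync", "R5_ReadError"})
    else:
        flags.add("R5_Insync")
    # Step 2 (line 4554 in the message): a read error clears R5_Insync.
    if "R5_ReadError" in flags:
        flags.discard("R5_Insync")
    return flags


def count_not_in_sync(bad_per_device):
    """Count devices that end up without R5_Insync for one stripe."""
    return sum("R5_Insync" not in analyse_device(bad) for bad in bad_per_device)


print(sorted(analyse_device(is_bad=True)))   # ['R5_ReadError']
print(count_not_in_sync([True] * 10))        # 10
```

In this model, a stripe whose ten members all list the same bad block ends up with ten devices treated as not in sync, which is the situation discussed in the rest of the thread.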
>>
>> As mentioned in my first message and seen in
>> http://onse.fi/files/reshape-infloop-issue/examine-all.txt , the MD bad
>> block lists contain blocks (suspiciously identical across devices).
>> So maybe the code can't properly handle the case where 10 devices have
>> the same block in their bad block list. Not quite sure what "handle"
>> should mean in this case, but certainly something other than a
>> handle_stripe() loop :)
>> There is a "bad" block on 10 devices at sector 198504960, which I guess
>> matches sh->sector 198248960 due to the data offset of 256000 sectors
>> (per --examine).
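The sector arithmetic in the paragraph above can be checked with a short sketch. It assumes, as the message does, that the bad block list reports sectors counted from the start of the member device while sh->sector is counted from the start of the data area; the helper name `data_relative_sector` is hypothetical.

```python
def data_relative_sector(device_sector, data_offset):
    """Convert a device-absolute sector to one relative to the data area.

    Hypothetical helper; assumes both values are in the same sector units.
    """
    if device_sector < data_offset:
        raise ValueError("sector lies before the start of the data area")
    return device_sector - data_offset


bad_block_sector = 198504960   # from the bad block list (per --examine)
data_offset = 256000           # data offset in sectors (per --examine)

print(data_relative_sector(bad_block_sector, data_offset))  # 198248960
```

The result equals the sh->sector value quoted in the message, which is consistent with the guess that the looping stripe is the one covered by the bad block entries.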
>
> OK, it makes sense now. I didn't add the data offset when checking the
> bad block data.
>
>>
>> I've wondered if "dd if=/dev/md0 of=/dev/md0" for the affected blocks
>> would clear the bad blocks and avoid this issue, but I haven't tried
>> that yet so that the infinite loop issue can be investigated/fixed
>> first. I already checked that /dev/md0 is fully readable (which also
>> confuses me a bit since md(8) says "Attempting to read from a known bad
>> block will cause a read error"... maybe I'm missing something).
>>
>
> Maybe try these steps?
>
> https://raid.wiki.kernel.org/index.php/The_Badblocks_controversy#How_do_I_fix_a_Bad_Blocks_problem.3F
Yeah, I guess those steps would probably resolve my situation. BTW,
"--update=force-no-bbl" is not mentioned in mdadm(8); is that on purpose?
I was trying to find such an option earlier.

If you don't need anything more from the array, I'll go ahead and try
clearing the seemingly bogus bad block lists.

Please go ahead. We already got quite a few logs.

Seems that was indeed the issue: clearing the bad block log allowed the
reshape to continue normally.

Thanks for your help.

--
Anssi Hannula