Re: RAID6 gets stuck during reshape with 100% CPU

Song Liu <liu.song.a23@xxxxxxxxx> · Wed, 30 Oct 2019 16:11:55 -0700

On Wed, Oct 30, 2019 at 11:25 AM Anssi Hannula <anssi.hannula@xxxxxx> wrote:
>
> Song Liu wrote on 2019-10-29 23:55:
> > On Tue, Oct 29, 2019 at 1:45 PM Anssi Hannula <anssi.hannula@xxxxxx>
> > wrote:
> >>
> >> Song Liu wrote on 2019-10-29 22:28:
> >> > On Tue, Oct 29, 2019 at 12:05 PM Anssi Hannula <anssi.hannula@xxxxxx>
> >> > wrote:
> >> >>
> >> >> Song Liu wrote on 2019-10-29 08:04:
> >> >> > I guess we get into the "is_bad" case, but it should not be the case?
> >> >>
> >> >> Right, is_bad is set, which causes R5_Insync and R5_ReadError to be
> >> >> set on lines 4497-4498, and R5_Insync to be cleared on line 4554 (if
> >> >> R5_ReadError then clear R5_Insync).
> >> >>
> >> >> As mentioned in my first message and seen in
> >> >> http://onse.fi/files/reshape-infloop-issue/examine-all.txt , the MD
> >> >> bad block lists contain blocks (suspiciously identical across devices).
> >> >> So maybe the code can't properly handle the case where 10 devices have
> >> >> the same block in their bad block list. Not quite sure what "handle"
> >> >> should mean in this case, but certainly something other than a
> >> >> handle_stripe() loop :)
> >> >> There is a "bad" block on 10 devices at sector 198504960, which I
> >> >> guess matches sh->sector 198248960 due to the data offset of 256000
> >> >> sectors (per --examine).
> >> >
> >> > OK, it makes sense now. I didn't add the data offset when checking the
> >> > bad block data.
> >> >
> >> >>
> >> >> I've wondered if "dd if=/dev/md0 of=/dev/md0" for the affected blocks
> >> >> would clear the bad blocks and avoid this issue, but I haven't tried
> >> >> that yet so that the infinite loop issue can be investigated/fixed
> >> >> first. I already checked that /dev/md0 is fully readable (which also
> >> >> confuses me a bit since md(8) says "Attempting to read from a known
> >> >> bad block will cause a read error"... maybe I'm missing something).
> >> >>
> >> >
> >> > Maybe try these steps?
> >> >
> >> > https://raid.wiki.kernel.org/index.php/The_Badblocks_controversy#How_do_I_fix_a_Bad_Blocks_problem.3F
> >>
> >> Yeah, I guess those steps would probably resolve my situation. BTW,
> >> "--update=force-no-bbl" is not mentioned in mdadm(8); is that on
> >> purpose? I was trying to find such an option earlier.
> >>
> >> If you don't need anything more from the array, I'll go ahead and try
> >> clearing the seemingly bogus bad block lists.
> >
> > Please go ahead. We already got quite a few logs.
>
> Seems that was indeed the issue: clearing the bad block log allowed the
> reshape to continue normally.

That's great news!

Song
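
For anyone checking a similar case: the sector arithmetic discussed in the
quoted messages can be sketched as below. This is only an illustration, and
it assumes (as Anssi guessed) that the bad block list entries are counted
from the start of the member device while sh->sector is counted from the
start of the data area. The function name is hypothetical and not part of
md or mdadm.

```python
def bbl_to_data_sector(bbl_sector: int, data_offset: int) -> int:
    """Convert a device-relative bad block sector to a data-relative sector.

    Assumption: bbl_sector counts from the start of the member device and
    data_offset is the "Data Offset" reported by mdadm --examine, both in
    512-byte sectors.
    """
    if bbl_sector < data_offset:
        raise ValueError("bad block lies before the start of the data area")
    return bbl_sector - data_offset


# Values quoted in this thread (bad block sector and data offset).
print(bbl_to_data_sector(198504960, 256000))  # prints 198248960
```

Forgetting to subtract the data offset is what made the bad block entry look
unrelated to the stuck stripe at first.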