Re: RAID6 gets stuck during reshape with 100% CPU

Song Liu <liu.song.a23@xxxxxxxxx> · Thu, 24 Oct 2019 15:56:04 -0700

On Thu, Oct 24, 2019 at 12:42 PM Anssi Hannula <anssi.hannula@xxxxxx> wrote:
>
> Song Liu wrote on 2019-10-24 21:50:
> > Sorry for the delayed reply.
>
> No problem :)
>
> > On Sat, Oct 19, 2019 at 2:10 AM Anssi Hannula <anssi.hannula@xxxxxx>
> > wrote:
> >>
> >> Hi all,
> >>
> >> I'm seeing a reshape issue where the array gets stuck with requests
> >> seemingly getting blocked and md0_raid6 process taking 100% CPU whenever
> >> I --continue the reshape.
> >>
> >> From what I can tell, the md0_raid6 process is stuck processing a set of
> >> stripes over and over via handle_stripe() without progressing.
> >>
> >> Log excerpt of one handle_stripe() of an affected stripe with some extra
> >> logging is below.
> >> The integers in the 4600-5200 range are line numbers in
> >> http://onse.fi/files/reshape-infloop-issue/raid5.c .
> >
> > Maybe add sh->sector to DEBUGPRINT()?
>
> Note that the XX debug printing was guarded by
>
>   bool debout = (sh->sector == 198248960) && __ratelimit(&_rsafasfas);
>
> So everything was for sector 198248960, and rate limited to once every
> 20 sec to avoid a flood.
>
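
For readers following along, the guard quoted above can be approximated in
userspace as below. This is only a sketch under stated assumptions: the
names rl_state, rl_allow() and debug_enabled() are hypothetical, time is
passed in explicitly, and the limiter allows a single message per interval,
whereas the kernel's __ratelimit() from <linux/ratelimit.h> also supports a
burst count. It is not the code used in the patched raid5.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the kernel's struct ratelimit_state. */
struct rl_state {
    int64_t interval_s; /* minimum seconds between allowed messages */
    int64_t last_s;     /* time at which the last message was allowed */
    bool    primed;     /* false until the first message has been allowed */
};

/* Return true if a message may be emitted at time now_s (seconds). */
static bool rl_allow(struct rl_state *rl, int64_t now_s)
{
    if (rl->primed && now_s - rl->last_s < rl->interval_s)
        return false;   /* still inside the quiet interval */
    rl->primed = true;
    rl->last_s = now_s;
    return true;
}

/*
 * Same shape as the quoted guard:
 *   debout = (sh->sector == TARGET) && __ratelimit(&state);
 * The sector test comes first, so stripes at other sectors never
 * touch the rate limiter state.
 */
static bool debug_enabled(uint64_t sector, uint64_t target,
                          struct rl_state *rl, int64_t now_s)
{
    return sector == target && rl_allow(rl, now_s);
}
```

With interval_s set to 20, a call for sector 198248960 is allowed at t=0,
suppressed at t=5, and allowed again at t=20; calls for any other sector
return false without affecting the limiter.
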
> > Also, please add more DEBUGPRINT() in the
> >
> > if (sh->reconstruct_state == reconstruct_state_result) {
> >
> > case.
>
> OK, added prints there.
>
> Though after logging I noticed that the execution never gets there,
> sh->reconstruct_state is always reconstruct_state_idle at that point.
> It gets cleared on the "XX too many failed" log message (line 4798).
>
I guess failed = 10 is the problem here.

What does /proc/mdstat say?

Thanks,
Song