Re: RAID6 gets stuck during reshape with 100% CPU

Song Liu <liu.song.a23@xxxxxxxxx> · Mon, 28 Oct 2019 23:04:21 -0700

On Sat, Oct 26, 2019 at 3:07 AM Anssi Hannula <anssi.hannula@xxxxxx> wrote:
>
> Song Liu kirjoitti 2019-10-25 01:56:
> > On Thu, Oct 24, 2019 at 12:42 PM Anssi Hannula <anssi.hannula@xxxxxx>
> > wrote:
> >> Song Liu kirjoitti 2019-10-24 21:50:
> >> > On Sat, Oct 19, 2019 at 2:10 AM Anssi Hannula <anssi.hannula@xxxxxx>
> >> > wrote:
> >> >>
> >> >> Hi all,
> >> >>
> >> >> I'm seeing a reshape issue where the array gets stuck with requests
> >> >> seemingly getting blocked and md0_raid6 process taking 100% CPU
> >> >> whenever
> >> >> I --continue the reshape.
> >> >>
> >> >>  From what I can tell, the md0_raid6 process is stuck processing a set
> >> >> of
> >> >> stripes over and over via handle_stripe() without progressing.
> >> >>
> >> >> Log excerpt of one handle_stripe() of an affected stripe with some
> >> >> extra
> >> >> logging is below.
> >> >> The 4600-5200 integers are line numbers for
> >> >> http://onse.fi/files/reshape-infloop-issue/raid5.c .
> >> >
> >> > Maybe add sh->sector to DEBUGPRINT()?
> >>
> >> Note that the XX debug printing was guarded by
> >>
> >>   bool debout = (sh->sector == 198248960) && __ratelimit(&_rsafasfas);
> >>
> >> So everything was for sector 198248960 and rate limited every 20sec to
> >> avoid a flood.
> >>
> >> > Also, please add more DEBUGPRINT() in the
> >> >
> >> > if (sh->reconstruct_state == reconstruct_state_result) {
> >> >
> >> > case.
> >>
> >> OK, added prints there.
> >>
> >> Though after logging I noticed that the execution never gets there,
> >> sh->reconstruct_state is always reconstruct_state_idle at that point.
> >> It gets cleared on the "XX too many failed" log message (line 4798).
> >>
> > I guess the failed = 10 is the problem here..
> >
> > What does /proc/mdstat say?
>
> After --assemble --backup-file=xx, before --grow:
>
> md0 : active raid6 sdac[0] sdf[21] sdh[17] sdj[18] sde[26] sdr[20]
> sds[15] sdad[25] sdk[13] sdp[27] sdo[11] sdl[10] sdn[9] sdt[16] md8[28]
> sdi[22] sdae[23] sdaf[24] sdm[3] sdg[2] sdq[1]
>        74232661248 blocks super 1.1 level 6, 64k chunk, algorithm 2
> [20/20] [UUUUUUUUUUUUUUUUUUUU]
>        [===================>.]  reshape = 97.5% (4024886912/4124036736)
> finish=10844512.0min speed=0K/sec
>        bitmap: 5/31 pages [20KB], 65536KB chunk
>
> After --grow --continue --backup-file=xx (i.e. during the
> handle_stripe() loop):
>
> md0 : active raid6 sdac[0] sdf[21] sdh[17] sdj[18] sde[26] sdr[20]
> sds[15] sdad[25] sdk[13] sdp[27] sdo[11] sdl[10] sdn[9] sdt[16] md8[28]
> sdi[22] sdae[23] sdaf[24] sdm[3] sdg[2] sdq[1]
>        74232661248 blocks super 1.1 level 6, 64k chunk, algorithm 2
> [20/20] [UUUUUUUUUUUUUUUUUUUU]
>        [===================>.]  reshape = 97.5% (4024917256/4124036736)
> finish=7674.2min speed=215K/sec
>        bitmap: 5/31 pages [20KB], 65536KB chunk
>
> After a reboot due to the stuck array the reshape progress gets reset
> back to 4024886912.

So, I am looking at this part.

[  274.084437] check 19: state 0x813 read 00000000b79fab11 write
00000000b79fab11 written 00000000b79fab11
[  274.084969] check 18: state 0xa01 read 00000000b79fab11 write
00000000b79fab11 written 00000000b79fab11
[  274.085491] check 17: state 0xa01 read 00000000b79fab11 write
00000000b79fab11 written 00000000b79fab11
[  274.086036] check 16: state 0xa01 read 00000000b79fab11 write
00000000b79fab11 written 00000000b79fab11
[  274.086569] check 15: state 0x811 read 00000000b79fab11 write
00000000b79fab11 written 00000000b79fab11
[  274.087088] check 14: state 0xa01 read 00000000b79fab11 write
00000000b79fab11 written 00000000b79fab11
[  274.087612] check 13: state 0xa01 read 00000000b79fab11 write
00000000b79fab11 written 00000000b79fab11
[  274.088129] check 12: state 0xa01 read 00000000b79fab11 write
00000000b79fab11 written 00000000b79fab11
[  274.088652] check 11: state 0x811 read 00000000b79fab11 write
00000000b79fab11 written 00000000b79fab11
[  274.089167] check 10: state 0x811 read 00000000b79fab11 write
00000000b79fab11 written 00000000b79fab11
[  274.089691] check 9: state 0x811 read 00000000b79fab11 write
00000000b79fab11 written 00000000b79fab11
[  274.090209] check 8: state 0x811 read 00000000b79fab11 write
00000000b79fab11 written 00000000b79fab11
[  274.090727] check 7: state 0xa01 read 00000000b79fab11 write
00000000b79fab11 written 00000000b79fab11
[  274.091243] check 6: state 0xa01 read 00000000b79fab11 write
00000000b79fab11 written 00000000b79fab11
[  274.091763] check 5: state 0xa01 read 00000000b79fab11 write
00000000b79fab11 written 00000000b79fab11
[  274.092279] check 4: state 0xa01 read 00000000b79fab11 write
00000000b79fab11 written 00000000b79fab11
[  274.092800] check 3: state 0x811 read 00000000b79fab11 write
00000000b79fab11 written 00000000b79fab11
[  274.093309] check 2: state 0x811 read 00000000b79fab11 write
00000000b79fab11 written 00000000b79fab11
[  274.093810] check 1: state 0x811 read 00000000b79fab11 write
00000000b79fab11 written 00000000b79fab11
[  274.094317] check 0: state 0x813 read 00000000b79fab11 write
00000000b79fab11 written 00000000b79fab11

So 10 devices don't have R5_Insync here.

Could you please add more prints to the following section:

if (!rdev)
        /* Not in-sync */;
else if (is_bad) {
       /* also not in-sync */

[...]

else if (test_bit(R5_UPTODATE, &dev->flags) &&
         test_bit(R5_Expanded, &dev->flags))
        /* If we've reshaped into here, we assume it is Insync.
        * We will shortly update recovery_offset to make
        * it official.
       */
       set_bit(R5_Insync, &dev->flags);

I guess we get into "is_bad", case, but it should not be the case?

Thanks,
Song