Re: RAID6 gets stuck during reshape with 100% CPU

On Tue, Oct 29, 2019 at 12:05 PM Anssi Hannula <anssi.hannula@xxxxxx> wrote:
>
> Song Liu kirjoitti 2019-10-29 08:04:
> > On Sat, Oct 26, 2019 at 3:07 AM Anssi Hannula <anssi.hannula@xxxxxx>
> > wrote:
> >>
> >> Song Liu kirjoitti 2019-10-25 01:56:
> >> > On Thu, Oct 24, 2019 at 12:42 PM Anssi Hannula <anssi.hannula@xxxxxx>
> >> > wrote:
> >> >> Song Liu kirjoitti 2019-10-24 21:50:
> >> >> > On Sat, Oct 19, 2019 at 2:10 AM Anssi Hannula <anssi.hannula@xxxxxx>
> >> >> > wrote:
> >> >> >>
> >> >> >> Hi all,
> >> >> >>
> >> >> >> I'm seeing a reshape issue where the array gets stuck, with requests
> >> >> >> seemingly blocked and the md0_raid6 process taking 100% CPU, whenever
> >> >> >> I --continue the reshape.
> >> >> >>
> >> >> >> From what I can tell, the md0_raid6 process is stuck processing a
> >> >> >> set of stripes over and over via handle_stripe() without progressing.
> >> >> >>
> >> >> >> A log excerpt of one handle_stripe() pass over an affected stripe,
> >> >> >> with some extra logging, is below.
> >> >> >> The integers in the 4600-5200 range are line numbers for
> >> >> >> http://onse.fi/files/reshape-infloop-issue/raid5.c .
> >> >> >
> >> >> > Maybe add sh->sector to DEBUGPRINT()?
> >> >>
> >> >> Note that the XX debug printing was guarded by
> >> >>
> >> >>   bool debout = (sh->sector == 198248960) && __ratelimit(&_rsafasfas);
> >> >>
> >> >> So everything was for sector 198248960 and rate limited every 20sec to
> >> >> avoid a flood.
> >> >>
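For anyone following along in the archives: such a guard boils down to
roughly the following, with invented names (DEFINE_RATELIMIT_STATE() and
__ratelimit() are the stock helpers from <linux/ratelimit.h>):

  /* fire only for the stripe of interest, at most once per 20 s */
  static DEFINE_RATELIMIT_STATE(_rs_dbg, 20 * HZ, 1);

  bool debout = (sh->sector == 198248960) && __ratelimit(&_rs_dbg);
  if (debout)
          pr_info("XX handle_stripe sector %llu\n",
                  (unsigned long long)sh->sector);
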
> >> >> > Also, please add more DEBUGPRINT() in the
> >> >> >
> >> >> > if (sh->reconstruct_state == reconstruct_state_result) {
> >> >> >
> >> >> > case.
> >> >>
> >> >> OK, added prints there.
> >> >>
> >> >> Though after adding the logging I noticed that execution never gets
> >> >> there; sh->reconstruct_state is always reconstruct_state_idle at that
> >> >> point.
> >> >> It gets cleared on the "XX too many failed" log message (line 4798).
> >> >>
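For context, that clearing happens in the degraded-stripe bail-out in
handle_stripe(); paraphrased from memory, so the exact form may differ
by kernel version:

  if (s.failed > conf->max_degraded || s.log_failed) {
          /* too many failed devices to service this stripe:
           * abort any check/reconstruct in progress, which is
           * why reconstruct_state reads as idle afterwards */
          sh->check_state = 0;
          sh->reconstruct_state = 0;
          /* ... */
  }
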
> >> > I guess failed = 10 is the problem here.
> >> >
> >> > What does /proc/mdstat say?
> >>
> >> After --assemble --backup-file=xx, before --grow:
> >>
> >> md0 : active raid6 sdac[0] sdf[21] sdh[17] sdj[18] sde[26] sdr[20] sds[15] sdad[25] sdk[13] sdp[27] sdo[11] sdl[10] sdn[9] sdt[16] md8[28] sdi[22] sdae[23] sdaf[24] sdm[3] sdg[2] sdq[1]
> >>       74232661248 blocks super 1.1 level 6, 64k chunk, algorithm 2 [20/20] [UUUUUUUUUUUUUUUUUUUU]
> >>       [===================>.]  reshape = 97.5% (4024886912/4124036736) finish=10844512.0min speed=0K/sec
> >>       bitmap: 5/31 pages [20KB], 65536KB chunk
> >>
> >> After --grow --continue --backup-file=xx (i.e. during the
> >> handle_stripe() loop):
> >>
> >> md0 : active raid6 sdac[0] sdf[21] sdh[17] sdj[18] sde[26] sdr[20] sds[15] sdad[25] sdk[13] sdp[27] sdo[11] sdl[10] sdn[9] sdt[16] md8[28] sdi[22] sdae[23] sdaf[24] sdm[3] sdg[2] sdq[1]
> >>       74232661248 blocks super 1.1 level 6, 64k chunk, algorithm 2 [20/20] [UUUUUUUUUUUUUUUUUUUU]
> >>       [===================>.]  reshape = 97.5% (4024917256/4124036736) finish=7674.2min speed=215K/sec
> >>       bitmap: 5/31 pages [20KB], 65536KB chunk
> >>
> >> After a reboot (required due to the stuck array), the reshape progress
> >> gets reset back to 4024886912.
> >
> > So, I am looking at this part.
> >
> > [  274.084437] check 19: state 0x813 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> > [  274.084969] check 18: state 0xa01 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> > [  274.085491] check 17: state 0xa01 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> > [  274.086036] check 16: state 0xa01 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> > [  274.086569] check 15: state 0x811 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> > [  274.087088] check 14: state 0xa01 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> > [  274.087612] check 13: state 0xa01 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> > [  274.088129] check 12: state 0xa01 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> > [  274.088652] check 11: state 0x811 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> > [  274.089167] check 10: state 0x811 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> > [  274.089691] check 9: state 0x811 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> > [  274.090209] check 8: state 0x811 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> > [  274.090727] check 7: state 0xa01 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> > [  274.091243] check 6: state 0xa01 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> > [  274.091763] check 5: state 0xa01 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> > [  274.092279] check 4: state 0xa01 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> > [  274.092800] check 3: state 0x811 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> > [  274.093309] check 2: state 0x811 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> > [  274.093810] check 1: state 0x811 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> > [  274.094317] check 0: state 0x813 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> >
> > So 10 devices don't have R5_Insync here.
> >
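As an aside, decoding those state words against the r5dev_flags bit
order in raid5.h (bit positions can shift between kernel versions, so
treat this as a sketch):

  0x811 = R5_Expanded | R5_Insync    | R5_UPTODATE
  0x813 = 0x811 | R5_LOCKED
  0xa01 = R5_Expanded | R5_ReadError | R5_UPTODATE   (no R5_Insync)

The ten 0xa01 entries (devs 4-7, 12-14, and 16-18) are exactly the ten
devices lacking R5_Insync.
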
> > Could you please add more prints to the following section:
> >
> > if (!rdev)
> >         /* Not in-sync */;
> > else if (is_bad) {
> >         /* also not in-sync */
> >
> > [...]
> >
> > else if (test_bit(R5_UPTODATE, &dev->flags) &&
> >          test_bit(R5_Expanded, &dev->flags))
> >         /* If we've reshaped into here, we assume it is Insync.
> >          * We will shortly update recovery_offset to make
> >          * it official.
> >          */
> >         set_bit(R5_Insync, &dev->flags);
>
>
> Added some more debugging:
> http://onse.fi/files/reshape-infloop-issue/dmesg-3.txt
> http://onse.fi/files/reshape-infloop-issue/raid5-3.c
> with excerpt below.
>
> > I guess we get into the "is_bad" case, but it should not be the case?
>
> Right, is_bad is set, which causes R5_Insync and R5_ReadError to be set
> on lines 4497-4498, and R5_Insync to be cleared on line 4554 (if
> R5_ReadError then clear R5_Insync).
>
> As mentioned in my first message and seen in
> http://onse.fi/files/reshape-infloop-issue/examine-all.txt , the MD bad
> block lists contain blocks (suspiciously identical across devices).
> So maybe the code can't properly handle the case where 10 devices have
> the same block in their bad block lists. Not quite sure what "handle"
> should mean in this case, but certainly something other than an endless
> handle_stripe() loop :)
> There is a "bad" block on 10 devices at sector 198504960, which I guess
> matches sh->sector 198248960 due to the data offset of 256000 sectors
> (per --examine).

OK, that makes sense now. I didn't account for the data offset when
checking the bad block data.
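To spell out the mapping: 198248960 (sh->sector, before the per-device
Data Offset) + 256000 (Data Offset per --examine) = 198504960, which is
exactly the sector recorded in the bad block lists. So the stripe that
loops is the one whose members all carry that badblock entry.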

>
> I've wondered if "dd if=/dev/md0 of=/dev/md0" over the affected blocks
> would rewrite and thus clear the bad blocks and avoid this issue, but
> I've held off on that so the infinite loop issue can be
> investigated/fixed first. I already checked that /dev/md0 is fully
> readable (which also confuses me a bit, since md(4) says "Attempting to
> read from a known bad block will cause a read error"... maybe I'm
> missing something).
>

Maybe try these steps?

https://raid.wiki.kernel.org/index.php/The_Badblocks_controversy#How_do_I_fix_a_Bad_Blocks_problem.3F
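If I remember correctly, the key step there is re-assembling with
"mdadm --assemble --update=force-no-bbl" to drop the (most likely bogus)
bad block lists from the member superblocks, but please double-check
against the wiki page before trying it on the real array.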

Thanks,
Song


