Re: RAID6 gets stuck during reshape with 100% CPU

Anssi Hannula <anssi.hannula@xxxxxx> · Sat, 26 Oct 2019 13:07:51 +0300

Song Liu wrote on 2019-10-25 01:56:
On Thu, Oct 24, 2019 at 12:42 PM Anssi Hannula <anssi.hannula@xxxxxx> wrote:
Song Liu wrote on 2019-10-24 21:50:
> On Sat, Oct 19, 2019 at 2:10 AM Anssi Hannula <anssi.hannula@xxxxxx> wrote:
>>
>> Hi all,
>>
>> I'm seeing a reshape issue where the array gets stuck with requests
>> seemingly getting blocked and md0_raid6 process taking 100% CPU whenever
>> I --continue the reshape.
>>
>> From what I can tell, the md0_raid6 process is stuck processing a set of
>> stripes over and over via handle_stripe() without progressing.
>>
>> Log excerpt of one handle_stripe() of an affected stripe with some extra
>> logging is below.
>> The 4600-5200 integers are line numbers for
>> http://onse.fi/files/reshape-infloop-issue/raid5.c .
>
> Maybe add sh->sector to DEBUGPRINT()?

Note that the XX debug printing was guarded by

  bool debout = (sh->sector == 198248960) && __ratelimit(&_rsafasfas);

So all of the output was for sector 198248960, rate-limited on a 20-second
interval to avoid a flood.

> Also, please add more DEBUGPRINT() in the
>
> if (sh->reconstruct_state == reconstruct_state_result) {
>
> case.

OK, added prints there.

After logging, though, I noticed that execution never gets there:
sh->reconstruct_state is always reconstruct_state_idle at that point.
It gets cleared at the "XX too many failed" log message (line 4798).

I guess failed = 10 is the problem here.

> What does /proc/mdstat say?

After --assemble --backup-file=xx, before --grow:

md0 : active raid6 sdac[0] sdf[21] sdh[17] sdj[18] sde[26] sdr[20] sds[15] sdad[25] sdk[13] sdp[27] sdo[11] sdl[10] sdn[9] sdt[16] md8[28] sdi[22] sdae[23] sdaf[24] sdm[3] sdg[2] sdq[1]
      74232661248 blocks super 1.1 level 6, 64k chunk, algorithm 2 [20/20] [UUUUUUUUUUUUUUUUUUUU]
      [===================>.]  reshape = 97.5% (4024886912/4124036736) finish=10844512.0min speed=0K/sec
      bitmap: 5/31 pages [20KB], 65536KB chunk

After --grow --continue --backup-file=xx (i.e. during the handle_stripe() loop):

md0 : active raid6 sdac[0] sdf[21] sdh[17] sdj[18] sde[26] sdr[20] sds[15] sdad[25] sdk[13] sdp[27] sdo[11] sdl[10] sdn[9] sdt[16] md8[28] sdi[22] sdae[23] sdaf[24] sdm[3] sdg[2] sdq[1]
      74232661248 blocks super 1.1 level 6, 64k chunk, algorithm 2 [20/20] [UUUUUUUUUUUUUUUUUUUU]
      [===================>.]  reshape = 97.5% (4024917256/4124036736) finish=7674.2min speed=215K/sec
      bitmap: 5/31 pages [20KB], 65536KB chunk

After a reboot (needed because of the stuck array), the reshape progress
gets reset back to 4024886912.

--
Anssi Hannula