Song Liu wrote on 2019-10-29 08:04:
> On Sat, Oct 26, 2019 at 3:07 AM Anssi Hannula <anssi.hannula@xxxxxx>
> wrote:
>>
>> Song Liu wrote on 2019-10-25 01:56:
>> > On Thu, Oct 24, 2019 at 12:42 PM Anssi Hannula <anssi.hannula@xxxxxx>
>> > wrote:
>> >> Song Liu wrote on 2019-10-24 21:50:
>> >> > On Sat, Oct 19, 2019 at 2:10 AM Anssi Hannula <anssi.hannula@xxxxxx>
>> >> > wrote:
>> >> >>
>> >> >> Hi all,
>> >> >>
>> >> >> I'm seeing a reshape issue where the array gets stuck with requests
>> >> >> seemingly getting blocked and md0_raid6 process taking 100% CPU
>> >> >> whenever
>> >> >> I --continue the reshape.
>> >> >>
>> >> >> From what I can tell, the md0_raid6 process is stuck processing a set
>> >> >> of
>> >> >> stripes over and over via handle_stripe() without progressing.
>> >> >>
>> >> >> Log excerpt of one handle_stripe() of an affected stripe with some
>> >> >> extra
>> >> >> logging is below.
>> >> >> The 4600-5200 integers are line numbers for
>> >> >> http://onse.fi/files/reshape-infloop-issue/raid5.c .
>> >> >
>> >> > Maybe add sh->sector to DEBUGPRINT()?
>> >>
>> >> Note that the XX debug printing was guarded by
>> >>
>> >> bool debout = (sh->sector == 198248960) && __ratelimit(&_rsafasfas);
>> >>
>> >> So everything was for sector 198248960 and rate limited every 20sec to
>> >> avoid a flood.
>> >>
>> >> > Also, please add more DEBUGPRINT() in the
>> >> >
>> >> > if (sh->reconstruct_state == reconstruct_state_result) {
>> >> >
>> >> > case.
>> >>
>> >> OK, added prints there.
>> >>
>> >> Though after logging I noticed that the execution never gets there,
>> >> sh->reconstruct_state is always reconstruct_state_idle at that point.
>> >> It gets cleared on the "XX too many failed" log message (line 4798).
>> >>
>> > I guess failed = 10 is the problem here...
>> >
>> > What does /proc/mdstat say?
>>
>> After --assemble --backup-file=xx, before --grow:
>>
>> md0 : active raid6 sdac[0] sdf[21] sdh[17] sdj[18] sde[26] sdr[20]
>>       sds[15] sdad[25] sdk[13] sdp[27] sdo[11] sdl[10] sdn[9] sdt[16]
>>       md8[28] sdi[22] sdae[23] sdaf[24] sdm[3] sdg[2] sdq[1]
>>       74232661248 blocks super 1.1 level 6, 64k chunk, algorithm 2
>>       [20/20] [UUUUUUUUUUUUUUUUUUUU]
>>       [===================>.]  reshape = 97.5% (4024886912/4124036736)
>>       finish=10844512.0min speed=0K/sec
>>       bitmap: 5/31 pages [20KB], 65536KB chunk
>>
>> After --grow --continue --backup-file=xx (i.e. during the
>> handle_stripe() loop):
>>
>> md0 : active raid6 sdac[0] sdf[21] sdh[17] sdj[18] sde[26] sdr[20]
>>       sds[15] sdad[25] sdk[13] sdp[27] sdo[11] sdl[10] sdn[9] sdt[16]
>>       md8[28] sdi[22] sdae[23] sdaf[24] sdm[3] sdg[2] sdq[1]
>>       74232661248 blocks super 1.1 level 6, 64k chunk, algorithm 2
>>       [20/20] [UUUUUUUUUUUUUUUUUUUU]
>>       [===================>.]  reshape = 97.5% (4024917256/4124036736)
>>       finish=7674.2min speed=215K/sec
>>       bitmap: 5/31 pages [20KB], 65536KB chunk
>>
>> After a reboot due to the stuck array the reshape progress gets reset
>> back to 4024886912.
>
> So, I am looking at this part.
>
> [ 274.084437] check 19: state 0x813 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> [ 274.084969] check 18: state 0xa01 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> [ 274.085491] check 17: state 0xa01 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> [ 274.086036] check 16: state 0xa01 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> [ 274.086569] check 15: state 0x811 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> [ 274.087088] check 14: state 0xa01 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> [ 274.087612] check 13: state 0xa01 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> [ 274.088129] check 12: state 0xa01 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> [ 274.088652] check 11: state 0x811 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> [ 274.089167] check 10: state 0x811 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> [ 274.089691] check 9: state 0x811 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> [ 274.090209] check 8: state 0x811 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> [ 274.090727] check 7: state 0xa01 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> [ 274.091243] check 6: state 0xa01 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> [ 274.091763] check 5: state 0xa01 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> [ 274.092279] check 4: state 0xa01 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> [ 274.092800] check 3: state 0x811 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> [ 274.093309] check 2: state 0x811 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> [ 274.093810] check 1: state 0x811 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
> [ 274.094317] check 0: state 0x813 read 00000000b79fab11 write 00000000b79fab11 written 00000000b79fab11
>
> So 10 devices don't have R5_Insync here.
>
> Could you please add more prints to the following section:
>
> if (!rdev)
> /* Not in-sync */;
> else if (is_bad) {
> /* also not in-sync */
>
> [...]
>
> else if (test_bit(R5_UPTODATE, &dev->flags) &&
> test_bit(R5_Expanded, &dev->flags))
> /* If we've reshaped into here, we assume it is Insync.
> * We will shortly update recovery_offset to make
> * it official.
> */
> set_bit(R5_Insync, &dev->flags);
Added some more debugging:
http://onse.fi/files/reshape-infloop-issue/dmesg-3.txt
http://onse.fi/files/reshape-infloop-issue/raid5-3.c
with an excerpt below.
> I guess we get into the "is_bad" case, but it should not be the case?
Right, is_bad is set, which causes R5_Insync and R5_ReadError to be set
on lines 4497-4498, and R5_Insync to be cleared on line 4554 (if
R5_ReadError, then clear R5_Insync).
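
For reference, that path looks roughly like this (paraphrased from the
raid5-3.c linked above, so take it as a sketch of the logic rather than
the exact code at those line numbers):

    } else if (is_bad) {
            /* also not in-sync */
            if (!test_bit(WriteErrorSeen, &rdev->flags) &&
                test_bit(R5_UPTODATE, &dev->flags)) {
                    /* in-memory data is up to date, so keep the device
                     * usable but remember the bad block */
                    set_bit(R5_Insync, &dev->flags);     /* ~line 4497 */
                    set_bit(R5_ReadError, &dev->flags);  /* ~line 4498 */
            }
    }
    ...
    if (test_bit(R5_ReadError, &dev->flags))
            clear_bit(R5_Insync, &dev->flags);           /* ~line 4554 */
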
As mentioned in my first message and seen in
http://onse.fi/files/reshape-infloop-issue/examine-all.txt , the MD bad
block lists contain blocks (suspiciously identical across devices).
So maybe the code can't properly handle the case where 10 devices have
the same block in their bad block list. I'm not quite sure what "handle"
should mean in this case, but certainly something other than a
handle_stripe() loop :)
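
What then trips over this looks roughly like the following (again
paraphrased from the posted raid5-3.c; the "XX too many failed" print at
line 4798 is in the second part), i.e. with 10 non-Insync devices every
handle_stripe() pass counts them as failed and throws the reconstruct
state away without making progress:

    if (!test_bit(R5_Insync, &dev->flags)) {
            if (s->failed < 2)
                    s->failed_num[s->failed] = i;
            s->failed++;                    /* ends up at 10 here */
    }
    ...
    if (s.failed > conf->max_degraded) {    /* 10 > 2 for raid6 */
            /* "XX too many failed" (~line 4798) */
            sh->check_state = 0;
            sh->reconstruct_state = 0;      /* so reconstruct_state_result
                                             * is never reached */
            ...
    }
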
There is a "bad" block on 10 devices on sector 198504960, which I
guess
matches sh->sector 198248960 due to data offset of 256000 sectors (per
--examine).
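
That is, just the stripe sector plus the per-device Data Offset from
--examine (assuming the same 256000-sector offset applies to all
members):

    sector_t sh_sector   = 198248960;                 /* sh->sector in the logs */
    sector_t data_offset = 256000;                    /* Data Offset per --examine */
    sector_t dev_sector  = sh_sector + data_offset;   /* = 198504960, the listed bad block */
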
I've wondered if "dd if=/dev/md0 of=/dev/md0" for the affected blocks
would clear the bad blocks and avoid this issue, but I haven't tried
that yet so that the infinite loop issue can be investigated/fixed
first. I already checked that /dev/md0 is fully readable (which also
confuses me a bit, since md(4) says "Attempting to read from a known bad
block will cause a read error"... maybe I'm missing something).