On Tue, Oct 29, 2019 at 12:05 PM Anssi Hannula <anssi.hannula@xxxxxx> wrote:
>
> Song Liu kirjoitti 2019-10-29 08:04:
> > On Sat, Oct 26, 2019 at 3:07 AM Anssi Hannula <anssi.hannula@xxxxxx>
> > wrote:
> >>
> >> Song Liu kirjoitti 2019-10-25 01:56:
> >> > On Thu, Oct 24, 2019 at 12:42 PM Anssi Hannula <anssi.hannula@xxxxxx>
> >> > wrote:
> >> >> Song Liu kirjoitti 2019-10-24 21:50:
> >> >> > On Sat, Oct 19, 2019 at 2:10 AM Anssi Hannula <anssi.hannula@xxxxxx>
> >> >> > wrote:
> >> >> >>
> >> >> >> Hi all,
> >> >> >>
> >> >> >> I'm seeing a reshape issue where the array gets stuck with requests
> >> >> >> seemingly getting blocked and md0_raid6 process taking 100% CPU
> >> >> >> whenever
> >> >> >> I --continue the reshape.
> >> >> >>
> >> >> >> From what I can tell, the md0_raid6 process is stuck processing a set
> >> >> >> of
> >> >> >> stripes over and over via handle_stripe() without progressing.
> >> >> >>
> >> >> >> Log excerpt of one handle_stripe() of an affected stripe with some
> >> >> >> extra
> >> >> >> logging is below.
> >> >> >> The 4600-5200 integers are line numbers for
> >> >> >> http://onse.fi/files/reshape-infloop-issue/raid5.c .
> >> >> >
> >> >> > Maybe add sh->sector to DEBUGPRINT()?
> >> >>
> >> >> Note that the XX debug printing was guarded by
> >> >>
> >> >> bool debout = (sh->sector == 198248960) && __ratelimit(&_rsafasfas);
> >> >>
> >> >> So everything was for sector 198248960 and rate limited every 20sec to
> >> >> avoid a flood.
> >> >>
> >> >> > Also, please add more DEBUGPRINT() in the
> >> >> >
> >> >> > if (sh->reconstruct_state == reconstruct_state_result) {
> >> >> >
> >> >> > case.
> >> >>
> >> >> OK, added prints there.
> >> >>
> >> >> Though after logging I noticed that the execution never gets there,
> >> >> sh->reconstruct_state is always reconstruct_state_idle at that point.
> >> >> It gets cleared on the "XX too many failed" log message (line 4798).
> >> >>
> >> > I guess the failed = 10 is the problem here..
> >> >
> >> > What does /proc/mdstat say?
> >>
> >> After --assemble --backup-file=xx, before --grow:
> >>
> >> md0 : active raid6 sdac[0] sdf[21] sdh[17] sdj[18] sde[26] sdr[20]
> >> sds[15] sdad[25] sdk[13] sdp[27] sdo[11] sdl[10] sdn[9] sdt[16]
> >> md8[28]
> >> sdi[22] sdae[23] sdaf[24] sdm[3] sdg[2] sdq[1]
> >> 74232661248 blocks super 1.1 level 6, 64k chunk, algorithm 2
> >> [20/20] [UUUUUUUUUUUUUUUUUUUU]
> >> [===================>.] reshape = 97.5%
> >> (4024886912/4124036736)
> >> finish=10844512.0min speed=0K/sec
> >> bitmap: 5/31 pages [20KB], 65536KB chunk
> >>
> >> After --grow --continue --backup-file=xx (i.e. during the
> >> handle_stripe() loop):
> >>
> >> md0 : active raid6 sdac[0] sdf[21] sdh[17] sdj[18] sde[26] sdr[20]
> >> sds[15] sdad[25] sdk[13] sdp[27] sdo[11] sdl[10] sdn[9] sdt[16]
> >> md8[28]
> >> sdi[22] sdae[23] sdaf[24] sdm[3] sdg[2] sdq[1]
> >> 74232661248 blocks super 1.1 level 6, 64k chunk, algorithm 2
> >> [20/20] [UUUUUUUUUUUUUUUUUUUU]
> >> [===================>.] reshape = 97.5%
> >> (4024917256/4124036736)
> >> finish=7674.2min speed=215K/sec
> >> bitmap: 5/31 pages [20KB], 65536KB chunk
> >>
> >> After a reboot due to the stuck array the reshape progress gets reset
> >> back to 4024886912.
> >
> > So, I am looking at this part.
> >
> > [ 274.084437] check 19: state 0x813 read 00000000b79fab11 write
> > 00000000b79fab11 written 00000000b79fab11
> > [ 274.084969] check 18: state 0xa01 read 00000000b79fab11 write
> > 00000000b79fab11 written 00000000b79fab11
> > [ 274.085491] check 17: state 0xa01 read 00000000b79fab11 write
> > 00000000b79fab11 written 00000000b79fab11
> > [ 274.086036] check 16: state 0xa01 read 00000000b79fab11 write
> > 00000000b79fab11 written 00000000b79fab11
> > [ 274.086569] check 15: state 0x811 read 00000000b79fab11 write
> > 00000000b79fab11 written 00000000b79fab11
> > [ 274.087088] check 14: state 0xa01 read 00000000b79fab11 write
> > 00000000b79fab11 written 00000000b79fab11
> > [ 274.087612] check 13: state 0xa01 read 00000000b79fab11 write
> > 00000000b79fab11 written 00000000b79fab11
> > [ 274.088129] check 12: state 0xa01 read 00000000b79fab11 write
> > 00000000b79fab11 written 00000000b79fab11
> > [ 274.088652] check 11: state 0x811 read 00000000b79fab11 write
> > 00000000b79fab11 written 00000000b79fab11
> > [ 274.089167] check 10: state 0x811 read 00000000b79fab11 write
> > 00000000b79fab11 written 00000000b79fab11
> > [ 274.089691] check 9: state 0x811 read 00000000b79fab11 write
> > 00000000b79fab11 written 00000000b79fab11
> > [ 274.090209] check 8: state 0x811 read 00000000b79fab11 write
> > 00000000b79fab11 written 00000000b79fab11
> > [ 274.090727] check 7: state 0xa01 read 00000000b79fab11 write
> > 00000000b79fab11 written 00000000b79fab11
> > [ 274.091243] check 6: state 0xa01 read 00000000b79fab11 write
> > 00000000b79fab11 written 00000000b79fab11
> > [ 274.091763] check 5: state 0xa01 read 00000000b79fab11 write
> > 00000000b79fab11 written 00000000b79fab11
> > [ 274.092279] check 4: state 0xa01 read 00000000b79fab11 write
> > 00000000b79fab11 written 00000000b79fab11
> > [ 274.092800] check 3: state 0x811 read 00000000b79fab11 write
> > 00000000b79fab11 written 00000000b79fab11
> > [ 274.093309] check 2: state 0x811 read 00000000b79fab11 write
> > 00000000b79fab11 written 00000000b79fab11
> > [ 274.093810] check 1: state 0x811 read 00000000b79fab11 write
> > 00000000b79fab11 written 00000000b79fab11
> > [ 274.094317] check 0: state 0x813 read 00000000b79fab11 write
> > 00000000b79fab11 written 00000000b79fab11
> >
> > So 10 devices don't have R5_Insync here.
> >
> > Could you please add more prints to the following section:
> >
> >         if (!rdev)
> >                 /* Not in-sync */;
> >         else if (is_bad) {
> >                 /* also not in-sync */
> >
> >         [...]
> >
> >         else if (test_bit(R5_UPTODATE, &dev->flags) &&
> >                  test_bit(R5_Expanded, &dev->flags))
> >                 /* If we've reshaped into here, we assume it is Insync.
> >                  * We will shortly update recovery_offset to make
> >                  * it official.
> >                  */
> >                 set_bit(R5_Insync, &dev->flags);
>
> Added some more debugging:
> http://onse.fi/files/reshape-infloop-issue/dmesg-3.txt
> http://onse.fi/files/reshape-infloop-issue/raid5-3.c
> with excerpt below.
>
> > I guess we get into "is_bad", case, but it should not be the case?
>
> Right, is_bad is set, which causes R5_Insync and R5_ReadError to be set
> on lines 4497-4498, and R5_Insync to be cleared on line 4554 (if
> R5_ReadError then clear R5_Insync).
>
> As mentioned in my first message and seen in
> http://onse.fi/files/reshape-infloop-issue/examine-all.txt , the MD bad
> block lists contain blocks (suspiciously identical across devices).
> So maybe the code can't properly handle the case where 10 devices have
> the same block in their bad block list.
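
That would match the per-device state masks in the log above. Assuming
the r5dev_flags bit order of the raid5.h in question (R5_UPTODATE=0,
R5_LOCKED=1, R5_Insync=4, R5_ReadError=9, R5_Expanded=11; please
double-check against the tree you patched), the ten devices reporting
0xa01 are UPTODATE|ReadError|Expanded, i.e. not Insync, which would
explain failed = 10, while the 0x811/0x813 devices still carry
R5_Insync. A small userspace sketch of that decoding, for illustration
only (not code from this thread):

#include <stdio.h>

/* Flag names in the assumed bit order; adjust if your raid5.h differs. */
static const char *r5dev_flag_names[] = {
        "UPTODATE", "LOCKED", "DOUBLE_LOCKED", "OVERWRITE",
        "Insync", "Wantread", "Wantwrite", "Overlap",
        "ReadNoMerge", "ReadError", "ReWrite", "Expanded",
};

static void decode(unsigned long state)
{
        unsigned int i;

        printf("0x%03lx =", state);
        for (i = 0; i < sizeof(r5dev_flag_names) / sizeof(r5dev_flag_names[0]); i++)
                if (state & (1UL << i))
                        printf(" %s", r5dev_flag_names[i]);
        printf("\n");
}

int main(void)
{
        decode(0x813);  /* devs 0 and 19: UPTODATE LOCKED Insync Expanded */
        decode(0x811);  /* devs 1-3, 8-11, 15: UPTODATE Insync Expanded */
        decode(0xa01);  /* the other ten: UPTODATE ReadError Expanded */
        return 0;
}
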
> Not quite sure what "handle"
> should mean in this case but certainly something else than a
> handle_stripe() loop :)
> There is a "bad" block on 10 devices on sector 198504960, which I guess
> matches sh->sector 198248960 due to data offset of 256000 sectors (per
> --examine).

OK, it makes sense now. I didn't add the data offset when checking the
bad block data.

>
> I've wondered if "dd if=/dev/md0 of=/dev/md0" for the affected blocks
> would clear the bad blocks and avoid this issue, but I haven't tried
> that yet so that the infinite loop issue can be investigated/fixed
> first. I already checked that /dev/md0 is fully readable (which also
> confuses me a bit since md(8) says "Attempting to read from a known bad
> block will cause a read error"... maybe I'm missing something).

Maybe try these steps?
https://raid.wiki.kernel.org/index.php/The_Badblocks_controversy#How_do_I_fix_a_Bad_Blocks_problem.3F

Thanks,
Song
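
P.S. The offset arithmetic above checks out: with the 256000-sector
Data Offset from --examine, stripe sector 198248960 + 256000 =
198504960 on each member device, which is exactly the entry repeated in
the bad block lists. If you want to compare the kernel's in-memory view
of those lists with the on-disk copy, the dev-*/bad_blocks files under
/sys/block/md0/md/ should show the former and "mdadm
--examine-badblocks" on a member device the latter (as far as I
remember, both list "sector length" pairs).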