On Wed, Oct 30, 2019 at 11:25 AM Anssi Hannula <anssi.hannula@xxxxxx> wrote:
>
> Song Liu wrote on 2019-10-29 23:55:
> > On Tue, Oct 29, 2019 at 1:45 PM Anssi Hannula <anssi.hannula@xxxxxx> wrote:
> >>
> >> Song Liu wrote on 2019-10-29 22:28:
> >> > On Tue, Oct 29, 2019 at 12:05 PM Anssi Hannula <anssi.hannula@xxxxxx> wrote:
> >> >>
> >> >> Song Liu wrote on 2019-10-29 08:04:
> >> >> > I guess we get into the "is_bad" case, but it should not be the case?
> >> >>
> >> >> Right, is_bad is set, which causes R5_Insync and R5_ReadError to be set
> >> >> on lines 4497-4498, and R5_Insync to be cleared on line 4554 (if
> >> >> R5_ReadError then clear R5_Insync).
> >> >>
> >> >> As mentioned in my first message and seen in
> >> >> http://onse.fi/files/reshape-infloop-issue/examine-all.txt , the MD bad
> >> >> block lists contain blocks (suspiciously identical across devices).
> >> >> So maybe the code can't properly handle the case where 10 devices have
> >> >> the same block in their bad block list. Not quite sure what "handle"
> >> >> should mean in this case, but certainly something other than a
> >> >> handle_stripe() loop :)
> >> >> There is a "bad" block on 10 devices on sector 198504960, which I guess
> >> >> matches sh->sector 198248960 due to the data offset of 256000 sectors
> >> >> (per --examine).
> >> >
> >> > OK, it makes sense now. I didn't add the data offset when checking the
> >> > bad block data.
> >> >
> >> >>
> >> >> I've wondered if "dd if=/dev/md0 of=/dev/md0" for the affected blocks
> >> >> would clear the bad blocks and avoid this issue, but I haven't tried
> >> >> that yet so that the infinite loop issue can be investigated/fixed
> >> >> first. I already checked that /dev/md0 is fully readable (which also
> >> >> confuses me a bit since md(8) says "Attempting to read from a known bad
> >> >> block will cause a read error"... maybe I'm missing something).
> >> >>
> >> >
> >> > Maybe try these steps?
> >> >
> >> > https://raid.wiki.kernel.org/index.php/The_Badblocks_controversy#How_do_I_fix_a_Bad_Blocks_problem.3F
> >>
> >> Yeah, I guess those steps would probably resolve my situation. BTW,
> >> "--update=force-no-bbl" is not mentioned in mdadm(8); is that on purpose?
> >> I was trying to find such an option earlier.
> >>
> >> If you don't need anything more from the array, I'll go ahead and try
> >> clearing the seemingly bogus bad block lists.
> >
> > Please go ahead. We already got quite a few logs.
>
> Seems that was indeed the issue; clearing the bad block log allowed the
> reshape to continue normally.

That's great news!

Song
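
For anyone reading this thread later, the sector arithmetic discussed above
can be checked with a few lines of Python. This is only a sketch. It assumes,
as guessed in the thread, that a bad block list entry is counted from the
start of the member device, so subtracting the data offset reported by
--examine gives the matching sh->sector. The function name is hypothetical
and is not part of mdadm or the kernel.

```python
def bbl_sector_to_stripe_sector(bbl_sector: int, data_offset: int) -> int:
    """Map a bad block list sector to the per-device stripe sector.

    Assumption (from the thread, not verified against the kernel source):
    the bad block list sector includes the data offset, while sh->sector
    is counted from the start of the data area.
    """
    if bbl_sector < data_offset:
        raise ValueError("sector lies before the start of the data area")
    return bbl_sector - data_offset


# Values quoted in the thread (both in 512-byte sectors).
BAD_BLOCK_SECTOR = 198504960
DATA_OFFSET = 256000

stripe_sector = bbl_sector_to_stripe_sector(BAD_BLOCK_SECTOR, DATA_OFFSET)
print(stripe_sector)  # 198248960, the sh->sector seen in the loop
```

Forgetting the offset in this comparison is what made the bad block entries
look unrelated to the looping stripe at first.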