Re: RAID6 gets stuck during reshape with 100% CPU

"Wol's lists" <antlists@xxxxxxxxxxxxxxx> · Tue, 29 Oct 2019 21:52:05 +0000

On 29/10/2019 19:05, Anssi Hannula wrote:
As mentioned in my first message and seen in 
http://onse.fi/files/reshape-infloop-issue/examine-all.txt , the MD bad 
block lists contain blocks (suspiciously identical across devices).

So maybe the code can't properly handle the case where 10 devices have 
the same block in their bad block list. Not quite sure what "handle" 
should mean in this case but certainly something else than a 
handle_stripe() loop :)

There is a "bad" block on 10 devices on sector 198504960, which I guess 
matches sh->sector 198248960 due to data offset of 256000 sectors (per 
--examine).
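
The offset arithmetic in the paragraph above can be checked with a small 
sketch. It assumes, as the poster guesses, that bad-block list entries 
are counted from the start of the member device while sh->sector is 
counted from the start of the data area; nothing in this thread confirms 
that. The function name is made up for illustration.

```python
# Assumption (not confirmed in the thread): bad-block entries are relative
# to the start of the member device, sh->sector to the start of the data area.
DATA_OFFSET_SECTORS = 256000  # data offset reported by --examine


def to_data_relative(device_sector: int,
                     data_offset: int = DATA_OFFSET_SECTORS) -> int:
    """Convert a device-relative sector to a data-area-relative sector."""
    if device_sector < data_offset:
        raise ValueError("sector lies before the data area")
    return device_sector - data_offset


if __name__ == "__main__":
    print(to_data_relative(198504960))  # 198248960
```

Under that assumption the two sector numbers quoted above do line up.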

I've wondered if "dd if=/dev/md0 of=/dev/md0" for the affected blocks 
would clear the bad blocks and avoid this issue, but I haven't tried 
that yet so that the infinite loop issue can be investigated/fixed 
first. I already checked that /dev/md0 is fully readable (which also 
confuses me a bit since md(8) says "Attempting to read from a known bad 
block will cause a read error"... maybe I'm missing something).

Hmmm ...

Bear in mind that bad-blocks is considered by many an anti-feature, and 
it's strongly suspected that identical bad-block lists across multiple 
disks are the result of a bug ...
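
To see which entries really are shared between members, the lists could 
be compared programmatically. This is a minimal sketch, assuming the 
per-device lists can be read from /sys/block/md0/md/dev-*/bad_blocks and 
that each line there has the form "start_sector length". The path 
pattern, the line format and the helper names are assumptions for 
illustration; please verify them against your kernel's md documentation.

```python
import glob
from collections import Counter


def parse_bad_blocks(text):
    """Parse 'start length' lines into a set of (start, length) tuples."""
    ranges = set()
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank or unexpected lines
            ranges.add((int(parts[0]), int(parts[1])))
    return ranges


def shared_ranges(per_device, min_devices=2):
    """Return {range: device_count} for ranges on at least min_devices."""
    counts = Counter()
    for ranges in per_device.values():
        counts.update(ranges)
    return {r: n for r, n in counts.items() if n >= min_devices}


if __name__ == "__main__":
    per_device = {}
    # Assumed sysfs location of each member's bad-block list.
    for path in glob.glob("/sys/block/md0/md/dev-*/bad_blocks"):
        with open(path) as f:
            per_device[path] = parse_bad_blocks(f.read())
    for (start, length), n in sorted(shared_ranges(per_device).items()):
        print(f"sector {start} (+{length}) listed on {n} devices")
```

An entry that shows up on more members than the array has parity devices 
(more than two for RAID6) is the suspicious case discussed here.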

I hesitate to suggest trying to clear the bad-blocks, but doing a dd will 
almost certainly not do what you want - the md bad blocks list is 
implemented within the md layer, so doing something with dd is unlikely 
to touch it.

Plus, since md is a software implementation, you should NEVER under 
normal circumstances have any bad blocks - it doesn't make sense - so 
it's pretty certain you've fallen foul of a bug in the bad blocks setup.

Sorry I can't offer any solutions, other than very hesitantly suggesting 
just an --assemble with --update=force-no-bbl (I think that's the option, 
but check mdadm(8) first).

Hopefully this gives you a few ideas ...

Cheers,
Wol