On Wed, 02 Jul 2014 12:54:34 +0100 Pedro Teixeira <finas@xxxxxxxx> wrote:

> cpu is a phenom x6, 8gb ram. controller is LSI 9201-i16. hdd's are
> seagate sshd ST1000DX001.
>
> So I ran the "dd if=/dev/md0 of=/dev/null bs=4096" command and it
> failed in a lot of places. I had to restart the command several times
> with the skip parameter set to a couple of blocks after the last block
> error. It ran for about 1.5TB of the total 13TB of the volume.
> The md volume didn't drop any drive while this was running.
>
> dmesg showed:
>
> [ 1678.478156] Buffer I/O error on device md0, logical block 196012546

I love numbers, thanks.

The logical block size is 4096 bytes, or 8 sectors (1 sector is defined
as 512 bytes), so this error is at 196012546*8 == 1568100368 sectors into
the array.

The array has a chunk size of 512K, or 1024 sectors, so

   196012546*8/1024 = 1531348.015625

gives us the chunk number, plus the remaining fraction of a chunk.

The RAID6 has 16 devices, so there are 14 data chunks in each stripe. To
find where the above chunk is stored, we divide by 14:

   1531348/14 = 109382.0000

So that is chunk 109382 on the first device (though with rotating data,
it might not be the very first). Add back in the fractional part,
multiply by 1024 sectors per chunk, and add the Data Offset:

   109382.01562500*1024+262144 = 112269328

So it seems that sector 112269328 on some device is bad.

> The command "mdadm --examine-badblocks /dev/sd[bcdefghijklmnopqr] >>
> raid.b" before and after running the "dd" command returned no changes:

I didn't notice the fact that the bad block logs were not empty before,
sorry. Anyway:

> Bad-blocks on /dev/sdb:
>   112269328 for 512 sectors

Look at that - exactly the number I calculated. I love it when that
works out.

So the problem is exactly that some blocks are thought by md to be bad.
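For anyone who wants to repeat the arithmetic above, here is a small
sketch of the same calculation in shell, using integer arithmetic (the
fractional chunk part becomes a sector remainder). The geometry values
are taken from this thread - 4096-byte logical blocks, 512K chunks, a
16-device RAID6 (14 data chunks per stripe), and a Data Offset of 262144
sectors - so adjust them for any other array:

```shell
# Map a failing md logical block to a sector on a member device.
# Values below are the ones from this thread; change them for your array.
LBLOCK=196012546                        # logical block from dmesg
SECTORS=$(( LBLOCK * 8 ))               # 4096-byte blocks -> 512-byte sectors
CHUNK=$(( SECTORS / 1024 ))             # 512K chunk = 1024 sectors
OFFSET_IN_CHUNK=$(( SECTORS % 1024 ))   # position within the chunk
DEV_CHUNK=$(( CHUNK / 14 ))             # 14 data chunks per 16-device RAID6 stripe
DEV_SECTOR=$(( DEV_CHUNK * 1024 + OFFSET_IN_CHUNK + 262144 ))  # + Data Offset
echo "$DEV_SECTOR"                      # prints 112269328
```

The result matches the entry in the /dev/sdb bad-block list below. Note
this gives the sector on *some* member device; with rotating parity the
device holding that chunk is not necessarily the first one.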
Blocks get recorded as bad (for raid6) when:

 - a 'read' reported an error which could not be fixed, either because
   the array was degraded so the data could not be recovered, or because
   the attempt to write the restored data failed
 - while recovering a spare, the data to be written cannot be found (due
   to errors on other devices)
 - a 'write' request to a device fails

When your array had three failed devices, some reads and writes would
have failed. Maybe that caused the bad blocks to be recorded.

What sort of device failures were they? If a device became completely
inaccessible, then it would not have been possible to record the bad
block information. Can you describe the sequence of events that led to
the three failures?

When you put the array back together, did you --create it, or
--assemble --force?

There isn't an easy way to remove the bad block list, as doing so is
normally asking for data corruption. However, it is probably justified
in your case. As it happens, I included code in the kernel to make it
possible to remove bad blocks from the list - it was intended for
testing only, but I never removed it.

If you run

   sed 's/^/-/' /sys/block/md0/md/dev-sdq/bad_blocks |
   while read a; do
       echo $a > /sys/block/md0/md/dev-sdq/bad_blocks
   done

then it should clear all of the bad blocks recorded on sdq.

You should probably fail/remove the last two devices that you added to
the array before you do this, as they probably don't have properly
up-to-date information, and doing this will cause corruption.

I probably need to think about better ways to handle the bad block
lists.

NeilBrown
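To make the clearing loop above concrete without touching sysfs: each
line of a bad_blocks file is "start-sector length", and the sed prefixes
a "-", which the testing interface interprets as a request to remove
that range. A dry run of just the transform, using the entry reported
for /dev/sdb in this thread:

```shell
# Show what the sed stage would write back for one bad_blocks entry.
# "112269328 512" is the range from this thread; no sysfs write happens here.
printf '112269328 512\n' | sed 's/^/-/'
# prints: -112269328 512
```

Writing that "-112269328 512" line into the device's bad_blocks file is
what actually removes the entry; the while-read loop simply does this
for every recorded range.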