Re: strange problem with raid6 read errors on active non-degraded array


 



Hi Neil,

Thanks for the very informative answer; that nailed it, and Ethan was obviously onto it too!

   I tried running the commands you posted, but I get an error.

   bb.sh contains:

     sed 's/^/-/' /sys/block/md0/md/dev-sdq/bad_blocks
     while read; do
        echo $a > /sys/block/md0/md/dev-sdq/bad_blocks
     done

   Running it gives:

     root@nas3:~# ./bb.sh
     -211996592 128
     ./bb.sh: line 3: echo: write error: Invalid argument
   Can you help me with this?
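
   Looking at it again, my guess at what went wrong: when I copied the script
   into bb.sh I dropped the pipe between the sed and the while loop, and
   "read" without a variable name leaves $a unset, so an empty line gets
   written to bad_blocks. A corrected sketch of what I think was intended
   (still only for sdq):

     #!/bin/sh
     # pipe the "-<sector> <count>" lines into the loop and read each one
     # into a variable before writing it back to the bad_blocks file
     sed 's/^/-/' /sys/block/md0/md/dev-sdq/bad_blocks |
     while read -r a; do
         echo "$a" > /sys/block/md0/md/dev-sdq/bad_blocks
     done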

I will clear all the bad blocks on all the drives, force a repair, and see if any errors show up.  If not, I will then fsck the filesystem.
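
For the repair step I am planning to use the md sysfs interface, roughly:

   echo repair > /sys/block/md0/md/sync_action   # start a repair pass
   watch cat /proc/mdstat                        # follow its progress
   cat /sys/block/md0/md/mismatch_cnt            # non-zero means mismatches were found

(Just a sketch; I'll double-check the paths on the machine first.)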

I'm not sure how the volume failed. One Friday morning last month I checked the system and everything was OK (no dmesg errors, and mdstat reported all disks up). The next Monday I got a call telling me that the volume was inaccessible. When I got back the following Thursday, the machine had already been rebooted and the md0 volume had three failed disks. I did a --examine and two of them were completely off in terms of event count compared to the non-failed disks; the third was much closer, but still a bit off. Not close enough to do a --assemble --force, so I recreated the array with something like this:

"mdadm --create --assume-clean --level=6 --raid-devices=16 --name=nas3:Datastore --uuid=9e97c588:59135324:c7d3fdf6:e543bdc3 /dev/md0 /dev/sde /dev/sdc /dev/sdd /dev/sdb /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl missing /dev/sdn missing /dev/sdp /dev/sdq". I think the last failed drive was sdl or sdn, can't remember.


Then I cleared the superblocks on the missing disks and re-added them. Then I fsck'd the filesystem, and that is when I started getting those errors. I have since replaced them with new disks and tested the old ones, only to find that they report no SMART errors (SMART is enabled in the BIOS); I also ran a read-write test on them and they came back fine.
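
(The tests on the pulled disks were roughly the following, with /dev/sdX
standing in for each old drive; the badblocks pass is destructive:)

   smartctl -a /dev/sdX     # no errors logged
   badblocks -wsv /dev/sdX  # destructive write+read pass, came back clean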

I will rebuild this machine next weekend, or the one after, to try to rule out a hardware or cabling problem, but I am inclined to say that it may be related to the SSHDs.


Cheers
Pedro


   Quoting NeilBrown <neilb@xxxxxxx>:
On Wed, 02 Jul 2014 12:54:34 +0100 Pedro Teixeira <finas@xxxxxxxx> wrote:
CPU is a Phenom X6, 8 GB RAM. Controller is an LSI 9201-16i. HDDs are
   Seagate ST1000DX001 SSHDs.

   So I ran "dd if=/dev/md0 of=/dev/null bs=4096" and it failed in a lot
   of places. I had to restart the command several times with the skip
   parameter set to a couple of blocks after the last block error.
   It ran through about 1.5TB of the 13TB volume.
   The md volume didn't drop any drive while this was running.

   dmesg showed:

   [ 1678.478156] Buffer I/O error on device md0, logical block 196012546
  I love numbers, thanks.
  The logical block size is 4096, or 8 sectors (1 sector is defined as 512
  bytes), so this is at
  196012546*8 == 1568100368 sectors into the array.

  The array has a chunk size of 512K, or 1024 sectors, so
  196012546*8/1024 = 1531348.015625

  gives us the chunk number, and the remaining fraction of a chunk.

  The RAID6 has 16 devices, so there are 14 data chunks in each stripe, so to
  find where the above chunk is stored we divide by 14

    1531348/14 = 109382.0000

  So that is chunk 109382 on the first device (though with rotating data,
  it might not be the very first).

  Add back in the fractional part, multiply by 1024 sectors per chunk, and add
  the Data Offset:

  109382.01562500*1024+262144 = 112269328

  So it seems that sector 112269328 on some device is bad.
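
  (The same arithmetic in shell, assuming the geometry above, reproduces that
  number:)

    lblock=196012546                     # failing logical block from dmesg
    sector=$((lblock * 8))               # 4096-byte blocks -> 512-byte sectors
    chunk=$((sector / 1024))             # 512K chunks = 1024 sectors each
    in_chunk=$((sector % 1024))          # offset within that chunk
    dev_chunk=$((chunk / 14))            # 16 devices -> 14 data chunks per stripe
    echo $((dev_chunk * 1024 + in_chunk + 262144))   # + Data Offset = 112269328
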
The command "mdadm --examine-badblocks /dev/sd[bcdefghijklmnopqr] >>
   raid.b" before and after running the "dd" command returned no changes:
I hadn't noticed that the bad block logs were not empty before, sorry.
  Anyway: ...
    > Bad-blocks on /dev/sdb:
    >            112269328 for 512 sectors
  Look at that - exactly the number I calculated.  I love it when that works
  out.

  So the problem is exactly that some blocks are thought by md to be bad.


  Blocks get recorded as bad (for raid6) when:

  - a 'read' reported an error which could not be fixed, either
    because the array was degraded so the data could not be recovered,
    or because the attempt to write restored data failed
  - when recovering a spare, if the data to be written cannot be found
    (due to errors on other devices)
  - when a 'write' request to a device fails

  When your array had three failed devices, some reads and writes would have
  failed.  Maybe that caused the bad blocks to be recorded.
  What sort of device failures were they?  If the device became completely
  inaccessible, then it would not have been possible to record the bad block
  information.

  Can you describe the sequence of events that led to the three failures?
  When you put the array back together, did you --create it, or --assemble
  --force?

There isn't an easy way to remove the bad block list, as doing so is normally
  asking for data corruption.
  However it is probably justified in your case.
As it happens I included code in the kernel to make it possible to remove bad
  blocks from the list - it was intended for testing only but I never removed
  it.
  If you run
  sed 's/^/-/' /sys/block/md0/md/dev-sdq/bad_blocks |
  while read; do
      echo $a > /sys/block/md0/md/dev-sdq/bad_blocks
  done

  then it should clear all of the bad blocks recorded on sdq.
  You should probably fail/remove the last two devices that you added to the
  array before you do this, as they probably don't have properly up-to-date
  information and doing this will cause corruption.
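
  (For the fail/remove step, the manage-mode commands would be along these
  lines, with sdX and sdY standing for whichever two devices were re-added
  last:)

    mdadm /dev/md0 --fail /dev/sdX --remove /dev/sdX
    mdadm /dev/md0 --fail /dev/sdY --remove /dev/sdY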

  I probably need to think about better ways to handle the bad block lists.
  NeilBrown
