On Tue, Jan 02 2018, RQM wrote:

> Hello,
>
> thanks for the quick and helpful responses! Answers inline:
>
>> Step one is to confirm that it is easy to reproduce.
>> Does
>>   dd if=/dev/md0 bs=4K skip=1598030208 count=1 of=/dev/null
>> trigger the message reliably?
>> To check that "4K" is the correct blocksize, run
>>   blockdev --getbsz /dev/md0
>> and use whatever number it gives as 'bs='.
>
> blockdev does indeed report a blocksize of 4096, and the dd line does
> reliably trigger
>   dd: error reading '/dev/md0': Input/output error
> and the same line in dmesg as before.
>
>> Once you can reproduce with minimal IO, do
>>   echo file:raid5.c +p > /sys/kernel/debug/dynamic_debug/control
>> repeat the experiment, then
>>   echo file:raid5.c -p > /sys/kernel/debug/dynamic_debug/control
>> and report the messages that appear in 'dmesg'.
>
> I had to replace the colon with a space in those two lines (otherwise I
> would get "bash: echo: write error: Invalid argument"), but after that,
> this is what I got in dmesg:
> https://paste.ubuntu.com/26305369/

[Tue Jan  2 11:14:47 2018] locked=0 uptodate=0 to_read=1 to_write=0 failed=2 failed_num=3,2

So for this stripe, two devices appear to be failed: 3 and 2. As the two
devices clearly are thought to be working, there must be a bad block
recorded.

>> Also report "mdadm -E" of each member device, and the kernel version
>> (though I see that is in the serverfault report: 4.9.30-2+deb9u5).
>
> mdadm -E says: https://paste.ubuntu.com/26305379/

I needed "mdadm -E" of the components of the array, i.e. the partitions
rather than the whole devices - e.g. /dev/sdb1, not /dev/sdb. This will
show a non-empty bad block list on at least two devices.

You can remove the bad block by over-writing it:

  dd if=/dev/zero of=/dev/md0 bs=4K seek=1598030208 count=1

though that might corrupt some file containing the block.
(Note that "seek" seeks in the output file, while "skip" skips over the
input file.)

How did the bad block get there?
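As an aside, before turning to that question: the skip/seek distinction
above is easy to get wrong, so here is a safe illustration using
ordinary scratch files instead of the array (file names and the tiny
bs=4 are made up for readability; on /dev/md0 you would use bs=4096):

```shell
# Demonstrate dd's skip (skips blocks of the INPUT) vs seek (skips
# blocks of the OUTPUT). Safe to run anywhere - no devices touched.
cd "$(mktemp -d)"

# A 4-block input file: blocks AAAA, BBBB, CCCC, DDDD.
printf 'AAAABBBBCCCCDDDD' > in.bin

# skip=2 reads only the third input block (like skip=1598030208 above).
dd if=in.bin bs=4 skip=2 count=1 of=block.bin 2>/dev/null
cat block.bin        # prints: CCCC

# seek=2 skips two OUTPUT blocks before writing, so only the third
# block of out.bin is overwritten; conv=notrunc keeps the rest intact.
printf 'AAAABBBBCCCCDDDD' > out.bin
printf 'ZZZZ' | dd of=out.bin bs=4 seek=2 count=1 conv=notrunc 2>/dev/null
cat out.bin          # prints: AAAABBBBZZZZDDDD
```

Without conv=notrunc, dd would truncate out.bin at the end of the
write, which is why it matters when patching a block in place.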
A possible scenario is:

- A device fails and is removed from the array.
- A read error occurs on another device. Rather than failing the whole
  device, md records that block as bad.
- The failed device is replaced (or found to be a cabling problem) and
  recovered. Due to the bad block, the stripe cannot be recovered, so a
  bad block is recorded on the new device.

If the read error was really a cabling problem, then the original data
might still be there. If it is, you could recover it and write it back
to the array rather than writing from /dev/zero.

Finding out which file the failed block is part of is probably possible,
but not necessarily easy. If you want to try, the first step is
reporting what filesystem is on md0. If it is ext4, then debugfs can
help. If it is something else - I don't know.

NeilBrown

> The kernel has been updated between the serverfault post and my first
> mail to this list to 4.9.65-3+deb9u1. No changes since.
>
>> Then run
>>   blktrace /dev/md0 /dev/sd[acdef]
>> in one window while reproducing the error again in another window.
>> Then interrupt the blktrace. This will produce several blktrace*
>> files. Create a tar.gz of these and put it somewhere that I can get
>> it - hopefully they won't be too big.
>
> I had to adjust the last blktrace argument to /dev/sd[b-f] since after
> the last reboot the names of the drives have changed, but here's the
> output:
> https://filebin.ca/3mnjUz1OIXqm/blktrace-out.tar.gz
> I also included the blktrace terminal output in there.
>
> Thank you so much for the effort! Please let me know if you need anything.
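For reference, the debugfs block-to-file lookup mentioned above can be
rehearsed safely on a scratch ext4 image before pointing it at the real
array. This is only a sketch of the technique: fs.img and sample.txt
are made-up names, and on the real system you would run the icheck step
against /dev/md0 with the bad block number (translated to the
filesystem's block size if it differs from 4096):

```shell
# Safe demo of debugfs icheck (block -> inode) and ncheck (inode -> path)
# on a throwaway ext4 image; no mount or root privileges needed.
cd "$(mktemp -d)"
dd if=/dev/zero of=fs.img bs=1M count=8 2>/dev/null
mkfs.ext4 -q -F fs.img

# Put a file into the image (debugfs -w can write without mounting).
echo 'hello' > sample.txt
debugfs -w -R 'write sample.txt sample.txt' fs.img 2>/dev/null

# Which blocks does the file occupy?
BLK=$(debugfs -R 'blocks sample.txt' fs.img 2>/dev/null | awk '{print $1}')

# icheck: block number -> inode number. This is the step you would run
# against /dev/md0 with the failing block number.
INO=$(debugfs -R "icheck $BLK" fs.img 2>/dev/null | awk 'NR==2 {print $2}')

# ncheck: inode number -> path name of the affected file.
debugfs -R "ncheck $INO" fs.img 2>/dev/null
```

If ncheck names a file you have a backup of, you can restore it after
over-writing the bad block; if it names free space or a dispensable
file, the /dev/zero overwrite is less painful.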
Attachment:
signature.asc
Description: PGP signature