On Mon, Jan 01 2018, RQM wrote:

> Hello everyone,
>
> I hope this list is the right place to ask the following:
>
> I've got a 5-disk RAID-5 array that was built by a QNAP NAS device,
> which has recently failed (I suspect a faulty SATA controller or
> backplane).  I migrated the disks to a desktop computer that runs
> Debian stretch (kernel 4.9.65-3+deb9u1 amd64) and mdadm version 3.4.
> Although the array can be assembled, I encountered the following error
> in my dmesg output ([1], recorded directly after a recent reboot and
> fsck attempt) when running fsck:
>
>   Buffer I/O error on dev md0, logical block 1598030208, async page read
>
> I can reliably reproduce that error by trying to read from the md0
> device.  It is always the same block, including across reboots.
>
> I suspected that one of the drives involved might be faulty.  Although
> SMART errors have been logged [2], they are not recent enough to
> correlate with the fsck run.  Also, sha1sum completed without error on
> every one of the individual disk devices /dev/sd[b-f], so reading from
> the drives does not provoke an error.
>
> Finally, I tried scrubbing the array by writing "repair" to
> md/sync_action.  The process completed without any output to dmesg or
> signs of trouble in /proc/mdstat.  However, reading from the array
> still fails at the same block as above, 1598030208.
>
> Here is the output of mdadm --detail /dev/md0: [3]
>
> I assume the md driver would know what exactly the problem is, but I
> don't know where to look to find that information.  How can I proceed
> in troubleshooting this issue?
>
> FYI, I had posted this on serverfault [4] previously, but unfortunately
> didn't arrive at a conclusion.
>
> Thank you very much in advance!
>
> [1] https://paste.ubuntu.com/26303735/
> [2] https://paste.ubuntu.com/26303737/
> [3] https://paste.ubuntu.com/26303754/
> [4] https://serverfault.com/questions/889687/troubleshooting-buffer-i-o-error-on-software-raid-md-device

This is truly weird.  I'd even go so far as to say that it cannot
possibly happen (but I've been wrong before).

Step one is to confirm that it is easy to reproduce.  Does

  dd if=/dev/md0 bs=4K skip=1598030208 count=1 of=/dev/null

trigger the message reliably?  To check that "4K" is the correct block
size, run

  blockdev --getbsz /dev/md0

and use whatever number it gives as 'bs='.  If you cannot reproduce it
like that, try a larger count, and then a smaller skip with a large
count.

Once you can reproduce it with minimal I/O, do

  echo file:raid5.c +p > /sys/kernel/debug/dynamic_debug/control
  # repeat experiment
  echo file:raid5.c -p > /sys/kernel/debug/dynamic_debug/control

and report the messages that appear in 'dmesg'.  Also report "mdadm -E"
of each member device, and the kernel version (though I see that is in
the serverfault report: 4.9.30-2+deb9u5).

Then run

  blktrace /dev/md0 /dev/sd[acdef]

in one window while reproducing the error again in another window, then
interrupt the blktrace.  This will produce several blktrace output
files.  Create a tar.gz of these and put them somewhere that I can get
them - hopefully they won't be too big.

With all this information, I can poke around and will hopefully be able
to explain in fine detail exactly why this cannot possibly happen
(unless it turns out that I'm wrong again).

Thanks,
NeilBrown
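
P.S.  In case it is useful, here is a rough, untested sketch that
strings the steps above together in one go.  It assumes the array is
/dev/md0, the failing logical block is 1598030208, and the member
devices are /dev/sd[b-f] as in your report - substitute whatever
"mdadm --detail /dev/md0" actually lists.  It also uses the
space-separated dynamic_debug syntax ('file raid5.c +p') as documented
in the kernel.

  #!/bin/sh
  # Rough, untested sketch of the debugging steps described above.
  # Run as root, with debugfs mounted on /sys/kernel/debug, from a
  # scratch directory (blktrace writes its files to the current dir).

  MD=/dev/md0
  MEMBERS="/dev/sd[b-f]"          # adjust to the real member devices
  BLK=1598030208                  # failing logical block from dmesg

  BS=$(blockdev --getbsz "$MD")   # confirm the block size (likely 4096)
  echo "block size of $MD: $BS"

  # Report the metadata of every member device.
  for d in $MEMBERS; do
      mdadm -E "$d"
  done

  # Enable raid5.c debug output, reproduce the error, disable it again,
  # then capture whatever md/raid5 logged.
  echo 'file raid5.c +p' > /sys/kernel/debug/dynamic_debug/control
  dd if="$MD" bs="$BS" skip="$BLK" count=1 of=/dev/null
  echo 'file raid5.c -p' > /sys/kernel/debug/dynamic_debug/control
  dmesg | tail -n 100

  # Trace the I/O on the array and its members while reproducing the
  # error once more, then stop the trace and pack up the output files.
  blktrace "$MD" $MEMBERS &
  TRACE_PID=$!
  sleep 2
  dd if="$MD" bs="$BS" skip="$BLK" count=1 of=/dev/null
  sleep 2
  kill -INT "$TRACE_PID"
  wait "$TRACE_PID" 2>/dev/null
  tar czf md0-blktrace.tar.gz ./*.blktrace.*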