On 1/19/19 9:02 AM, Wols Lists wrote:
On 19/01/19 13:30, Carsten Aulbert wrote:
Hi
On 1/19/19 2:21 PM, Basil Mohamed Gohar wrote:
I have two drives of the 4-drive RAID6 array visible, but no files are
accessible. Because it's a RAID6, I need at least 3 of the 4 drives
working, and my problem is that two of them are experiencing this problem.
Hmm, that would be surprising, as RAID6 should offer a two disk
redundancy, i.e. any two disks may fail and you should still be able to
access your data - albeit without any extra safety net.
That was my reaction - RAID6 should survive two drive failures. Although
I think *any* drive failure will result in the array failing to start
until you force it - if it has been running degraded it will restart in
the same configuration, but if it newly degrades it won't restart without a force.
Check that out.
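For reference, a forced assembly would typically look something like the
following - the array and member names here are only placeholders for
whatever yours turn out to be, and read-only is the safer way to try it first:

  mdadm --stop /dev/mdN                  # mdN and sdW..sdZ are placeholders
  mdadm --assemble --force --readonly /dev/mdN /dev/sdW /dev/sdX /dev/sdY /dev/sdZ

But hold off on forcing anything until you've examined all the members and
know what state they think they're in.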
This is challenging because it is in a tower array and all the drives
connect straight to a motherboard-like backplane. I took one out and
worked with it directly via a USB SATA adapter, but that did not change
the errors I was seeing.
OK, I just wanted to make sure that the error "stayed" with the drives.
Yes, they are. SMART reports no fatal errors on the drives in question!
OK, at least that.
What may help me is if there are any tools for md devices that let me
peek into the on-disk structure. Since the ext4 file system is spread
across the 3 data drives in the array, I cannot use, for example, e2fsck
on just one of them, and since I cannot properly assemble the array, I
am somewhat stuck. Are there any tools for examining an array of drives
even if it is not recognized as such? I don't know, for example, if some
sectors went bad, how to tell mdadm to look in alternate locations
(i.e., akin to ext4's alternative superblock locations).
As indicated above, a four-disk RAID6 "only" gives you two data disks'
worth of capacity. RAID6 does not write copies of the data but generates
parity stripes for the two extra disks, from which it can compute back
what should have been on the data stripes of the failed disks. But
reverse engineering this "manually" is probably not easy.
Thus, first of all, we should really establish what the underlying layout
was - i.e., can you send us the output of /proc/mdstat?
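Ideally also what the superblocks themselves say, i.e. for each member
disk (substitute the real device names for sdX):

  cat /proc/mdstat
  mdadm --examine /dev/sdX               # repeat for each member disk

mdadm --examine dumps what the md superblock on that single disk records -
RAID level, number of devices, layout, chunk size and that disk's role -
which should be pretty much the "peek into the on-disk structure" you were
asking for.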
Might be too late for that. Two tools that are probably useful are Phil
Turmel's lsdrv, and I saw wipefs mentioned a few days ago here - there's
an option to do nothing that just gives you info.
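If I remember correctly that option is -n / --no-act; plain wipefs with
nothing but the device name also just lists what it finds without writing
anything, e.g. (sdX being whichever drive you want to inspect):

  wipefs /dev/sdX                        # list detected signatures, read-only
  wipefs --no-act --all /dev/sdX         # dry run of what --all would erase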
https://raid.wiki.kernel.org/index.php/Linux_Raid#When_Things_Go_Wrogn
Cheers,
Wol
Thanks. wipefs provided some information that I think may be helpful.
The two still-reporting drives in the array report as follows:
wipefs /dev/sdi
DEVICE OFFSET        TYPE              UUID                                 LABEL
sdi    0x1000        linux_raid_member cd6470cb-1aa3-03fd-1027-706e5fd0606d alpha.hidayahonline.net:3
sdi    0x74702555e00 gpt
sdi    0x1fe         PMBR

wipefs /dev/sdg
DEVICE OFFSET        TYPE              UUID                                 LABEL
sdg    0x1000        linux_raid_member cd6470cb-1aa3-03fd-1027-706e5fd0606d alpha.hidayahonline.net:3
sdg    0x74702555e00 gpt
sdg    0x1fe         PMBR
For the two drives that are not reporting (sdc & sde), I get nothing but these
same errors (which I mentioned earlier) in dmesg:
[118345.203138] sd 2:0:0:0: [sdc] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[118345.203154] sd 2:0:0:0: [sdc] tag#0 Sense Key : Hardware Error [current]
[118345.203158] sd 2:0:0:0: [sdc] tag#0 Add. Sense: Internal target failure
[118345.203162] sd 2:0:0:0: [sdc] tag#0 CDB: Read(16) 88 00 00 00 00 00 00 00 00 08 00 00 00 08 00 00
[118345.203165] print_req_error: critical target error, dev sdc, sector 8
[118345.328209] sd 2:0:0:0: [sdc] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[118345.328217] sd 2:0:0:0: [sdc] tag#0 Sense Key : Illegal Request [current]
[118345.328222] sd 2:0:0:0: [sdc] tag#0 Add. Sense: Invalid field in cdb
[118345.328228] sd 2:0:0:0: [sdc] tag#0 CDB: Read(16) 88 00 00 00 00 00 00 00 00 08 00 00 00 08 00 00
[118345.328232] print_req_error: critical target error, dev sdc, sector 8
[118345.328240] Buffer I/O error on dev sdc, logical block 1, async page read
[118347.813267] sd 3:0:0:1: [sde] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[118347.813274] sd 3:0:0:1: [sde] tag#0 Sense Key : Medium Error [current]
[118347.813279] sd 3:0:0:1: [sde] tag#0 Add. Sense: Unrecovered read error
[118347.813285] sd 3:0:0:1: [sde] tag#0 CDB: Read(16) 88 00 00 00 00 00 00 00 00 08 00 00 00 08 00 00
[118347.813288] print_req_error: critical medium error, dev sde, sector 8
[118347.821409] sd 3:0:0:1: [sde] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[118347.821415] sd 3:0:0:1: [sde] tag#0 Sense Key : Medium Error [current]
[118347.821418] sd 3:0:0:1: [sde] tag#0 Add. Sense: Unrecovered read error
[118347.821422] sd 3:0:0:1: [sde] tag#0 CDB: Read(16) 88 00 00 00 00 00 00 00 00 08 00 00 00 08 00 00
[118347.821425] print_req_error: critical medium error, dev sde, sector 8
[118347.821430] Buffer I/O error on dev sde, logical block 1, async page read
My inexperienced suspicion is that I have some bad blocks in a critical
portion of the drives, where some magic numbers should reside, so they
appear as "empty" to the system.
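If I am reading the two outputs together correctly, sector 8 (at 512 bytes
per sector) is byte offset 0x1000, which is exactly where wipefs found the
linux_raid_member signature on the good drives - i.e. where the md
superblock should live. I suppose I could confirm that the bad drives fail
right at that spot by reading just that region directly; something along
these lines should be read-only and harmless as far as I can tell (using
sdc as the example):

  dd if=/dev/sdc bs=4096 skip=1 count=1 iflag=direct | hexdump -C   # read just the superblock region
  mdadm --examine /dev/sdc

If both of those error out the same way, then the superblock area itself is
unreadable, mdadm has no magic number to recognize, and that would explain
why the array refuses to assemble.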