Re: RAID1 fail did not work properly with SSDs

"Cal Leeming [Simplicity Media Ltd]" <cal.leeming@xxxxxxxxxxxxxxxxxxxxxxxx> · Thu, 5 Jan 2012 02:18:30 +0000

Hi Neil,

Terribly sorry, I had pasted the wrong lines from mdstat, here is the
correct info:

md1 : active (auto-read-only) raid1 sdd1[0] sda1[1]
      975860 blocks super 1.2 [2/2] [UU]
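
In case it helps anyone checking the same thing, here is a minimal sketch for spotting a degraded array in mdstat-style output. It relies on two conventions of /proc/mdstat: a failed member is tagged "(F)" on the array line, and a missing member shows as "_" in the "[UU]" map. The function name degraded_arrays is made up for illustration, and the parsing is deliberately loose.

```shell
# Read /proc/mdstat-style text on stdin and print the name of every
# array that has a failed (F) member or a "_" in its status map.
# degraded_arrays is a hypothetical helper name, not part of mdadm.
degraded_arrays() {
    awk '
        # Array line, e.g. "md1 : active raid1 sdd1[0] sda1[1]"
        /^md[0-9]+ :/ { name = $1; failed = ($0 ~ /\(F\)/) }
        # Status line, e.g. "975860 blocks super 1.2 [2/2] [UU]"
        name != "" && / blocks / {
            if (failed || $NF ~ /_/) print name
            name = ""
        }'
}
```

Usage would be something like `degraded_arrays < /proc/mdstat`. For the md1 output above it prints nothing, which matches the complaint: md still considered both members healthy.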

Also, I don't know if this is related, and it will probably sound crazy,
but every single disk in the server (there was another, unrelated
RAID1 with non-SSDs, sdb and sdc) was reporting this same error, and
the moment I disabled the broken SSD in the BIOS, it stopped.

 root@vicky [/sbin] > dmesg | grep sda | grep "I/O error" | wc -l
445

 root@vicky [/sbin] > dmesg | grep sdb | grep "I/O error" | wc -l
2

 root@vicky [/sbin] > dmesg | grep sdc | grep "I/O error" | wc -l
2

 root@vicky [/sbin] > dmesg | grep sdd | grep "I/O error" | wc -l
2

 root@vicky [/sbin] >
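
The four greps above can be collapsed into one pass. This is only a rough sketch: it assumes the kernel messages look like the "end_request: I/O error, dev sda, sector ..." line quoted further down, and count_io_errors is a made-up name. Unlike the greps, it counts only lines of that exact form, so the totals may not match the numbers above.

```shell
# Read kernel log text on stdin and print "<device> <count>" for every
# device named in an "I/O error, dev <name>," message.
# count_io_errors is a hypothetical helper name.
count_io_errors() {
    awk '/I\/O error, dev / {
        for (i = 1; i < NF; i++)
            if ($i == "dev") {
                dev = $(i + 1)
                sub(/,$/, "", dev)   # strip the trailing comma
                n[dev]++
                break
            }
    }
    END { for (d in n) print d, n[d] }' | sort
}
```

Usage: `dmesg | count_io_errors`.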

And here's the really crazy thing: the broken SSD was actually
/dev/sdd, not /dev/sda.
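
Since the sdX names are what caused the confusion here, one way to tie each name to a model and serial string is sketched below. It assumes udev populates /dev/disk/by-id with symlinks to the kernel devices, as most current distributions do; list_disk_ids is a made-up name.

```shell
# Print "<kernel name> <persistent id>" for each whole-disk symlink in a
# by-id style directory (default: /dev/disk/by-id). Partition links
# ("...-partN") are skipped. list_disk_ids is a hypothetical helper name.
list_disk_ids() {
    dir=${1:-/dev/disk/by-id}
    for link in "$dir"/*; do
        [ -L "$link" ] || continue
        case ${link##*/} in *-part[0-9]*) continue ;; esac
        printf '%s %s\n' "$(basename "$(readlink "$link")")" "${link##*/}"
    done | sort
}
```

Run with no arguments, it lists the disks on the local machine, and the serial in each id can be compared against the smartctl output for the drive.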

I ran a badblocks check on both: sdd failed and sda passed. After I
removed sdd, the I/O errors disappeared on both sdd and sda.

Could this be the reason it ended up in read-only mode? That is,
because the kernel saw the controller reporting the same "I/O error"
for both SSDs (even though it was caused by a single drive)?

Cal

On Thu, Jan 5, 2012 at 2:00 AM, NeilBrown <neilb@xxxxxxx> wrote:
> On Thu, 5 Jan 2012 01:44:10 +0000 "Cal Leeming [Simplicity Media Ltd]"
> <cal.leeming@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>
>> Hi all,
>>
>> My apologies if this is the wrong mailing list for this issue, but I
>> figured my email would be lost in volume if I sent to 'linux-kernel'.
>
> too true!!
>
>>
>> In short, I had 2 SSDs in RAID 1, allocated as a single physical
>> volume, which had a LVM logical volume mounted as the root partition.
>>
>> Six months later, one of the SSDs dies, and causes all hell to break loose:
>>
>> [27087.234675] sd 0:0:0:0: [sda] Unhandled error code
>> [27087.234686] sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
>> [27087.234688] sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00 68 53 88 00 00 08 00
>> [27087.234693] end_request: I/O error, dev sda, sector 6837128
>                                         ^^^^^^^^
>
> "sda".
>
>> ^^ repeated over 9000 times
>>
>> Instead of the disk being marked as failed and removed, the root
>> partition was remounted read-only, mdadm showed no problems, and the
>> machine required a reboot.
>>
>> Upon rebooting, RAID still hadn't marked the dying disk as failed or
>> removed, and began to re-sync!
>>
>>  root@vicky [/var/log] > cat /proc/mdstat
>> Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
>> md0 : active (auto-read-only) raid1 sdb1[0] sdc1[1]
>                                      ^^^^^^^^^^^^^^^
>
> "sdb" and "sdc".
>
> Something is missing in this picture.
>
> NeilBrown
>
>
>>       78122967 blocks super 1.2 [2/2] [UU]
>>
>> On top of this, even though it was read-only, it kept giving this
>> error for everything:
>>
>>  root@vicky [/var/log] > shutdown
>> bash: /sbin/shutdown: Input/output error
>>
>> I'm not sure if what I'm seeing here is normal, but thought I should
>> at least try and ask - I can provide lots more info if needed (got a
>> huge text file and several screenshots).
>>
>> Any feedback would be very much appreciated.
>>
>> Cal Leeming
>> Simplicity Media Ltd
>>
>> ----------------------------
>>
>> Here is the short smartctl dump of the disk:
>>
>>  root@vicky [/home/foxx] > smartctl -a /dev/sda
>> smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
>> Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
>>
>> === START OF INFORMATION SECTION ===
>> Device Model:     M4-CT128M4SSD2
>> Serial Number:    00000000111603061D7B
>> Firmware Version: 0001
>> User Capacity:    128,035,676,160 bytes
>> Device is:        Not in smartctl database [for details use: -P showall]
>> ATA Version is:   8
>> ATA Standard is:  ATA-8-ACS revision 6
>> Local Time is:    Tue Jan  3 13:54:46 2012 GMT
>> SMART support is: Available - device has SMART capability.
>> SMART support is: Enabled
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html