Re: RAID1 fail did not work properly with SSDs

On Thu, 5 Jan 2012 02:18:30 +0000 "Cal Leeming [Simplicity Media Ltd]"
<cal.leeming@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:

> Hi Neil,
> 
> Terribly sorry, I had pasted the wrong lines from mdstat, here is the
> correct info:
> 
> md1 : active (auto-read-only) raid1 sdd1[0] sda1[1]
>       975860 blocks super 1.2 [2/2] [UU]

That makes more sense.

However the error message was:

[27087.234693] end_request: I/O error, dev sda, sector 6837128

md1 is only 975860 (1K) blocks, or 1951720 sectors.
So unless it starts a long way into the device, this error was from a
completely different location to the array....
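
To make that arithmetic explicit, here is a minimal sketch. It converts the 1K block count from mdstat into 512-byte sectors and decodes the LBA from the Read(10) CDB quoted in the original report. The function names are only illustrative, and the comparison assumes the error sector is counted from the start of sda, as the "dev sda" in the log line suggests. Where sda1 actually starts has not been posted yet.

```python
SECTOR_BYTES = 512  # assumption: kernel sector numbers are in 512-byte units


def blocks_to_sectors(blocks_1k):
    """Convert a /proc/mdstat block count (1 KiB blocks) to 512-byte sectors."""
    return blocks_1k * 1024 // SECTOR_BYTES


def read10_lba(cdb_hex):
    """Extract the logical block address from a SCSI Read(10) CDB.

    Byte 0 is the opcode (0x28); bytes 2..5 hold the LBA, big-endian.
    """
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if cdb[0] != 0x28:
        raise ValueError("not a Read(10) CDB")
    return int.from_bytes(cdb[2:6], "big")


md1_sectors = blocks_to_sectors(975860)
error_lba = read10_lba("28 00 00 68 53 88 00 00 08 00")

print(md1_sectors)              # 1951720
print(error_lba)                # 6837128
print(error_lba > md1_sectors)  # True
```

The LBA in the CDB matches the sector in the end_request line, so the failing read was issued roughly 3.5 times further into the disk than md1 is long.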

These are 128GB devices - yes?  And md1 is 1 GB.  So what is using the
remaining 127GB?

> 
> Also, I don't know if this is related and will probably sound crazy
> but, every single disk in the server (there was another unrelated
> RAID1 with non-SSD disks - sdb and sdc) was reporting this same error,
> but the moment I disabled the broken SSD in BIOS, it stopped doing this.

It isn't unknown for one bad device to confuse all the other devices on the
same bus, or the same controller.
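
One way to see how the errors are spread across devices is to tally the end_request lines per device in one pass. This is only a sketch: the regular expression is written against the single log format quoted in this thread, and the function name is made up for illustration. It is also stricter than chaining two greps, since it counts only lines where the device named in the I/O error itself matches.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   [27087.234693] end_request: I/O error, dev sda, sector 6837128
IO_ERROR_RE = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def count_io_errors(log_text):
    """Return a Counter mapping device name to number of I/O error lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = IO_ERROR_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample input: two lines copied from the original report.
sample = """\
[27087.234675] sd 0:0:0:0: [sda] Unhandled error code
[27087.234693] end_request: I/O error, dev sda, sector 6837128
"""
print(count_io_errors(sample))  # Counter({'sda': 1})
```

Feeding it the saved output of dmesg would give the per-device totals in a single run.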

> 
>  root@vicky [/sbin] > dmesg | grep sda | grep "I/O error" | wc -l
> 445
> 
>  root@vicky [/sbin] > dmesg | grep sdb | grep "I/O error" | wc -l
> 2
> 
>  root@vicky [/sbin] > dmesg | grep sdc | grep "I/O error" | wc -l
> 2
> 
>  root@vicky [/sbin] > dmesg | grep sdd | grep "I/O error" | wc -l
> 2
> 
>  root@vicky [/sbin] >
> 
> And here's the really crazy thing.. the broken SSD was actually
> /dev/sdd, not /dev/sda.
> 
> I did a badblocks check on both, sdd failed and sda worked fine.
> Removed sdd, and the I/O error problem disappeared on both sdd and
> sda.
> 
> Could this be the reason why it ended up being placed into read-only
> mode? Because the kernel detected that the controller was saying that
> both SSDs were giving this same "I/O Error" (despite it being caused
> by a single drive)??

The devices aren't read-only.

"auto-read-only" means they are pretending to be read-only at the moment but
as soon as you write something they will automatically switch to read-write
mode.

While they are (pretending to be) read-only they won't do any resync/recovery
etc.  i.e. they won't write to any device at all.  This is generally a safe
way to start md arrays: if the wrong array is started by mistake, it won't be
written to until you e.g. try to mount it.
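
If you want to check for this state from a script, something like the following would do. It is a rough sketch that only understands array lines laid out like the ones pasted in this thread ("mdN : state (flag) level members..."). The function name and the dictionary keys are invented for the example, and other mdstat layouts (inactive arrays, named arrays) are simply skipped.

```python
import re

# Matches array status lines such as:
#   md1 : active (auto-read-only) raid1 sdd1[0] sda1[1]
ARRAY_RE = re.compile(r"^(md\d+) : (\w+)(?: \(([\w-]+)\))? (\w+) (.*)$")


def parse_mdstat(text):
    """Return {array: info} for each array line recognised in mdstat text."""
    arrays = {}
    for line in text.splitlines():
        m = ARRAY_RE.match(line)
        if not m:
            continue  # header, block-count, or unrecognised line
        name, state, flag, level, members = m.groups()
        arrays[name] = {
            "state": state,
            "auto_read_only": flag == "auto-read-only",
            "level": level,
            # Strip the "[0]" role index and any trailing flag like "(F)".
            "members": [re.sub(r"\[\d+\].*$", "", d) for d in members.split()],
        }
    return arrays


sample = """\
md1 : active (auto-read-only) raid1 sdd1[0] sda1[1]
      975860 blocks super 1.2 [2/2] [UU]
"""
info = parse_mdstat(sample)["md1"]
print(info["auto_read_only"])  # True
print(info["members"])         # ['sdd1', 'sda1']
```

On a live system the text would come from reading /proc/mdstat. An array that still shows auto-read-only has not been written to since it was started.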

It really looks like nothing is trying to write to 'md1'.

Maybe you need to give us all the details...
  cat /proc/mdstat
  cat /proc/partitions
  cat /etc/fstab 
 ....


NeilBrown


> 
> Cal
> 
> 
> On Thu, Jan 5, 2012 at 2:00 AM, NeilBrown <neilb@xxxxxxx> wrote:
> > On Thu, 5 Jan 2012 01:44:10 +0000 "Cal Leeming [Simplicity Media Ltd]"
> > <cal.leeming@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> >
> >> Hi all,
> >>
> >> My apologies if this is the wrong mailing list for this issue, but I
> >> figured my email would be lost in volume if I sent to 'linux-kernel'.
> >
> > too true!!
> >
> >>
> >> In short, I had 2 SSDs in RAID 1, allocated as a single physical
> >> volume, which had a LVM logical volume mounted as the root partition.
> >>
> >> Six months later, one of the SSDs died and all hell broke loose:
> >>
> >> [27087.234675] sd 0:0:0:0: [sda] Unhandled error code
> >> [27087.234686] sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> >> [27087.234688] sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00 68 53 88 00 00 08 00
> >> [27087.234693] end_request: I/O error, dev sda, sector 6837128
> >                                         ^^^^^^^^
> >
> > "sda".
> >
> >> ^^ repeated over 9000 times
> >>
> >> Instead of the disk being marked as failed and removed, the root
> >> partition was instead remounted as read-only, mdadm showed no
> >> problems, and required a reboot.
> >>
> >> Upon rebooting, RAID still hadn't marked the dying disk as failed or
> >> removed, and began to re-sync!
> >>
> >>  root@vicky [/var/log] > cat /proc/mdstat
> >> Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
> >> md0 : active (auto-read-only) raid1 sdb1[0] sdc1[1]
> >                                      ^^^^^^^^^^^^^^^
> >
> > "sdb" and "sdc".
> >
> > Something is missing in this picture.
> >
> > NeilBrown
> >
> >
> >>       78122967 blocks super 1.2 [2/2] [UU]
> >>
> >> On top of this, even though it was read-only, it kept giving this
> >> error for everything:
> >>
> >>  root@vicky [/var/log] > shutdown
> >> bash: /sbin/shutdown: Input/output error
> >>
> >> I'm not sure if what I'm seeing here is normal, but thought I should
> >> at least try and ask - I can provide lots more info if needed (got a
> >> huge text file and several screenshots).
> >>
> >> Any feedback would be very much appreciated.
> >>
> >> Cal Leeming
> >> Simplicity Media Ltd
> >>
> >> ----------------------------
> >>
> >> Here is the short smartctl dump of the disk:
> >>
> >>  root@vicky [/home/foxx] > smartctl -a /dev/sda
> >> smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
> >> Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
> >>
> >> === START OF INFORMATION SECTION ===
> >> Device Model:     M4-CT128M4SSD2
> >> Serial Number:    00000000111603061D7B
> >> Firmware Version: 0001
> >> User Capacity:    128,035,676,160 bytes
> >> Device is:        Not in smartctl database [for details use: -P showall]
> >> ATA Version is:   8
> >> ATA Standard is:  ATA-8-ACS revision 6
> >> Local Time is:    Tue Jan  3 13:54:46 2012 GMT
> >> SMART support is: Available - device has SMART capability.
> >> SMART support is: Enabled
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> >> the body of a message to majordomo@xxxxxxxxxxxxxxx
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
