mirrors not forming at reboot (sometimes)

Hello,

I'd like some advice, please, on how to troubleshoot this further. I
have 4 x HP DL580 servers, each configured with dual P400 Smart Array
controller cards, and I'm using MD RAID to mirror two partitions (one
for /boot, one for the LVM system volume group) across both
controllers. They're all running kernel 2.6.18-128.1.14.el5 and
mdadm-2.6.4-1.el5. Their hardware and patch levels are identical.

3 of the servers are fine, but one of them will occasionally fail to
construct the mirrors on reboot and I'm scratching my head as to why.
It's always the disk partitions from the same controller that are
missing.  They add in afterwards ok and then everything runs fine
until the system reboots.  The systems aren't out of development and
in production just yet, so they get rebooted more frequently than they
would normally.

Lacking any evidence as to why this might be, I've raised the kernel
logging level at boot time as high as it will go, which gives me the
following. The first example, from another system, is what it
should look like when it works properly.


On sdorac2a  (good)

Nov  4 13:49:52 sdorac2a kernel: md: Autodetecting RAID arrays.
Nov  4 13:49:52 sdorac2a kernel: md: autorun ...
Nov  4 13:49:52 sdorac2a kernel: md: considering cciss/c1d0p2 ...
Nov  4 13:49:52 sdorac2a kernel: md:  adding cciss/c1d0p2 ...
Nov  4 13:49:52 sdorac2a kernel: md: cciss/c1d0p1 has different UUID to cciss/c1d0p2
Nov  4 13:49:52 sdorac2a kernel: md:  adding cciss/c0d0p2 ...
Nov  4 13:49:52 sdorac2a kernel: md: cciss/c0d0p1 has different UUID to cciss/c1d0p2
Nov  4 13:49:52 sdorac2a kernel: md: created md1
Nov  4 13:49:52 sdorac2a kernel: md: bind<cciss/c0d0p2>
Nov  4 13:49:52 sdorac2a kernel: md: bind<cciss/c1d0p2>
Nov  4 13:49:52 sdorac2a kernel: md: running: <cciss/c1d0p2><cciss/c0d0p2>
Nov  4 13:49:52 sdorac2a kernel: raid1: raid set md1 active with 2 out of 2 mirrors
Nov  4 13:49:52 sdorac2a kernel: md: considering cciss/c1d0p1 ...
Nov  4 13:49:52 sdorac2a kernel: md:  adding cciss/c1d0p1 ...
Nov  4 13:49:52 sdorac2a kernel: md:  adding cciss/c0d0p1 ...
Nov  4 13:49:52 sdorac2a kernel: md: created md0
Nov  4 13:49:52 sdorac2a kernel: md: bind<cciss/c0d0p1>
Nov  4 13:49:52 sdorac2a kernel: md: bind<cciss/c1d0p1>
Nov  4 13:49:53 sdorac2a kernel: md: running: <cciss/c1d0p1><cciss/c0d0p1>
Nov  4 13:49:53 sdorac2a kernel: raid1: raid set md0 active with 2 out of 2 mirrors
Nov  4 13:49:53 sdorac2a kernel: md: ... autorun DONE.


And then the same for the system displaying the problem

On sdorac4b   (bad)

Nov  4 10:53:09 sdorac4b kernel: md: Autodetecting RAID arrays.
Nov  4 10:53:09 sdorac4b kernel: md: autorun ...
Nov  4 10:53:09 sdorac4b kernel: md: considering cciss/c0d0p2 ...
Nov  4 10:53:09 sdorac4b kernel: md:  adding cciss/c0d0p2 ...
Nov  4 10:53:09 sdorac4b kernel: md: cciss/c0d0p1 has different UUID to cciss/c0d0p2
Nov  4 10:53:09 sdorac4b kernel: md: created md1
Nov  4 10:53:09 sdorac4b kernel: md: bind<cciss/c0d0p2>
Nov  4 10:53:09 sdorac4b kernel: md: running: <cciss/c0d0p2>
Nov  4 10:53:09 sdorac4b kernel: raid1: raid set md1 active with 1 out of 2 mirrors
Nov  4 10:53:09 sdorac4b kernel: md: considering cciss/c0d0p1 ...
Nov  4 10:53:09 sdorac4b kernel: md:  adding cciss/c0d0p1 ...
Nov  4 10:53:09 sdorac4b kernel: md: created md0
Nov  4 10:53:09 sdorac4b kernel: md: bind<cciss/c0d0p1>
Nov  4 10:53:09 sdorac4b kernel: md: running: <cciss/c0d0p1>
Nov  4 10:53:09 sdorac4b kernel: raid1: raid set md0 active with 1 out of 2 mirrors
Nov  4 10:53:09 sdorac4b kernel: md: ... autorun DONE.
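
To compare a good boot with a bad one without reading the logs side by
side, the bind lines can be summarised per array. This is only a rough
sketch: `md_bound` is a made-up helper name, and it assumes the log text
on stdin covers a single boot and uses the message wording shown above.

```shell
#!/bin/sh
# Summarise which partitions the kernel md autodetect bound into each array.
# Reads syslog-style text on stdin and prints one "mdN: dev dev ..." line
# per array, in the order the arrays were created.
# md_bound is a hypothetical helper name, not part of mdadm or the kernel.
md_bound() {
    awk '
        # "md: created mdN" starts a new array; remember its name.
        /md: created md[0-9]+/ { arr = $NF; order[++n] = arr; next }
        # "md: bind<dev>" attaches a partition to the most recent array.
        /md: bind</ {
            dev = $0
            sub(/.*bind</, "", dev)   # drop everything up to the "<"
            sub(/>.*/, "", dev)       # drop the closing ">" and the rest
            members[arr] = members[arr] " " dev
        }
        END { for (i = 1; i <= n; i++) print order[i] ":" members[order[i]] }
    '
}
```

Fed the sdorac4b excerpt above, it prints `md1: cciss/c0d0p2` and
`md0: cciss/c0d0p1`, which makes the absent c1 partitions easy to spot
when the same summary is taken on a healthy host.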

It doesn't seem to even attempt to sniff out the md devices on
cciss/c1 at all. Once booted, mdadm shows the missing device as
removed.

[root@sdorac4b ~]# mdadm -QD /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Tue Jun 30 15:56:16 2009
     Raid Level : raid1
     Array Size : 305088 (297.99 MiB 312.41 MB)
  Used Dev Size : 305088 (297.99 MiB 312.41 MB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Wed Nov  4 10:52:23 2009
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           UUID : 1e7d74f0:e7a63d85:bf9c3cc3:a1716192
         Events : 0.100

    Number   Major   Minor   RaidDevice State
       0     104        1        0      active sync   /dev/cciss/c0d0p1
       1       0        0        1      removed


But the UUID number for the missing device is the same, so surely when
it sniffs around for this UUID at boot time it should find and try the
missing device too:

# mdadm -Esb /dev/cciss/c1d0p1
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=1e7d74f0:e7a63d85:bf9c3cc3:a1716192


Same for the second raid device ...

[root@sdorac4b ~]# mdadm -QD /dev/md1
/dev/md1:
        Version : 00.90.03
  Creation Time : Tue Jun 30 15:55:49 2009
     Raid Level : raid1
     Array Size : 143026624 (136.40 GiB 146.46 GB)
  Used Dev Size : 143026624 (136.40 GiB 146.46 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Wed Nov  4 15:27:03 2009
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           UUID : 38ea2ec7:eb1ea6b1:0fb9225f:defc1e17
         Events : 0.622028

    Number   Major   Minor   RaidDevice State
       0     104        2        0      active sync   /dev/cciss/c0d0p2
       1       0        0        1      removed


# mdadm -Esb /dev/cciss/c1d0p2
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=38ea2ec7:eb1ea6b1:0fb9225f:defc1e17
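
The UUID comparison above can be scripted so it is less error-prone than
checking by eye. A minimal sketch, assuming only the two output formats
shown in this message; `md_uuid` and `same_md_uuid` are hypothetical
helper names, not part of mdadm.

```shell
#!/bin/sh
# Print the first md UUID found on stdin. Handles both the "UUID : x:x:x:x"
# form in `mdadm -QD` output and the "UUID=x:x:x:x" form in `mdadm -Esb`
# output. md_uuid is a hypothetical helper name.
md_uuid() {
    sed -n 's/.*UUID[ =:]*\([0-9a-f]\{8\}\(:[0-9a-f]\{8\}\)\{3\}\).*/\1/p' |
        head -n 1
}

# Succeed only if both UUIDs are non-empty and identical.
same_md_uuid() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Example use on the affected host (needs root, so left commented out):
#   array=$(mdadm -QD /dev/md0 | md_uuid)
#   member=$(mdadm -Esb /dev/cciss/c1d0p1 | md_uuid)
#   same_md_uuid "$array" "$member" && echo "c1d0p1 carries the md0 UUID"
```

A match only confirms that the superblock on the missing partition still
names the right array; it says nothing about why the partition was not
considered at boot.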


Can anyone suggest anything else I can set/try to get more
information, or have insights based on previous experience?

Thanks,

John