> On 15/10/2019 21:32, Curtis Vaughan wrote:
>>
>> On 10/14/19 5:44 PM, Wols Lists wrote:
>>> On 15/10/19 00:56, Curtis Vaughan wrote:
>>>> I have reason to believe one HD in a RAID1 is dying. But I'm trying to
>>>> understand what the logs and results of various commands are telling me.
>>>> Searching on the Internet is very confusing. BTW, this is for an Ubuntu
>>>> Server 18.04.2 LTS.
>>> Ubuntu LTS ... hmmm. What does "mdadm --version" tell you?
>> mdadm - v4.1-rc1 - 2018-03-22
>
> That's good. A lot of Ubuntu installations seem to have mdadm 3.4, and
> that has known issues, shall we say.
>>
>>>> It seems to me that the following information is telling me one device
>>>> is missing. It would seem to me that sda is gone.
>>> Have you got any diagnostics for sda? Is it still in /dev? Has it been
>>> kicked from the system? Or just the raid?
>> It is still on the system, so I guess it's just been kicked off the raid.
>>>
>>> Take a look at
>>> https://raid.wiki.kernel.org/index.php/Linux_Raid#When_Things_Go_Wrogn
>>> Especially the "asking for help" page. It gives you loads of diagnostics
>>> to run.
>> Ok, I already went through most of those tests, but I'm still not sure
>> what they are telling me.
>
> That's why the page tells you to post the output to this list ...
>
> I'm particularly interested in what mdadm --examine and --detail tell
> me about sda. But given that you think the drive is a dud, does that
> really matter any more?
>
> However, there are a lot of reasons drives get kicked off that may have
> absolutely nothing to do with them being faulty. Are your drives
> raid-friendly? Have you checked for the timeout problem? It's quite
> possible that the drive is fine and you've just had a glitch.
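For reference, the usual way to check for the timeout mismatch Wol mentions
is roughly the sketch below. The device names are only examples and the
7-second / 180-second values are the commonly suggested ones, so treat it as
a starting point rather than a recipe:

smartctl -l scterc /dev/sda                 # does the drive support SCT ERC?
smartctl -l scterc /dev/sdb
smartctl -l scterc,70,70 /dev/sda           # if supported: cap error recovery at 7 seconds
echo 180 > /sys/block/sda/device/timeout    # if not supported: raise the kernel's command timeout instead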
mdadm --examine /dev/sda*

/dev/sda:
   MBR Magic : aa55
Partition[0] :     15622144 sectors at         2048 (type fd)
Partition[1] :   1937899520 sectors at     15624192 (type fd)

/dev/sda1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 7414ac79:580af0ce:e6bbe02b:915fa44a
  Creation Time : Wed Jul 18 15:00:44 2012
     Raid Level : raid1
  Used Dev Size : 7811008 (7.45 GiB 8.00 GB)
     Array Size : 7811008 (7.45 GiB 8.00 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0

    Update Time : Sun Oct 13 04:03:00 2019
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0
       Checksum : 9b87d9f8 - correct
         Events : 309

      Number   Major   Minor   RaidDevice State
this     0       8        1        0      active sync   /dev/sda1

   0     0       8        1        0      active sync   /dev/sda1
   1     1       8       17        1      active sync   /dev/sdb1

/dev/sda2:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : ac37ca92:939d7053:3b802bf3:08298597
  Creation Time : Wed Jul 18 15:00:53 2012
     Raid Level : raid1
  Used Dev Size : 968949696 (924.06 GiB 992.20 GB)
     Array Size : 968949696 (924.06 GiB 992.20 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1

    Update Time : Sun Oct 13 16:39:43 2019
          State : active
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0
       Checksum : 141636b2 - correct
         Events : 4874

      Number   Major   Minor   RaidDevice State
this     0       8        2        0      active sync   /dev/sda2

   0     0       8        2        0      active sync   /dev/sda2
   1     1       8       18        1      active sync   /dev/sdb2

mdadm --detail /dev/md0

/dev/md0:
           Version : 0.90
     Creation Time : Wed Jul 18 15:00:44 2012
        Raid Level : raid1
        Array Size : 7811008 (7.45 GiB 8.00 GB)
     Used Dev Size : 7811008 (7.45 GiB 8.00 GB)
      Raid Devices : 2
     Total Devices : 1
   Preferred Minor : 0
       Persistence : Superblock is persistent

       Update Time : Tue Oct 15 14:31:20 2019
             State : clean, degraded
    Active Devices : 1
   Working Devices : 1
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : resync

              UUID : 7414ac79:580af0ce:e6bbe02b:915fa44a
            Events : 0.701

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8       17        1      active sync   /dev/sdb1

mdadm --detail /dev/md1

/dev/md1:
           Version : 0.90
     Creation Time : Wed Jul 18 15:00:53 2012
        Raid Level : raid1
        Array Size : 968949696 (924.06 GiB 992.20 GB)
     Used Dev Size : 968949696 (924.06 GiB 992.20 GB)
      Raid Devices : 2
     Total Devices : 1
   Preferred Minor : 1
       Persistence : Superblock is persistent

       Update Time : Tue Oct 15 14:41:22 2019
             State : clean, degraded
    Active Devices : 1
   Working Devices : 1
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : resync

              UUID : ac37ca92:939d7053:3b802bf3:08298597
            Events : 0.85840

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8       18        1      active sync   /dev/sdb2

>
>>> Basically, we need to know whether sda has died, or whether it's a
>>> problem with raid (especially with older mdadm, like I suspect you may
>>> have, the problem could lie there).
>>>> Anyhow, for example, I received an email:
>>>>
>>>> A DegradedArray event had been detected on md device /dev/md0.
>>> Do you have a spare drive to replace sda? If you haven't, it might be an
>>> idea to get one - especially if you think sda might have failed. In that
>>> case, fixing the raid should be pretty easy. So long as fixing it
>>> doesn't tip sdb over the edge ...
>>
>> The replacement drive is coming tomorrow. I'm certain now there's a
>> major issue with the drive and will be replacing it.
>
> What makes you think that?

I ran Spinrite against the drives. It gets about halfway through the bad
drive and basically stops. On the good drive it goes from beginning to
end without issues.

Also, after running smartctl on the bad drive, I got this:

The following warning/error was logged by the smartd daemon:

Device: /dev/sda [SAT], 1568 Offline uncorrectable sectors

Device info:
ST1000DM003-9YN162, S/N:Z1D17B24, WWN:5-000c50-050e6c90f, FW:CC4C, 1.00 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
Another message will be sent in 24 hours if the problem persists.

And this:

The following warning/error was logged by the smartd daemon:

Device: /dev/sda [SAT], 1568 Currently unreadable (pending) sectors

Device info:
ST1000DM003-9YN162, S/N:Z1D17B24, WWN:5-000c50-050e6c90f, FW:CC4C, 1.00 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
Another message will be sent in 24 hours if the problem persists.

>>
>> My intent is to basically follow these instructions for replacing the
>> drive.
>>
>> sudo mdadm --remove /dev/md1 /dev/sda2
>> sudo mdadm --remove /dev/md0 /dev/sda1
>>
>> Remove the bad drive, install the new drive, then:
>>
>> sudo mdadm --add /dev/md1 /dev/sda2
>> sudo mdadm --add /dev/md0 /dev/sda1
>>
>> Would that be the correct approach?
>
> Yup. Sounds good. The only thing that might make sense, especially if
> you're getting a slightly bigger drive to replace sda, is to look at
> putting dm-integrity between the partition and the raid. There's a good
> chance it's not available to you because it's a new feature, but the
> idea is that it checksums the writes. So if there's data corruption, the
> raid no longer wonders which drive is correct; the corruption triggers a
> read error and the raid will fix it. I can't give you any pointers,
> sorry, because I haven't played with it myself, but you should be able
> to find some information here or elsewhere on it.

I ordered a drive that is exactly the same size as the failed drive.
In fact, it's pretty much the same drive.
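Concretely, the sequence I have in mind for tomorrow looks something like the
sketch below. It assumes the replacement also comes up as /dev/sda and is only
a rough outline -- I'll double-check the device names before running anything:

sfdisk -d /dev/sdb | sfdisk /dev/sda   # copy the MBR partition table from the good drive to the new one
mdadm /dev/md0 --add /dev/sda1         # add the new partitions back into their arrays
mdadm /dev/md1 --add /dev/sda2
cat /proc/mdstat                       # then keep an eye on the rebuild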
>>
>> Finally, could you tell me how to subscribe to this newsgroup through
>> an NNTP client? I was using it through the gmane server, which seems to
>> have issues over whether it's being continued or not. And although I can
>> see some recent posts from last week, there has been nothing new. I've
>> been searching for an NNTP server, but can't find one.
>> Thanks!
>>
> Dunno how to subscribe using an nntp client, because this is a mailing
> list, but details are on the wiki home page. Click on the link to the
> mailing list, and you can subscribe to the list there.
>
> Cheers,
> Wol