Re: raid5 failed while rebuilding - classic problem

OK, I can force assembly of this array:

  backup:~# mdadm -S /dev/md1
  mdadm: stopped /dev/md1
  backup:~# mdadm -Af /dev/md1 /dev/sda3 /dev/sdb3 /dev/hda3 /dev/sdd3
  mdadm: forcing event count in /dev/hda3(2) from 742923 upto 742950
  mdadm: clearing FAULTY flag for device 2 in /dev/md1 for /dev/hda3
  mdadm: /dev/md1 has been started with 4 drives (out of 5).

But hda3 still has a pending bad sector, so if I started a rebuild
now, hda3 would get dropped from the array again. I need to overwrite
the bad sector with zeros first.
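
An alternative I considered is to zero just that one sector in place,
so the drive remaps it on the write. Assuming the kernel log is right
that the bad sector is 763095168 counted from the start of hda3,
something like this should do it, but I am not confident enough in
the offset arithmetic to risk it:

  # dd if=/dev/zero of=/dev/hda3 bs=512 count=1 seek=763095168 conv=fsync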

I have a free hdc3, which I am going to use for that:

  # dd bs=512 if=/dev/hda3 of=/dev/hdc3 conv=noerror,sync

The noerror option tells dd to keep going even when a block cannot be
read, and sync tells it to pad such a block with zeros on output, so
the unreadable sector becomes zeros on hdc3 (at least I hope I got
the manpage right).
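
If plain dd turns out to be too slow around the bad area, GNU
ddrescue should do the same job while keeping a log of the unreadable
spots (assuming I can get the package onto this old etch box; the log
file path is just an example):

  # ddrescue -f /dev/hda3 /dev/hdc3 /root/hda3-rescue.log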

Then I will have an almost exact copy of the failing hda3 on hdc3. I
can then either dd it back, or maybe just --add hdc3 in place of
hda3?
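
By the way, if I understand the 1.1 superblock format correctly, it
sits at the start of the device, so the dd copy will include it and
hdc3 should come out as an exact stand-in for hda3 (same UUID, same
slot). In that case a forced assembly with hdc3 listed in place of
hda3 ought to work, and would avoid yet another full rebuild:

  # mdadm -Af /dev/md1 /dev/sda3 /dev/sdb3 /dev/hdc3 /dev/sdd3

That is just my guess, corrections welcome.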

This will take some time to complete, so let's wait.
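
In the meantime, GNU dd prints its I/O statistics when it receives
SIGUSR1, so I can keep an eye on the copy with:

  # watch -n 60 'kill -USR1 $(pidof dd)'

(the statistics show up on dd's stderr, in the terminal where it runs).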

best regards
Janek Kozicki


Janek Kozicki said:     (by the date of Fri, 2 Jul 2010 16:31:55 +0200)

> Hello,
> 
> 
> Following is a lengthy and full explanation of my problem. A short
> summary is at the end. It is possible that you already know the
> answer without reading all of it; in that case please just scroll
> down and help me ;-)
> 
> 
> I saw smartmontools reporting that sdc had a pending unreadable sector:
> 
> Jul  1 10:19:52 backup_ smartd[2793]: Device: /dev/sdc, 1 Currently unreadable (pending) sectors  
> 
> My md1 layout was at that time following:
> 
> /dev/md1: 
>         Version : 01.01.03
>   Creation Time : Fri Nov  2 23:35:37 2007
>      Raid Level : raid5
>      Array Size : 1933614592 (1844.04 GiB 1980.02 GB)
>     Device Size : 966807296 (461.01 GiB 495.01 GB)
>    Raid Devices : 5
>   Total Devices : 5
> Preferred Minor : 1
>     Persistence : Superblock is persistent
> 
>   Intent Bitmap : Internal
> 
>     Update Time : Thu Jul  1 11:14:03 2010
>           State : active
>  Active Devices : 5
> Working Devices : 5
>  Failed Devices : 0
>   Spare Devices : 0
> 
>          Layout : left-symmetric
>      Chunk Size : 128K
> 
>            Name : backup:1  (local to host backup)
>            UUID : 22f22c35:99613d52:31d407a6:55bdeb84
>          Events : 718999
> 
>     Number   Major   Minor   RaidDevice State
>        5       8        3        0      active sync   /dev/sda3
>        1       8       19        1      active sync   /dev/sdb3
>        3       3        3        2      active sync   /dev/hda3
>        4       8       51        3      active sync   /dev/sdd3
>        6       8       35        4      active sync   /dev/sdc3
> 
> I wanted to get this corrected, so I ran the following command, but
> it didn't help:
> 
>  /usr/share/mdadm/checkarray -a
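> 
> (As far as I can tell checkarray just pokes the md sysfs trigger; the
> by-hand equivalent, and a 'repair' pass that would actually rewrite
> inconsistent stripes, would be something like:
> 
>   echo check  > /sys/block/md1/md/sync_action
>   echo repair > /sys/block/md1/md/sync_action
>   cat /sys/block/md1/md/mismatch_cnt
> 
> but the check came back without fixing the pending sector.)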
> 
> So I wanted to test sdc more thoroughly, to get this fixed:
> 
> $ mdadm --fail /dev/md1 /dev/sdc3
> mdadm: set /dev/sdc3 faulty in /dev/md1
> 
> $ mdadm --remove /dev/md1 /dev/sdc3
> mdadm: hot removed /dev/sdc3
> 
> $ badblocks -c 10240 -s -w -t random -v /dev/sdc3
> Checking for bad blocks in read-write mode
> From block 0 to 483403882
> Testing with random pattern: done
> Reading and comparing: done
> Pass completed, 0 bad blocks found.
> 
> $ mdadm --add /dev/md1 /dev/sdc3
> mdadm: added /dev/sdc3
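> 
> (In hindsight, the read-write badblocks run had also wiped the old md
> superblock on sdc3, so mdadm had no choice but to treat it as a
> brand-new spare and schedule a full rebuild. Something like
> 
>   mdadm -E /dev/sdc3
> 
> run before the --add would have shown whether the superblock had
> survived; with it intact, and with the internal bitmap, a --re-add
> would have resynced only the blocks changed while the disk was out.)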
> 
> $ mdadm -D /dev/md1
> /dev/md1: 
>         Version : 01.01.03
>   Creation Time : Fri Nov  2 23:35:37 2007
>      Raid Level : raid5
>      Array Size : 1933614592 (1844.04 GiB 1980.02 GB)
>     Device Size : 966807296 (461.01 GiB 495.01 GB)
>    Raid Devices : 5
>   Total Devices : 5
> Preferred Minor : 1
>     Persistence : Superblock is persistent
> 
>   Intent Bitmap : Internal
> 
>     Update Time : Thu Jul  1 18:19:47 2010
>           State : active, degraded, recovering
>  Active Devices : 4
> Working Devices : 5
>  Failed Devices : 0
>   Spare Devices : 1
> 
>          Layout : left-symmetric
>      Chunk Size : 128K
> 
>  Rebuild Status : 0% complete
> 
>            Name : backup:1  (local to host backup)
>            UUID : 22f22c35:99613d52:31d407a6:55bdeb84
>          Events : 733802
> 
>     Number   Major   Minor   RaidDevice State
>        5       8        3        0      active sync   /dev/sda3
>        1       8       19        1      active sync   /dev/sdb3
>        3       3        3        2      active sync   /dev/hda3
>        4       8       51        3      active sync   /dev/sdd3
>        6       8       35        4      spare rebuilding   /dev/sdc3
> 
> OK, so I left it to rebuild. But then... hda3 failed.
> 
> I wasn't there when it happened, but this is what I see in the syslog:
> 
> Jul  1 22:34:49 backup_ mdadm: Rebuild60 event detected on md device /dev/md1 
> Jul  1 22:49:51 backup_ smartd[2793]: Device: /dev/hda, SMART Usage Attribute: 194 Temperature_Celsius changed from 122 to 121  
> Jul  1 22:49:51 backup_ smartd[2793]: Device: /dev/sda, SMART Usage Attribute: 194 Temperature_Celsius changed from 214 to 222  
> Jul  1 22:49:52 backup_ smartd[2793]: Device: /dev/sdd, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 118 to 108  
> Jul  1 22:49:52 backup_ smartd[2793]: Device: /dev/sdd, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 47 to 48  
> Jul  1 23:42:25 backup_ uptimed: moving up to position 33: 8 days, 05:23:01 
> Jul  1 23:49:51 backup_ smartd[2793]: Device: /dev/sda, SMART Usage Attribute: 194 Temperature_Celsius changed from 214 to 222  
> Jul  1 23:49:53 backup_ smartd[2793]: Device: /dev/sdd, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 115 to 117  
> Jul  1 23:49:53 backup_ smartd[2793]: Device: /dev/sdd, SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 68 to 69  
> Jul  1 23:49:53 backup_ smartd[2793]: Device: /dev/sdd, SMART Usage Attribute: 194 Temperature_Celsius changed from 32 to 31  
> Jul  1 23:49:53 backup_ smartd[2793]: Device: /dev/sdd, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 47 to 48  
> Jul  1 23:55:25 backup_ uptimed: moving up to position 32: 8 days, 05:36:01 
> Jul  1 23:58:07 backup_ kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } 
> Jul  1 23:58:07 backup_ kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=773056484, high=46, low=1304548, sector=773055468 
> Jul  1 23:58:07 backup_ kernel: ide: failed opcode was: unknown 
> Jul  1 23:58:07 backup_ kernel: end_request: I/O error, dev hda, sector 773055468 
> Jul  1 23:58:07 backup_ kernel: raid5:md1: read error not correctable (sector 763095168 on hda3). 
> Jul  1 23:58:07 backup_ kernel: raid5: Disk failure on hda3, disabling device. Operation continuing on 3 devices 
> Jul  1 23:58:10 backup_ kernel: Buffer I/O error on device md1, logical block 26544335 
> Jul  1 23:58:10 backup_ kernel: lost page write due to I/O error on md1 
> Jul  1 23:58:10 backup_ kernel: Buffer I/O error on device md1, logical block 26544336 
> Jul  1 23:58:10 backup_ kernel: lost page write due to I/O error on md1 
> Jul  1 23:58:10 backup_ kernel: Buffer I/O error on device md1, logical block 26544337 
> Jul  1 23:58:10 backup_ kernel: lost page write due to I/O error on md1 
> Jul  1 23:58:10 backup_ kernel: Buffer I/O error on device md1, logical block 26544338 
> Jul  1 23:58:10 backup_ kernel: lost page write due to I/O error on md1 
> Jul  1 23:58:10 backup_ kernel: Buffer I/O error on device md1, logical block 26544339 
> Jul  1 23:58:10 backup_ kernel: lost page write due to I/O error on md1 
> Jul  1 23:58:10 backup_ kernel: Buffer I/O error on device md1, logical block 26544340 
> Jul  1 23:58:10 backup_ kernel: lost page write due to I/O error on md1 
> Jul  1 23:58:10 backup_ kernel: Buffer I/O error on device md1, logical block 26544341 
> Jul  1 23:58:10 backup_ kernel: lost page write due to I/O error on md1 
> Jul  1 23:58:10 backup_ kernel: Buffer I/O error on device md1, logical block 26544342 
> Jul  1 23:58:10 backup_ kernel: lost page write due to I/O error on md1 
> Jul  1 23:58:10 backup_ kernel: Buffer I/O error on device md1, logical block 26544343 
> Jul  1 23:58:10 backup_ kernel: lost page write due to I/O error on md1 
> Jul  1 23:58:10 backup_ kernel: Buffer I/O error on device md1, logical block 26544344 
> Jul  1 23:58:10 backup_ kernel: lost page write due to I/O error on md1 
> Jul  1 23:58:10 backup_ kernel: Aborting journal on device md1. 
> Jul  1 23:58:10 backup_ kernel: md: md1: recovery done. 
> Jul  1 23:58:10 backup_ mdadm: Fail event detected on md device /dev/md1, component device /dev/hda3 
> Jul  1 23:58:10 backup_ kernel: ext3_abort called. 
> Jul  1 23:58:10 backup_ kernel: EXT3-fs error (device md1): ext3_journal_start_sb: Detected aborted journal 
> Jul  1 23:58:10 backup_ kernel: Remounting filesystem read-only 
> Jul  1 23:58:10 backup_ kernel: RAID5 conf printout: 
> Jul  1 23:58:10 backup_ kernel:  --- rd:5 wd:3 
> Jul  1 23:58:10 backup_ kernel:  disk 0, o:1, dev:sda3 
> Jul  1 23:58:10 backup_ kernel:  disk 1, o:1, dev:sdb3 
> Jul  1 23:58:10 backup_ kernel:  disk 2, o:0, dev:hda3 
> Jul  1 23:58:10 backup_ kernel:  disk 3, o:1, dev:sdd3 
> Jul  1 23:58:10 backup_ kernel:  disk 4, o:1, dev:sdc3 
> Jul  1 23:58:10 backup_ kernel: RAID5 conf printout: 
> Jul  1 23:58:10 backup_ kernel:  --- rd:5 wd:3 
> Jul  1 23:58:10 backup_ kernel:  disk 0, o:1, dev:sda3 
> Jul  1 23:58:10 backup_ kernel:  disk 1, o:1, dev:sdb3 
> Jul  1 23:58:10 backup_ kernel:  disk 2, o:0, dev:hda3 
> Jul  1 23:58:10 backup_ kernel:  disk 3, o:1, dev:sdd3 
> Jul  1 23:58:10 backup_ kernel: RAID5 conf printout: 
> Jul  1 23:58:10 backup_ kernel:  --- rd:5 wd:3 
> Jul  1 23:58:10 backup_ kernel:  disk 0, o:1, dev:sda3 
> Jul  1 23:58:10 backup_ kernel:  disk 1, o:1, dev:sdb3 
> Jul  1 23:58:10 backup_ kernel:  disk 2, o:0, dev:hda3 
> Jul  1 23:58:10 backup_ kernel:  disk 3, o:1, dev:sdd3 
> Jul  1 23:58:10 backup_ kernel: RAID5 conf printout: 
> Jul  1 23:58:10 backup_ kernel:  --- rd:5 wd:3 
> Jul  1 23:58:10 backup_ kernel:  disk 0, o:1, dev:sda3 
> Jul  1 23:58:10 backup_ kernel:  disk 1, o:1, dev:sdb3 
> Jul  1 23:58:10 backup_ kernel:  disk 3, o:1, dev:sdd3 
> Jul  1 23:58:10 backup_ mdadm: RebuildFinished event detected on md device /dev/md1 
> Jul  2 00:19:50 backup_ smartd[2793]: Device: /dev/hda, 1 Currently unreadable (pending) sectors  
> Jul  2 00:19:50 backup_ smartd[2793]: Sending warning via /usr/share/smartmontools/smartd-runner to root ...  
> Jul  2 00:19:50 backup_ smartd[2793]: Warning via /usr/share/smartmontools/smartd-runner to root: successful  
> Jul  2 00:19:50 backup_ smartd[2793]: Device: /dev/hda, SMART Usage Attribute: 194 Temperature_Celsius changed from 121 to 122  
> Jul  2 00:19:50 backup_ smartd[2793]: Device: /dev/hda, ATA error count increased from 0 to 1  
> Jul  2 00:19:50 backup_ smartd[2793]: Sending warning via /usr/share/smartmontools/smartd-runner to root ...  
> Jul  2 00:19:50 backup_ smartd[2793]: Warning via /usr/share/smartmontools/smartd-runner to root: successful  
> Jul  2 00:19:50 backup_ smartd[2793]: Device: /dev/sda, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 100 to 94  
> Jul  2 00:19:51 backup_ smartd[2793]: Device: /dev/sdb, SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 69 to 70  
> Jul  2 00:19:51 backup_ smartd[2793]: Device: /dev/sdb, SMART Usage Attribute: 194 Temperature_Celsius changed from 31 to 30  
> Jul  2 00:19:51 backup_ smartd[2793]: Device: /dev/sdb, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 66 to 64  
> Jul  2 00:19:51 backup_ smartd[2793]: Device: /dev/sdc, SMART Usage Attribute: 194 Temperature_Celsius changed from 115 to 116  
> Jul  2 00:19:51 backup_ smartd[2793]: Device: /dev/sdd, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 117 to 118  
> Jul  2 00:19:51 backup_ smartd[2793]: Device: /dev/sdd, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 48 to 49  
> 
> 
> And so in the end I have "/dev/hda, 1 Currently unreadable (pending)
> sectors", the very error that got me worried about sdc in the first
> place. Except that this time my raid5 is down:
> 
> $ mdadm -D /dev/md1
> /dev/md1: 
>         Version : 01.01.03
>   Creation Time : Fri Nov  2 23:35:37 2007
>      Raid Level : raid5
>      Array Size : 1933614592 (1844.04 GiB 1980.02 GB)
>     Device Size : 966807296 (461.01 GiB 495.01 GB)
>    Raid Devices : 5
>   Total Devices : 5
> Preferred Minor : 1
>     Persistence : Superblock is persistent
> 
>   Intent Bitmap : Internal
> 
>     Update Time : Fri Jul  2 15:25:45 2010
>           State : active, degraded
>  Active Devices : 3
> Working Devices : 4
>  Failed Devices : 1
>   Spare Devices : 1
> 
>          Layout : left-symmetric
>      Chunk Size : 128K
> 
>            Name : backup:1  (local to host backup)
>            UUID : 22f22c35:99613d52:31d407a6:55bdeb84
>          Events : 742938
> 
>     Number   Major   Minor   RaidDevice State
>        5       8        3        0      active sync   /dev/sda3
>        1       8       19        1      active sync   /dev/sdb3
>        2       0        0        2      removed
>        4       8       51        3      active sync   /dev/sdd3
>        4       0        0        4      removed
> 
>        3       3        3        -      faulty spare   /dev/hda3
>        6       8       35        -      spare   /dev/sdc3
> 
> The filesystem /backup is still mounted, albeit read-only:
> 
> backup:~$ pydf
> Filesystem             Size   Used  Avail  Use%           Mounted on
> /dev/md/0              942M   836M    77M  88.8 [###### ] /
> /dev/md/1             1815G  1736G    24G  95.6 [#######] /backup
> udev                    10M   128k    10M   1.3 [       ] /dev
> tmpfs                  379M      0   379M   0.0 [       ] /dev/shm
> tmpfs                  379M      0   379M   0.0 [       ] /lib/init/rw
> 
> backup:~$ ls -la /backup/
> total 80
> drwxr-xr-x 29 root    root    4096 Jul  1 16:20 .
> drwxr-xr-x 24 root    root    4096 Mar  4 14:15 ..
> drwxr-xr-x 11 salomea salomea 4096 Jul  1 23:46 .mldonkey
> drwxr-xr-x 10 root    root    4096 Jul  1 14:59 .sync
> ?---------  ? ?       ?          ?            ? /backup/1_daily.4
> ?---------  ? ?       ?          ?            ? /backup/1_daily.5
> ?---------  ? ?       ?          ?            ? /backup/1_daily.6
> ?---------  ? ?       ?          ?            ? /backup/2_weekly.0
> ?---------  ? ?       ?          ?            ? /backup/3_monthly.3
> ?---------  ? ?       ?          ?            ? /backup/3_monthly.5
> ?---------  ? ?       ?          ?            ? /backup/lost+found
> drwxr-xr-x 10 root    root    4096 Jul  1 14:59 0_hourly.0
> drwxr-xr-x 10 root    root    4096 Jul  1 02:23 0_hourly.1
> drwxr-xr-x 10 root    root    4096 Jun 30 18:31 0_hourly.2
> drwxr-xr-x 10 root    root    4096 Jun 30 14:54 1_daily.0
> drwxr-xr-x 10 root    root    4096 Jun 29 14:50 1_daily.1
> drwxr-xr-x 10 root    root    4096 Jun 28 14:21 1_daily.2
> drwxr-xr-x 10 root    root    4096 Jun 26 14:34 1_daily.3
> drwxr-xr-x 10 root    root    4096 Jun 16 15:00 2_weekly.1
> drwxr-xr-x 10 root    root    4096 Jun  8 14:33 2_weekly.2
> drwxr-xr-x 10 root    root    4096 Jun  1 14:40 2_weekly.3
> drwxr-xr-x 10 root    root    4096 May 24 14:32 3_monthly.0
> drwxr-xr-x 10 root    root    4096 Apr 17 14:59 3_monthly.1
> drwxr-xr-x 10 root    root    4096 Mar 24 02:19 3_monthly.2
> drwxr-xr-x 10 root    root    4096 Jan 23 10:50 3_monthly.4
> 
> I suppose this is a classic situation. I decided to wait for your
> help before making it worse. sdc's contents are irrecoverable,
> because I tested the drive in read-write mode. But hda was removed
> due to just one damaged sector. I would much prefer to have all 2 TB
> intact with just a few kilobytes lost than to lose the whole
> partition.
> 
> Please help me assemble this array again; I will then immediately
> replace hda with another 500 GB drive that I have just bought:
> /dev/hdc3 is ready, fresh from the shop and waiting. I just don't
> know how to add it to the array now.
> 
> To summarize:
> 
> - /dev/md1 had the following layout beforehand:
> 
>     Number   Major   Minor   RaidDevice State
>        5       8        3        0      active sync   /dev/sda3
>        1       8       19        1      active sync   /dev/sdb3
>        3       3        3        2      active sync   /dev/hda3
>        4       8       51        3      active sync   /dev/sdd3
>        6       8       35        4      active sync   /dev/sdc3
> 
> - now it has the following layout:
> 
>     Number   Major   Minor   RaidDevice State
>        5       8        3        0      active sync   /dev/sda3
>        1       8       19        1      active sync   /dev/sdb3
>        2       0        0        2      removed
>        4       8       51        3      active sync   /dev/sdd3
>        4       0        0        4      removed
> 
>        3       3        3        -      faulty spare   /dev/hda3
>        6       8       35        -      spare   /dev/sdc3
> 
> - now /dev/sdc3 has been through a read-write test, so its contents
>   are useless, although the drive itself is in good condition
> - now /dev/hda3 contains valid filesystem data, except for the one
>   damaged sector that kicked it out of the array.
> 
> 1. I would like to reassemble the array using the (slightly damaged) /dev/hda3
> 2. then add the newly purchased /dev/hdc3, remove /dev/hda3, and add /dev/sdc3 too
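> 
> I guess step 1 means a forced assembly, something along the lines of:
> 
>   mdadm -S /dev/md1
>   mdadm -Af /dev/md1 /dev/sda3 /dev/sdb3 /dev/hda3 /dev/sdd3
> 
> but I don't want to run anything before someone confirms it is safe.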
> 
> Next I plan to migrate all that into a raid6 configuration. But let's first fix this.
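> 
> (From what I have read, with a recent enough mdadm and kernel the
> raid6 migration should be a single grow operation, roughly:
> 
>   mdadm --grow /dev/md1 --level=6 --raid-devices=6 --backup-file=/root/md1-grow.backup
> 
> where the backup file path is just an example. That is one more
> reason to upgrade this box first.)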
> 
> If you need any more information, please let me know. I also have
> /proc/mdstat output from before and after the problem.
> 
> I am sorry that I haven't upgraded my backup server yet:
> 
> backup:~# uname -a
> Linux backup 2.6.24-etchnhalf.1-686 #1 SMP Sat Aug 15 16:51:49 UTC 2009 i686 GNU/Linux
> 
> After we fix this I plan to upgrade to Debian squeeze and use the latest kernel.
> 
> best regards
> -- 
> Janek Kozicki                               http://janek.kozicki.pl/  |


-- 
Janek Kozicki                               http://janek.kozicki.pl/  |

