OK, I can force assembly of this array:

backup:~# mdadm -S /dev/md1
mdadm: stopped /dev/md1
backup:~# mdadm -Af /dev/md1 /dev/sda3 /dev/sdb3 /dev/hda3 /dev/sdd3
mdadm: forcing event count in /dev/hda3(2) from 742923 upto 742950
mdadm: clearing FAULTY flag for device 2 in /dev/md1 for /dev/hda3
mdadm: /dev/md1 has been started with 4 drives (out of 5).

But hda3 still has a pending bad sector, so if I started a rebuild, hda3
would get dropped from the array again. I need to overwrite the bad sector
with zeros. I have a free hdc3, which I am going to use for that:

# dd bs=512 if=/dev/hda3 of=/dev/hdc3 conv=noerror,sync

noerror tells dd to continue even if it cannot read, and sync tells it to
pad the unreadable block with zeros on hdc3 (at least I hope I read the
manpage right). I will then have an almost exact copy of the bad hda3 on
hdc3. Then I can either dd it back, or maybe I could just --add hdc3 in
place of hda3? (A per-sector alternative is sketched below, right after the
quoted badblocks output.) This will take some time to complete, so let's
wait.

best regards
Janek Kozicki


Janek Kozicki said:     (by the date of Fri, 2 Jul 2010 16:31:55 +0200)

> Hello,
>
> The following is a lengthy and full explanation of my problem. A short
> summary is at the end. It is possible that you already know the answer
> without reading this; if so, please just scroll down and help me ;-)
>
> I saw smartmontools reporting that sdc has:
>
> Jul 1 10:19:52 backup_ smartd[2793]: Device: /dev/sdc, 1 Currently unreadable (pending) sectors
>
> My md1 layout at that time was the following:
>
> /dev/md1:
>         Version : 01.01.03
>   Creation Time : Fri Nov 2 23:35:37 2007
>      Raid Level : raid5
>      Array Size : 1933614592 (1844.04 GiB 1980.02 GB)
>     Device Size : 966807296 (461.01 GiB 495.01 GB)
>    Raid Devices : 5
>   Total Devices : 5
> Preferred Minor : 1
>     Persistence : Superblock is persistent
>
>   Intent Bitmap : Internal
>
>     Update Time : Thu Jul 1 11:14:03 2010
>           State : active
>  Active Devices : 5
> Working Devices : 5
>  Failed Devices : 0
>   Spare Devices : 0
>
>          Layout : left-symmetric
>      Chunk Size : 128K
>
>            Name : backup:1  (local to host backup)
>            UUID : 22f22c35:99613d52:31d407a6:55bdeb84
>          Events : 718999
>
>     Number   Major   Minor   RaidDevice State
>        5       8        3        0      active sync   /dev/sda3
>        1       8       19        1      active sync   /dev/sdb3
>        3       3        3        2      active sync   /dev/hda3
>        4       8       51        3      active sync   /dev/sdd3
>        6       8       35        4      active sync   /dev/sdc3
>
> I wanted to get this corrected, so I ran the following command, but it
> didn't help:
>
> /usr/share/mdadm/checkarray -a
>
> So I wanted to test sdc more thoroughly, to get this fixed:
>
> $ mdadm --fail /dev/md1 /dev/sdc3
> mdadm: set /dev/sdc3 faulty in /dev/md1
>
> $ mdadm --remove /dev/md1 /dev/sdc3
> mdadm: hot removed /dev/sdc3
>
> $ badblocks -c 10240 -s -w -t random -v /dev/sdc3
> Checking for bad blocks in read-write mode
> From block 0 to 483403882
> Testing with random pattern: done
> Reading and comparing: done
> Pass completed, 0 bad blocks found.
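A minimal sketch of the per-sector alternative mentioned above, instead of
cloning the whole partition: the kernel log quoted further down reports the
failing sector as 763095168 on hda3, so that single 512-byte sector could be
overwritten with zeros in place. The offset is only taken from that log and
must be verified first (the read below should fail), ideally with /dev/md1
stopped, because a wrong seek value would destroy good data:

backup:~# dd if=/dev/hda3 of=/dev/null bs=512 skip=763095168 count=1
backup:~# dd if=/dev/zero of=/dev/hda3 bs=512 seek=763095168 count=1
backup:~# sync

Writing to a pending sector normally makes the drive rewrite it in place or
remap it, so a later rebuild should not trip over it again; only the 512
bytes that lived there are lost.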
>
> $ mdadm --add /dev/md1 /dev/sdc3
> mdadm: added /dev/sdc3
>
> $ mdadm -D /dev/md1
> /dev/md1:
>         Version : 01.01.03
>   Creation Time : Fri Nov 2 23:35:37 2007
>      Raid Level : raid5
>      Array Size : 1933614592 (1844.04 GiB 1980.02 GB)
>     Device Size : 966807296 (461.01 GiB 495.01 GB)
>    Raid Devices : 5
>   Total Devices : 5
> Preferred Minor : 1
>     Persistence : Superblock is persistent
>
>   Intent Bitmap : Internal
>
>     Update Time : Thu Jul 1 18:19:47 2010
>           State : active, degraded, recovering
>  Active Devices : 4
> Working Devices : 5
>  Failed Devices : 0
>   Spare Devices : 1
>
>          Layout : left-symmetric
>      Chunk Size : 128K
>
>  Rebuild Status : 0% complete
>
>            Name : backup:1  (local to host backup)
>            UUID : 22f22c35:99613d52:31d407a6:55bdeb84
>          Events : 733802
>
>     Number   Major   Minor   RaidDevice State
>        5       8        3        0      active sync   /dev/sda3
>        1       8       19        1      active sync   /dev/sdb3
>        3       3        3        2      active sync   /dev/hda3
>        4       8       51        3      active sync   /dev/sdd3
>        6       8       35        4      spare rebuilding   /dev/sdc3
>
> OK, so I left it to rebuild. But then... hda3 failed.
>
> I wasn't there when it happened, but this is what I see in syslog:
>
> Jul 1 22:34:49 backup_ mdadm: Rebuild60 event detected on md device /dev/md1
> Jul 1 22:49:51 backup_ smartd[2793]: Device: /dev/hda, SMART Usage Attribute: 194 Temperature_Celsius changed from 122 to 121
> Jul 1 22:49:51 backup_ smartd[2793]: Device: /dev/sda, SMART Usage Attribute: 194 Temperature_Celsius changed from 214 to 222
> Jul 1 22:49:52 backup_ smartd[2793]: Device: /dev/sdd, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 118 to 108
> Jul 1 22:49:52 backup_ smartd[2793]: Device: /dev/sdd, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 47 to 48
> Jul 1 23:42:25 backup_ uptimed: moving up to position 33: 8 days, 05:23:01
> Jul 1 23:49:51 backup_ smartd[2793]: Device: /dev/sda, SMART Usage Attribute: 194 Temperature_Celsius changed from 214 to 222
> Jul 1 23:49:53 backup_ smartd[2793]: Device: /dev/sdd, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 115 to 117
> Jul 1 23:49:53 backup_ smartd[2793]: Device: /dev/sdd, SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 68 to 69
> Jul 1 23:49:53 backup_ smartd[2793]: Device: /dev/sdd, SMART Usage Attribute: 194 Temperature_Celsius changed from 32 to 31
> Jul 1 23:49:53 backup_ smartd[2793]: Device: /dev/sdd, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 47 to 48
> Jul 1 23:55:25 backup_ uptimed: moving up to position 32: 8 days, 05:36:01
> Jul 1 23:58:07 backup_ kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> Jul 1 23:58:07 backup_ kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=773056484, high=46, low=1304548, sector=773055468
> Jul 1 23:58:07 backup_ kernel: ide: failed opcode was: unknown
> Jul 1 23:58:07 backup_ kernel: end_request: I/O error, dev hda, sector 773055468
> Jul 1 23:58:07 backup_ kernel: raid5:md1: read error not correctable (sector 763095168 on hda3).
> Jul 1 23:58:07 backup_ kernel: raid5: Disk failure on hda3, disabling device.
> Operation continuing on 3 devices
> Jul 1 23:58:10 backup_ kernel: Buffer I/O error on device md1, logical block 26544335
> Jul 1 23:58:10 backup_ kernel: lost page write due to I/O error on md1
> Jul 1 23:58:10 backup_ kernel: Buffer I/O error on device md1, logical block 26544336
> Jul 1 23:58:10 backup_ kernel: lost page write due to I/O error on md1
> Jul 1 23:58:10 backup_ kernel: Buffer I/O error on device md1, logical block 26544337
> Jul 1 23:58:10 backup_ kernel: lost page write due to I/O error on md1
> Jul 1 23:58:10 backup_ kernel: Buffer I/O error on device md1, logical block 26544338
> Jul 1 23:58:10 backup_ kernel: lost page write due to I/O error on md1
> Jul 1 23:58:10 backup_ kernel: Buffer I/O error on device md1, logical block 26544339
> Jul 1 23:58:10 backup_ kernel: lost page write due to I/O error on md1
> Jul 1 23:58:10 backup_ kernel: Buffer I/O error on device md1, logical block 26544340
> Jul 1 23:58:10 backup_ kernel: lost page write due to I/O error on md1
> Jul 1 23:58:10 backup_ kernel: Buffer I/O error on device md1, logical block 26544341
> Jul 1 23:58:10 backup_ kernel: lost page write due to I/O error on md1
> Jul 1 23:58:10 backup_ kernel: Buffer I/O error on device md1, logical block 26544342
> Jul 1 23:58:10 backup_ kernel: lost page write due to I/O error on md1
> Jul 1 23:58:10 backup_ kernel: Buffer I/O error on device md1, logical block 26544343
> Jul 1 23:58:10 backup_ kernel: lost page write due to I/O error on md1
> Jul 1 23:58:10 backup_ kernel: Buffer I/O error on device md1, logical block 26544344
> Jul 1 23:58:10 backup_ kernel: lost page write due to I/O error on md1
> Jul 1 23:58:10 backup_ kernel: Aborting journal on device md1.
> Jul 1 23:58:10 backup_ kernel: md: md1: recovery done.
> Jul 1 23:58:10 backup_ mdadm: Fail event detected on md device /dev/md1, component device /dev/hda3
> Jul 1 23:58:10 backup_ kernel: ext3_abort called.
> Jul 1 23:58:10 backup_ kernel: EXT3-fs error (device md1): ext3_journal_start_sb: Detected aborted journal
> Jul 1 23:58:10 backup_ kernel: Remounting filesystem read-only
> Jul 1 23:58:10 backup_ kernel: RAID5 conf printout:
> Jul 1 23:58:10 backup_ kernel: --- rd:5 wd:3
> Jul 1 23:58:10 backup_ kernel: disk 0, o:1, dev:sda3
> Jul 1 23:58:10 backup_ kernel: disk 1, o:1, dev:sdb3
> Jul 1 23:58:10 backup_ kernel: disk 2, o:0, dev:hda3
> Jul 1 23:58:10 backup_ kernel: disk 3, o:1, dev:sdd3
> Jul 1 23:58:10 backup_ kernel: disk 4, o:1, dev:sdc3
> Jul 1 23:58:10 backup_ kernel: RAID5 conf printout:
> Jul 1 23:58:10 backup_ kernel: --- rd:5 wd:3
> Jul 1 23:58:10 backup_ kernel: disk 0, o:1, dev:sda3
> Jul 1 23:58:10 backup_ kernel: disk 1, o:1, dev:sdb3
> Jul 1 23:58:10 backup_ kernel: disk 2, o:0, dev:hda3
> Jul 1 23:58:10 backup_ kernel: disk 3, o:1, dev:sdd3
> Jul 1 23:58:10 backup_ kernel: RAID5 conf printout:
> Jul 1 23:58:10 backup_ kernel: --- rd:5 wd:3
> Jul 1 23:58:10 backup_ kernel: disk 0, o:1, dev:sda3
> Jul 1 23:58:10 backup_ kernel: disk 1, o:1, dev:sdb3
> Jul 1 23:58:10 backup_ kernel: disk 2, o:0, dev:hda3
> Jul 1 23:58:10 backup_ kernel: disk 3, o:1, dev:sdd3
> Jul 1 23:58:10 backup_ kernel: RAID5 conf printout:
> Jul 1 23:58:10 backup_ kernel: --- rd:5 wd:3
> Jul 1 23:58:10 backup_ kernel: disk 0, o:1, dev:sda3
> Jul 1 23:58:10 backup_ kernel: disk 1, o:1, dev:sdb3
> Jul 1 23:58:10 backup_ kernel: disk 3, o:1, dev:sdd3
> Jul 1 23:58:10 backup_ mdadm: RebuildFinished event detected on md device /dev/md1
> Jul 2 00:19:50 backup_ smartd[2793]: Device: /dev/hda, 1 Currently unreadable (pending) sectors
> Jul 2 00:19:50 backup_ smartd[2793]: Sending warning via /usr/share/smartmontools/smartd-runner to root ...
> Jul 2 00:19:50 backup_ smartd[2793]: Warning via /usr/share/smartmontools/smartd-runner to root: successful
> Jul 2 00:19:50 backup_ smartd[2793]: Device: /dev/hda, SMART Usage Attribute: 194 Temperature_Celsius changed from 121 to 122
> Jul 2 00:19:50 backup_ smartd[2793]: Device: /dev/hda, ATA error count increased from 0 to 1
> Jul 2 00:19:50 backup_ smartd[2793]: Sending warning via /usr/share/smartmontools/smartd-runner to root ...
> Jul 2 00:19:50 backup_ smartd[2793]: Warning via /usr/share/smartmontools/smartd-runner to root: successful
> Jul 2 00:19:50 backup_ smartd[2793]: Device: /dev/sda, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 100 to 94
> Jul 2 00:19:51 backup_ smartd[2793]: Device: /dev/sdb, SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 69 to 70
> Jul 2 00:19:51 backup_ smartd[2793]: Device: /dev/sdb, SMART Usage Attribute: 194 Temperature_Celsius changed from 31 to 30
> Jul 2 00:19:51 backup_ smartd[2793]: Device: /dev/sdb, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 66 to 64
> Jul 2 00:19:51 backup_ smartd[2793]: Device: /dev/sdc, SMART Usage Attribute: 194 Temperature_Celsius changed from 115 to 116
> Jul 2 00:19:51 backup_ smartd[2793]: Device: /dev/sdd, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 117 to 118
> Jul 2 00:19:51 backup_ smartd[2793]: Device: /dev/sdd, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 48 to 49
>
> So in the end it is /dev/hda that has "1 Currently unreadable (pending)
> sectors" -- the very error that originally got me worried about sdc.
> Except that this time my raid5 is down:
>
> $ mdadm -D /dev/md1
> /dev/md1:
>         Version : 01.01.03
>   Creation Time : Fri Nov 2 23:35:37 2007
>      Raid Level : raid5
>      Array Size : 1933614592 (1844.04 GiB 1980.02 GB)
>     Device Size : 966807296 (461.01 GiB 495.01 GB)
>    Raid Devices : 5
>   Total Devices : 5
> Preferred Minor : 1
>     Persistence : Superblock is persistent
>
>   Intent Bitmap : Internal
>
>     Update Time : Fri Jul 2 15:25:45 2010
>           State : active, degraded
>  Active Devices : 3
> Working Devices : 4
>  Failed Devices : 1
>   Spare Devices : 1
>
>          Layout : left-symmetric
>      Chunk Size : 128K
>
>            Name : backup:1  (local to host backup)
>            UUID : 22f22c35:99613d52:31d407a6:55bdeb84
>          Events : 742938
>
>     Number   Major   Minor   RaidDevice State
>        5       8        3        0      active sync   /dev/sda3
>        1       8       19        1      active sync   /dev/sdb3
>        2       0        0        2      removed
>        4       8       51        3      active sync   /dev/sdd3
>        4       0        0        4      removed
>
>        3       3        3        -      faulty spare   /dev/hda3
>        6       8       35        -      spare   /dev/sdc3
>
> The filesystem /backup is still mounted, albeit read-only:
>
> backup:~$ pydf
> Filesystem  Size  Used Avail Use%            Mounted on
> /dev/md/0   942M  836M   77M 88.8 [######  ] /
> /dev/md/1  1815G 1736G   24G 95.6 [#######]  /backup
> udev         10M  128k   10M  1.3 [        ] /dev
> tmpfs       379M     0  379M  0.0 [        ] /dev/shm
> tmpfs       379M     0  379M  0.0 [        ] /lib/init/rw
>
> backup:~$ ls -la /backup/
> total 80
> drwxr-xr-x 29 root    root    4096 Jul  1 16:20 .
> drwxr-xr-x 24 root    root    4096 Mar  4 14:15 ..
> drwxr-xr-x 11 salomea salomea 4096 Jul  1 23:46 .mldonkey
> drwxr-xr-x 10 root    root    4096 Jul  1 14:59 .sync
> ?---------  ? ?       ?          ?            ? /backup/1_daily.4
> ?---------  ? ?       ?          ?            ? /backup/1_daily.5
> ?---------  ? ?       ?          ?            ? /backup/1_daily.6
> ?---------  ? ?       ?          ?            ? /backup/2_weekly.0
> ?---------  ? ?       ?          ?            ? /backup/3_monthly.3
> ?---------  ? ?       ?          ?            ? /backup/3_monthly.5
> ?---------  ? ?       ?          ?            ? /backup/lost+found
> drwxr-xr-x 10 root    root    4096 Jul  1 14:59 0_hourly.0
> drwxr-xr-x 10 root    root    4096 Jul  1 02:23 0_hourly.1
> drwxr-xr-x 10 root    root    4096 Jun 30 18:31 0_hourly.2
> drwxr-xr-x 10 root    root    4096 Jun 30 14:54 1_daily.0
> drwxr-xr-x 10 root    root    4096 Jun 29 14:50 1_daily.1
> drwxr-xr-x 10 root    root    4096 Jun 28 14:21 1_daily.2
> drwxr-xr-x 10 root    root    4096 Jun 26 14:34 1_daily.3
> drwxr-xr-x 10 root    root    4096 Jun 16 15:00 2_weekly.1
> drwxr-xr-x 10 root    root    4096 Jun  8 14:33 2_weekly.2
> drwxr-xr-x 10 root    root    4096 Jun  1 14:40 2_weekly.3
> drwxr-xr-x 10 root    root    4096 May 24 14:32 3_monthly.0
> drwxr-xr-x 10 root    root    4096 Apr 17 14:59 3_monthly.1
> drwxr-xr-x 10 root    root    4096 Mar 24 02:19 3_monthly.2
> drwxr-xr-x 10 root    root    4096 Jan 23 10:50 3_monthly.4
>
> I suppose this is a classic situation. I decided to wait for your help
> before making it worse. sdc is irrecoverable, because I tested it in
> read-write mode. But hda has been removed only because of 1 damaged
> sector. I would much prefer to have all 2 TB intact with just a few
> kilobytes lost than to lose the whole partition.
>
> Please help me assemble this array back together, and I will instantly
> replace hda with another 500 GB drive that I have just bought:
> /dev/hdc3 is ready, fresh from the shop and waiting. I just don't know
> how to add it to the array now.
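Assuming the forced assembly shown at the top of this message holds, and
assuming the pending sector on hda3 has been dealt with first (otherwise the
rebuild reads would kick hda3 out again), adding the new drive should be a
plain hot-add; md starts recovery onto it automatically because the array is
degraded. A rough sketch, using the device names from this thread:

mdadm /dev/md1 --add /dev/hdc3      # hot-add the new disk; recovery starts on its own
cat /proc/mdstat                    # watch recovery progress
mdadm /dev/md1 --fail /dev/hda3 --remove /dev/hda3   # only once recovery has finished

After hda3 is out, /dev/sdc3 (or another spare) can be --add'ed back the
same way to fill the slot it leaves.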
>
> To summarize:
>
> - /dev/md1 had the following layout beforehand:
>
>     Number   Major   Minor   RaidDevice State
>        5       8        3        0      active sync   /dev/sda3
>        1       8       19        1      active sync   /dev/sdb3
>        3       3        3        2      active sync   /dev/hda3
>        4       8       51        3      active sync   /dev/sdd3
>        6       8       35        4      active sync   /dev/sdc3
>
> - now it has the following layout:
>
>     Number   Major   Minor   RaidDevice State
>        5       8        3        0      active sync   /dev/sda3
>        1       8       19        1      active sync   /dev/sdb3
>        2       0        0        2      removed
>        4       8       51        3      active sync   /dev/sdd3
>        4       0        0        4      removed
>
>        3       3        3        -      faulty spare   /dev/hda3
>        6       8       35        -      spare   /dev/sdc3
>
> - /dev/sdc3 went through a destructive read-write test, so its contents
>   are unusable, although the drive itself is in good condition
> - /dev/hda3 contains valid filesystem data, except for one damaged
>   sector, which kicked it out of the array
>
> 1. I would like to reassemble the array using the (slightly damaged) /dev/hda3
> 2. then add the newly purchased /dev/hdc3, remove /dev/hda3, and add /dev/sdc3 too
>
> Next I plan to migrate all of that to a raid6 configuration. But let's
> fix this first.
>
> If you need any more information, please let me know. I also have
> /proc/mdstat output from before and after the problem.
>
> I am sorry that I haven't upgraded my backup server yet:
>
> backup:~# uname -a
> Linux backup 2.6.24-etchnhalf.1-686 #1 SMP Sat Aug 15 16:51:49 UTC 2009 i686 GNU/Linux
>
> After we fix this I plan to upgrade to Debian squeeze and use the latest kernel.
>
> best regards
> --
> Janek Kozicki                                 http://janek.kozicki.pl/ |
--
Janek Kozicki                                   http://janek.kozicki.pl/ |
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html