Hi all.
I just wanted to check with those more clued up that I'm not missing
something important. To save you wading through the logs, in summary my
filesystem got borked after I recovered an array when I realised I'd
used the device and not the partition and corrected the mistake. I'd
like to know if it's just bad luck and is likely to be a ram corruption
of some sort, or if I did something wrong with my actions as I was under
the impression the validation would fail the disk if there was a
problem? I've done this before and it all went fine...
I have a remote server running Debian Testing...of relevance are the 3
SATA drives on a LSI SAS controller and the RAID5 in use on them, though
it also has two 2GB compactflash cards running in RAID1 with the root
filesystem on them.
At the weekend I pulled one of the drives to use elsewhere. I replaced
it with a new larger drive (old array drives are 500Gb, new one is 1.5TB
- I'll be getting more in due course to enlarge the array), added a new
partition over the whole disk, and added it to the array, which then
began to rebuild:
Old disk:
Disk /dev/sdc: 499.3 GB, 499279462400 bytes
255 heads, 63 sectors/track, 60700 cylinders, total 975155200 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x5932fff1
Device Boot Start End Blocks Id System
/dev/sdc1 63 975145499 487572718+ fd Linux raid
autodetect
New disk:
Disk /dev/sdd: 1500.3 GB, 1500301910016 bytes
81 heads, 63 sectors/track, 574226 cylinders, total 2930277168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xbcba78d3
Device Boot Start End Blocks Id System
/dev/sdf1 2048 2930277167 1465137560 fd Linux raid
autodetect
<pulls disk>
Feb 25 18:41:33 mrlinux kernel: [281921.556280] RAID conf printout:
Feb 25 18:41:33 mrlinux kernel: [281921.556288] --- level:5 rd:3 wd:2
Feb 25 18:41:33 mrlinux kernel: [281921.556294] disk 0, o:1, dev:sde1
Feb 25 18:41:33 mrlinux kernel: [281921.556299] disk 1, o:1, dev:sdc1
Feb 25 18:41:33 mrlinux kernel: [281921.556304] disk 2, o:0, dev:sdd1
Feb 25 18:41:33 mrlinux kernel: [281921.620023] RAID conf printout:
Feb 25 18:41:33 mrlinux kernel: [281921.620029] --- level:5 rd:3 wd:2
Feb 25 18:41:33 mrlinux kernel: [281921.620033] disk 0, o:1, dev:sde1
Feb 25 18:41:33 mrlinux kernel: [281921.620036] disk 1, o:1, dev:sdc1
<inserts new disk>
Feb 25 18:56:22 mrlinux kernel: [282809.983189] mptsas: ioc0:
attaching sata device: fw_channel 0, fw_id 2, phy 0, sas_addr 0
x455c3c41ddbfae97
Feb 25 18:56:22 mrlinux kernel: [282809.994670] scsi 5:0:2:0:
Direct-Access ATA WDC WD15EARS-00M AB51 PQ: 0 ANSI: 5
Feb 25 18:56:22 mrlinux kernel: [282809.996926] sd 5:0:2:0: [sdf]
2930277168 512-byte logical blocks: (1.50 TB/1.36 TiB)
Feb 25 18:56:22 mrlinux kernel: [282810.026435] sd 5:0:2:0: [sdf]
Write Protect is off
Feb 25 18:56:22 mrlinux kernel: [282810.026442] sd 5:0:2:0: [sdf] Mode
Sense: 73 00 00 08
Feb 25 18:56:22 mrlinux kernel: [282810.037183] sd 5:0:2:0: [sdf]
Write cache: enabled, read cache: enabled, doesn't support
DPO or FUA
Feb 25 18:56:22 mrlinux kernel: [282810.098805] sdf: unknown
partition table
Feb 25 18:56:22 mrlinux kernel: [282810.159468] sd 5:0:2:0: [sdf]
Attached SCSI disk
<partitions new disk>
Feb 25 18:59:08 mrlinux kernel: [282976.397358] sdf: sdf1
<adds to array>
Feb 25 18:59:42 mrlinux kernel: [283010.226883] md: bind<sdf>
Feb 25 18:59:42 mrlinux kernel: [283010.266108] RAID conf printout:
Feb 25 18:59:42 mrlinux kernel: [283010.266116] --- level:5 rd:3 wd:2
Feb 25 18:59:42 mrlinux kernel: [283010.266122] disk 0, o:1, dev:sde1
Feb 25 18:59:42 mrlinux kernel: [283010.266127] disk 1, o:1, dev:sdc1
Feb 25 18:59:42 mrlinux kernel: [283010.266132] disk 2, o:1, dev:sdf
Feb 25 18:59:42 mrlinux kernel: [283010.266226] md: recovery of RAID
array md1
Feb 25 18:59:42 mrlinux kernel: [283010.266235] md: minimum
_guaranteed_ speed: 1000 KB/sec/disk.
Feb 25 18:59:42 mrlinux kernel: [283010.266240] md: using maximum
available idle IO bandwidth (but not more than 200000 KB/sec) for
recovery.
Feb 25 18:59:42 mrlinux kernel: [283010.266255] md: using 128k window,
over a total of 487572608k.
<recovers 500GB in 03:56:38>
Feb 25 22:56:20 mrlinux kernel: [297208.335405] md: md1: recovery done.
Feb 25 22:56:20 mrlinux kernel: [297208.601814] RAID conf printout:
Feb 25 22:56:20 mrlinux kernel: [297208.601822] --- level:5 rd:3 wd:3
Feb 25 22:56:20 mrlinux kernel: [297208.601828] disk 0, o:1, dev:sde1
Feb 25 22:56:20 mrlinux kernel: [297208.601833] disk 1, o:1, dev:sdc1
Feb 25 22:56:20 mrlinux kernel: [297208.601837] disk 2, o:1, dev:sdf
The next day I noticed I'd made a mistake - I'd added /dev/sdf and not
/dev/sdf1. I've had this before and it confused the system during boot
as it saw two superblocks, so I failed the disk, removed it from the
array, then recreated the partition table that had been obliterated. I
then re-added the disk (or, more accurately, the partition) to the array:
Feb 26 09:50:52 mrlinux kernel: [336480.481487] md: cannot remove
active disk sdf from md1 ...
<remembers you have to fail a disk before you can remove it...and does so>
Feb 26 09:51:47 mrlinux kernel: [336535.564743] md/raid:md1: Disk
failure on sdf, disabling device.
Feb 26 09:51:47 mrlinux kernel: [336535.564746] md/raid:md1: Operation
continuing on 2 devices.
Feb 26 09:51:47 mrlinux kernel: [336535.597393] RAID conf printout:
Feb 26 09:51:47 mrlinux kernel: [336535.597399] --- level:5 rd:3 wd:2
Feb 26 09:51:47 mrlinux kernel: [336535.597403] disk 0, o:1, dev:sde1
Feb 26 09:51:47 mrlinux kernel: [336535.597406] disk 1, o:1, dev:sdc1
Feb 26 09:51:47 mrlinux kernel: [336535.597409] disk 2, o:0, dev:sdf
Feb 26 09:51:48 mrlinux kernel: [336535.664011] RAID conf printout:
Feb 26 09:51:48 mrlinux kernel: [336535.664017] --- level:5 rd:3 wd:2
Feb 26 09:51:48 mrlinux kernel: [336535.664020] disk 0, o:1, dev:sde1
Feb 26 09:51:48 mrlinux kernel: [336535.664023] disk 1, o:1, dev:sdc1
<removes disk from array>
Feb 26 09:52:11 mrlinux kernel: [336558.705730] md: unbind<sdf>
Feb 26 09:52:11 mrlinux kernel: [336558.769675] md: export_rdev(sdf)
<repartitions disk>
Feb 26 09:54:06 mrlinux kernel: [336674.474874] sdf: sdf1
<re-adds disk to array>
Feb 26 09:54:19 mrlinux kernel: [336687.596415] md: bind<sdf1>
Feb 26 09:54:19 mrlinux kernel: [336687.636078] RAID conf printout:
Feb 26 09:54:19 mrlinux kernel: [336687.636087] --- level:5 rd:3 wd:2
Feb 26 09:54:19 mrlinux kernel: [336687.636094] disk 0, o:1, dev:sde1
Feb 26 09:54:19 mrlinux kernel: [336687.636099] disk 1, o:1, dev:sdc1
Feb 26 09:54:19 mrlinux kernel: [336687.636105] disk 2, o:1, dev:sdf1
Feb 26 09:54:19 mrlinux kernel: [336687.636308] md: recovery of RAID
array md1
Feb 26 09:54:19 mrlinux kernel: [336687.636317] md: minimum
_guaranteed_ speed: 1000 KB/sec/disk.
Feb 26 09:54:19 mrlinux kernel: [336687.636322] md: using maximum
available idle IO bandwidth (but not more than 200000 KB/sec) for
recovery.
Feb 26 09:54:19 mrlinux kernel: [336687.636342] md: using 128k window,
over a total of 487572608k.
<recovers 500Gb in 00:00:09>
Feb 26 09:54:28 mrlinux kernel: [336696.647039] md: md1: recovery done.
Feb 26 09:54:29 mrlinux kernel: [336696.726098] RAID conf printout:
Feb 26 09:54:29 mrlinux kernel: [336696.726106] --- level:5 rd:3 wd:3
Feb 26 09:54:29 mrlinux kernel: [336696.726112] disk 0, o:1, dev:sde1
Feb 26 09:54:29 mrlinux kernel: [336696.726117] disk 1, o:1, dev:sdc1
Feb 26 09:54:29 mrlinux kernel: [336696.726122] disk 2, o:1, dev:sdf1
Here's the strange thing, it recovered very quickly, which I thought was
nice, but I wonder if it created a problem, as about an hour later I
started getting errors in my logs:
Feb 26 10:34:05 mrlinux kernel: [339073.432724] attempt to access
beyond end of device
Feb 26 10:34:05 mrlinux kernel: [339073.432732] dm-1: rw=0,
want=15613016656, limit=58589184
Feb 26 10:34:05 mrlinux kernel: [339073.533896] attempt to access
beyond end of device
Feb 26 10:34:05 mrlinux kernel: [339073.533905] dm-0: rw=0,
want=1681676296, limit=97648640
Feb 26 10:34:05 mrlinux kernel: [339073.533916] attempt to access
beyond end of device
Feb 26 10:34:05 mrlinux kernel: [339073.533920] dm-0: rw=0,
want=18656264368, limit=97648640
Feb 26 10:34:05 mrlinux kernel: [339073.533945] attempt to access
beyond end of device
Feb 26 10:34:05 mrlinux kernel: [339073.533950] dm-0: rw=0,
want=18289707016, limit=97648640
Feb 26 10:34:05 mrlinux kernel: [339073.533955] attempt to access
beyond end of device
...and eventually:
Feb 26 10:34:05 mrlinux kernel: [339073.534443] dm-0: rw=0,
want=16783872552, limit=97648640
Feb 26 10:34:05 mrlinux kernel: [339073.534447] attempt to access
beyond end of device
Feb 26 10:34:05 mrlinux kernel: [339073.534450] dm-0: rw=0,
want=17686087704, limit=97648640
Feb 26 10:34:05 mrlinux kernel: [339073.534515] attempt to access
beyond end of device
Feb 26 10:34:05 mrlinux kernel: [339073.534520] dm-0: rw=0,
want=16716398592, limit=97648640
Feb 26 10:37:57 mrlinux kernel: [339305.646807] EXT3-fs error (device
dm-1): ext3_add_entry: bad entry in directory #1243562: rec_len % 4 !=
0 - offset=0, inode=873485355, rec_len=14129, name_len=108
Feb 26 10:37:57 mrlinux kernel: [339305.647224] EXT3-fs error (device
dm-1): ext3_add_entry: bad entry in directory #1243562: rec_len % 4 !=
0 - offset=0, inode=873485355, rec_len=14129, name_len=108
Feb 26 10:37:57 mrlinux kernel: [339305.647591] EXT3-fs error (device
dm-1): ext3_add_entry: bad entry in directory #1243562: rec_len % 4 !=
0 - offset=0, inode=873485355, rec_len=14129, name_len=108
Feb 26 10:37:57 mrlinux kernel: [339305.647912] EXT3-fs error (device
dm-1): ext3_add_entry: bad entry in directory #1243562: rec_len % 4 !=
0 - offset=0, inode=873485355, rec_len=14129, name_len=108
Feb 26 10:37:57 mrlinux kernel: [339305.657177] attempt to access
beyond end of device
Feb 26 10:37:57 mrlinux kernel: [339305.657185] dm-0: rw=0,
want=2157354904, limit=97648640
Feb 26 10:37:57 mrlinux kernel: [339305.657192] attempt to access
beyond end of device
Feb 26 10:37:57 mrlinux kernel: [339305.657196] dm-0: rw=0,
want=2157358136, limit=97648640
Feb 26 10:37:57 mrlinux kernel: [339305.657202] attempt to access
beyond end of device
...and finally, and hour after that when I noticed and tried to check
the new disk:
Feb 26 11:28:19 mrlinux kernel: [342327.335517] dm-0: rw=0,
want=12776039208, limit=97648640
Feb 26 11:28:19 mrlinux kernel: [342327.335523] attempt to access
beyond end of device
Feb 26 11:28:19 mrlinux kernel: [342327.335527] dm-0: rw=0,
want=6906147664, limit=97648640
Feb 26 11:28:19 mrlinux kernel: [342327.368414] apache2[12760]:
segfault at 0 ip b7794e4f sp bffd1245 error 6 in apache2[b7752000+69000]
Feb 26 11:30:43 mrlinux kernel: [342471.157911] smartctl[11466]:
segfault at 4 ip b77b6a72 sp bfecf650 error 4 in
ld-2.13.so[b77ac000+1c000]
Feb 26 11:32:24 mrlinux kernel: [342572.338509] attempt to access
beyond end of device
Feb 26 11:32:24 mrlinux kernel: [342572.338518] dm-0: rw=0,
want=6906147664, limit=97648640
Feb 26 11:32:24 mrlinux kernel: [342572.338554] apache2[24950]:
segfault at 0 ip b7794e4f sp bffd1245 error 6 in apache2[b7752000+69000]
Feb 26 11:53:28 mrlinux kernel: [343835.861163] EXT3-fs error (device
dm-1): ext3_add_entry: bad entry in directory #1243562: rec_len % 4 !=
0 - offset=0, inode=873485355, rec_len=14129, name_len=108
Feb 26 11:53:28 mrlinux kernel: [343835.901372] EXT3-fs error (device
dm-1): ext3_add_entry: bad entry in directory #1243562: rec_len % 4 !=
0 - offset=0, inode=873485355, rec_len=14129, name_len=108
Feb 26 11:53:28 mrlinux kernel: [343835.930939] EXT3-fs error (device
dm-1): ext3_add_entry: bad entry in directory #1243562: rec_len % 4 !=
0 - offset=0, inode=873485355, rec_len=14129, name_len=108
Feb 26 11:53:28 mrlinux kernel: [343835.963208] EXT3-fs error (device
dm-1): ext3_add_entry: bad entry in directory #1243562: rec_len % 4 !=
0 - offset=0, inode=873485355, rec_len=14129, name_len=108
Feb 26 11:53:28 mrlinux kernel: [343835.965505] attempt to access
beyond end of device
Feb 26 11:53:28 mrlinux kernel: [343835.965514] dm-0: rw=0,
want=14955012880, limit=97648640
Feb 26 12:08:15 mrlinux kernel: [344722.924052] EXT3-fs error (device
dm-1): ext3_add_entry: bad entry in directory #408850: rec_len % 4 !=
0 - offset=0, inode=134901586, rec_len=17695, name_len=24
Feb 26 12:08:17 mrlinux kernel: [344724.810787] EXT3-fs (dm-1): error
in ext3_new_inode: IO failure
Feb 26 12:08:17 mrlinux kernel: [344724.885464] attempt to access
beyond end of device
,,,and some more for good measure:
Feb 26 12:08:35 mrlinux kernel: [344743.007400] EXT3-fs error (device
dm-1): ext3_free_blocks: Freeing blocks not in datazone
- block = 858599726, count = 1
Feb 26 12:08:35 mrlinux kernel: [344743.069558] EXT3-fs error (device
dm-1): ext3_free_blocks: Freeing blocks not in datazone
- block = 1650811950, count = 1
Having rebooted the segfaults are gone, and I can confirm that the new
disk seems fine:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED
WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail
Always - 0
3 Spin_Up_Time 0x0027 100 253 021 Pre-fail
Always - 0
4 Start_Stop_Count 0x0032 100 100 000 Old_age
Always - 4
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail
Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age
Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age
Always - 46
10 Spin_Retry_Count 0x0032 100 253 000 Old_age
Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age
Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age
Always - 4
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age
Always - 3
193 Load_Cycle_Count 0x0032 200 200 000 Old_age
Always - 108
194 Temperature_Celsius 0x0022 131 113 000 Old_age
Always - 19
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age
Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age
Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age
Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age
Always - 0
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age
Offline - 0
SMART Error Log Version: 1
No Errors Logged
However, all three of my data partitions on the RAID5 volume have errors
beyond what the boot time fsck -a/-p will repair, so I'm looking into
those. Seems fine though as I was able to mount them and copy everything
to an external disk fine before recovery.
If you made it this far, thanks! Any pointers?
- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html