On Mon, 2013-01-07 at 23:19 -0600, Stan Hoeppner wrote:
> On 1/7/2013 8:05 PM, Ross Boylan wrote:
> > I see my array is reconstructing, but I can't tell which disk failed.
> >
> > md0 : active raid1 sda1[0] sdc2[2] sdb2[1]
> >       96256 blocks [3/3] [UUU]
> >
> > md1 : active raid1 sda3[0] sdc4[2] sdb4[1]
> >       730523648 blocks [3/3] [UUU]
>
> Your two md/RAID1 arrays are built on partitions on the same set of 3
> disks.  You likely didn't have a disk failure, or md0 would be
> rebuilding as well.  Your failure, or hiccup, is of some other nature,
> and apparently only affected md1.

I assume something went wrong while accessing one of the partitions,
and that there is a problem with the disk that partition is on.

Phrased more carefully, which partition failed and is being resynced
into md1?  I can't tell.  If I knew, would it be safe to mdadm --fail
that partition in the midst of the rebuild?

Once the system starts, md0 is almost never accessed (it's /boot).

> >       [>....................]  resync =  0.4% (3382400/730523648) finish=14164.9min speed=855K/sec
>
> Rebuilding a RAID1 on modern hardware should scream.  You're getting
> resync throughput of less than 1MB/s.  Estimated completion time is 9.8
> _days_ to rebuild a mirror partition.  This is insanely high.

Yes.  It seems to be doing better now:

# date; cat /proc/mdstat
Mon Jan 7 21:37:46 PST 2013
Personalities : [raid1]
md0 : active raid1 sda1[0] sdc2[2] sdb2[1]
      96256 blocks [3/3] [UUU]

md1 : active raid1 sda3[0] sdc4[2] sdb4[1]
      730523648 blocks [3/3] [UUU]
      [===========>.........]  resync = 57.8% (422846976/730523648) finish=452.5min speed=11329K/sec

unused devices: <none>

This is more in line with what I remember from when I originally synced
the partitions, which took 4-6 hours (it's clearly still slower than
that pace).

> Either you've tweaked your resync throughput down to 1MB/s, or you have
> some other process(es) doing serious IO, robbing the resync of
> throughput.

Isn't it possible there's a hardware problem, e.g., one leading to a
failure/retry cycle?

> Consider running iotop to determine if another process(es)
> is eating IO bandwidth.

I did, though it's probably a little late.  Here's a fairly typical
result (the command line is shown on the last line):

Total DISK READ: 99.09 K/s | Total DISK WRITE: 25.26 K/s
  PID  USER     DISK READ   DISK WRITE  SWAPIN      IO    COMMAND
 4263  root         0 B/s       0 B/s   0.00 %   8.40 %  [kjournald]
 1204  root     99.09 K/s       0 B/s   0.00 %   4.68 %  [kcopyd]
 1193  root         0 B/s       0 B/s   0.00 %   4.68 %  [kdmflush]
11874  root         0 B/s   25.26 K/s   0.00 %   0.00 %  python /usr/bin/iotop -d 2 -n 20 -b

When I restarted, the system had been effectively down for ~1.5 days,
so I guess it's possible that a lot of housekeeping was going on.
However, top didn't show any noticeable CPU use.

A more recent check shows the speed continuing to rise; if the value is
an average and it started slow, that would explain it:

date; cat /proc/mdstat
Mon Jan 7 22:56:23 PST 2013
Personalities : [raid1]
md0 : active raid1 sda1[0] sdc2[2] sdb2[1]
      96256 blocks [3/3] [UUU]

md1 : active raid1 sda3[0] sdc4[2] sdb4[1]
      730523648 blocks [3/3] [UUU]
      [==================>..]  resync = 91.8% (670929280/730523648) finish=19.4min speed=51057K/sec

Ross
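
A note on the "which partition is being resynced" question: both mdstat
snapshots show [3/3] [UUU], so no member of md1 is marked faulty, which
suggests a whole-array resync (e.g. after an unclean shutdown) rather
than a rebuild onto a replacement member.  One way to confirm which
case it is, assuming mdadm and sysfs are available (they are on any
system exposing /proc/mdstat like the above):

# mdadm --detail /dev/md1            # per-member state: "active sync", "faulty", or "spare rebuilding"
# cat /sys/block/md1/md/sync_action  # "resync" = consistency pass over all members, "recover" = rebuilding one member
# cat /sys/block/md1/md/degraded     # 0 means no member is missing or failed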
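
On the "tweaked your resync throughput" point: md's resync floor and
ceiling are the dev.raid.speed_limit_min and dev.raid.speed_limit_max
sysctls (also readable as /proc/sys/dev/raid/speed_limit_min and
speed_limit_max).  A quick sketch for checking them and, if ordinary
I/O is starving the resync, raising the floor; the 50000 figure here is
only illustrative:

# sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max
# sysctl -w dev.raid.speed_limit_min=50000   # KB/s; md tries to keep the resync at least this fast despite other I/O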
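
Regarding a possible failure/retry cycle: a member that is quietly
retrying bad sectors usually leaves traces in the kernel log and in its
SMART counters, so those are worth checking before failing anything out
of the array.  A minimal sketch, assuming smartmontools is installed
(repeat the smartctl line for sdb and sdc):

# dmesg | grep -iE 'ata[0-9]|I/O error|UNC' | tail -n 50
# smartctl -a /dev/sda | grep -iE 'reallocated|pending|uncorrect'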