iostat messed up with md on 2.6.16.x

Hi,

 I upgraded my kernel from 2.6.15.6 to 2.6.16.16, and now 'iostat -x
1' permanently shows 100% utilisation on every disk that is a member of
an md array. I asked a friend who runs three boxes with 2.6.16.2,
2.6.16.9 and 2.6.16.11 and raid1; he reports the same. Does it work for
anyone? I don't think it's strictly an md problem, but it only shows up
with md, so I'm writing here.

 I did some basic debugging this evening and I think the problem is
that disk->in_flight is decremented twice in block/ll_rw_blk.c  -  I
don't know why, but here's a sample line from /proc/diskstats after a
raid array is assembled:

    8    0 sda 52 1134 8256 568 3 7 24 16 4294967295 433820 4294534144
                                          ^^^^^^^^^^ in_flight = -1
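That 4294967295 is 0xffffffff: the counter is an unsigned int, so one decrement too many wraps it to 2^32 - 1 instead of going to -1. A standalone C sketch of the wrap (illustration only, not kernel code; the function name is mine):

```c
/* disk->in_flight is an unsigned counter, so decrementing it once more
 * than it was incremented does not give -1: it wraps around to
 * 2^32 - 1 = 4294967295, exactly the value shown above, and iostat's
 * utilisation computed from it then pins at 100%. */
static unsigned int double_decrement(unsigned int in_flight)
{
	in_flight--;		/* the legitimate completion */
	in_flight--;		/* the spurious second decrement */
	return in_flight;
}
```

Starting from 1, two decrements yield 4294967295, which is what lands in /proc/diskstats.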

 I wrote an ugly workaround and now iostat works correctly [see
attach#1], but if this is a real bug, someone should find its root
cause, please.


 ps: the 'proactive disk replacement' patch for raid5 has been ported
to 2.6.16 and tested by some of my friends, and it now works for us;
lots of bugs have been fixed since my last post. If you want to give it
a try, I've attached that too (and sorry for the big post if you're not
interested).



--
 dap


[Translated from Hungarian:] under raid arrays, in_flight wraps around to -1 when things start up; here's the fix
cyrax  2.6.16.2  2.6.16.9  2.6.16.11
dap    2.6.16.16


--- linux-2.6.16.16/block/ll_rw_blk.c.orig	2006-05-24 01:10:34.000000000 +0200
+++ linux-2.6.16.16/block/ll_rw_blk.c	2006-05-24 01:36:45.000000000 +0200
@@ -2746,7 +2746,10 @@
 
 	if (req->rq_disk) {
 		disk_round_stats(req->rq_disk);
-		req->rq_disk->in_flight--;
+		if (req->rq_disk->in_flight > 0)
+		    req->rq_disk->in_flight--;
+		else
+		    printk("attempt_merge: assert(req->rq_disk->in_flight) failed\n");
 	}
 
 	req->ioprio = ioprio_best(req->ioprio, next->ioprio);
@@ -3408,7 +3411,10 @@
 		__disk_stat_inc(disk, ios[rw]);
 		__disk_stat_add(disk, ticks[rw], duration);
 		disk_round_stats(disk);
-		disk->in_flight--;
+		if (disk->in_flight > 0)
+		    disk->in_flight--;
+		else
+		    printk("end_that_request_last: assert(disk->in_flight) failed\n");
 	}
 	if (req->end_io)
 		req->end_io(req, error);
Changelog:
 - ...


 This is a feature patch that implements 'proactive raid5 disk
replacement' (http://www.arctic.org/~dean/raid-wishlist.html),
which can help a lot on large raid5 arrays built from cheap sata
drives where the IO traffic is so heavy that a daily media scan of
the disks isn't possible.
 A typical breakdown situation is when a drive gets kicked from the
array due to a bad block; I replace it, but the resync fails because
another 2-3 disks have hidden bad blocks too. In this situation I have
to save the disks with dd and rebuild the bad blocks with a userspace
tool (by hand), while the site is down for hours. This patch
tries to provide a solution for this problem; its two main features are:
 1. Don't kick a drive on a read error, because it's possible that
99.99% of it is usable and it will help (to serve and to save data) if
another drive in the same array shows bad sectors - Neil's new
(experimental) sector-rewrite feature is included, so the first step is
always an attempt to rewrite the bad sector.
 2. Allow mirroring a partially failed drive to a spare _online_, and
replace the source of the mirror with the spare when it's done. Bad
blocks aren't a problem unless the same stripe is damaged on two
disks, which is a rare case. This way it's possible to fix an array
with partially failed drives without data loss and without downtime.
In other words, you never have to degrade the array for a disk change;
you can do it while the array is in an optimal state.

 A per-device bad block cache is implemented to speed up arrays with
partially failed drives (replies from those are often slow). It also
helps identify badly damaged drives based on the number of bad blocks,
and an action can be taken if the count steps over a user-defined
threshold (see /proc/sys/dev/raid/badblock_tolerance). Rewriting a bad
block deletes its entry from the cache.
 Performance is affected only a little if there are no (or few)
registered bad blocks, but over a million entries could currently be a
problem.
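The cache in the patch below is a per-device hash table keyed by stripe sector, with a fault counter per entry. A simplified, standalone C model of that bookkeeping (the types and names here are illustrative, not the patch's actual structures):

```c
#include <stdlib.h>

/* toy stand-in for the patch's per-device badblock hash:
 * one bucket array keyed by sector, a fault count per entry */
#define NR_BUCKETS 256

struct bb {
	struct bb *next;
	unsigned long long sector;
	int n;			/* number of faults seen on this sector */
};

static struct bb *buckets[NR_BUCKETS];

static struct bb *bb_find(unsigned long long sector)
{
	struct bb *b;

	for (b = buckets[sector % NR_BUCKETS]; b; b = b->next)
		if (b->sector == sector)
			return b;
	return NULL;
}

/* record one more fault on a sector, creating the entry on first use */
static void bb_fault(unsigned long long sector)
{
	struct bb *b = bb_find(sector);

	if (!b) {
		b = calloc(1, sizeof(*b));
		b->sector = sector;
		b->next = buckets[sector % NR_BUCKETS];
		buckets[sector % NR_BUCKETS] = b;
	}
	b->n++;
}

/* count sectors whose fault count reached max_rewrites; the patch
 * fails the drive when this count exceeds badblock_tolerance */
static int bb_count(int max_rewrites)
{
	struct bb *b;
	int i, count = 0;

	for (i = 0; i < NR_BUCKETS; i++)
		for (b = buckets[i]; b; b = b->next)
			if (b->n >= max_rewrites)
				count++;
	return count;
}
```

The real implementation (grow_badblocks/store_badblock/count_badblocks below) does the same thing with a slab cache and the conf->device_lock spinlock around every list operation.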

 Some words about error handling: the first big change is that you can
now use an external error handler, which means a user-space script is
called by the kernel to handle the situation. The common method in
this script is to call 'mdadm' and choose a return value (see below).
This is useful if, for example, you have one spare drive for two
arrays; a script can handle that nicely. If the script fails to run (or
doesn't exist), a default algorithm applies; its main guidelines are:
 - a "disk fail" means that it's oversteps the 'badblock threshold' or
    failed on write
 - if a drive fails in an optimal array and there's no spare the
    disk will be kicked from the array
 - if the drive fails in degraded array the drive _won't_ be kicked.
    processes gets read/write error if data is needed from the
    damaged sectors. if you want the old behavior use an external
    error handler
 - if drive fails and there's a spare then the proactive mirroring
    begins to the spare. the failing drive won't be kicked until
    the mirror has not been done
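The default rules above can be summarised as a small decision function (a hypothetical model of the policy only; none of these names come from the patch):

```c
/* toy model of the default error-handling policy described above;
 * all identifiers here are illustrative, not the patch's own */
enum md_action { KICK_DRIVE, KEEP_IN_SYNC, START_MIRRORING };

static enum md_action default_policy(int degraded, int have_spare)
{
	if (have_spare)
		return START_MIRRORING;	/* mirror the failing drive to the spare;
					 * don't kick it until the mirror is done */
	if (degraded)
		return KEEP_IN_SYNC;	/* never kick from a degraded array */
	return KICK_DRIVE;		/* optimal array, no spare: old behaviour */
}
```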

You should put your external error handler script at
"/sbin/mdevent"; it gets the following arguments:
 1st: name of the md array (eg.: "md0")
 2nd: kind of fail event as a string, currently always "drivefail"
 3rd: name of the drive (major/minor numbers might be better; currently
			 you can translate to those via /proc/partitions)

Let's see how you can handle some situations from the script:
  array is optimal, a disk fails:
    you want to.. fail that drive and add a spare for normal rebuilding
	mdadm -f /dev/$2 /dev/$3
	mdadm -a /dev/$2 /dev/my_spare1
	exit 0
    ..start proactive mirroring of that disk
	mdadm -a /dev/$2 /dev/my_spare1
	exit 0
    ..keep it on and reset the badblock cache
	exit 1
    ..just keep it in sync
	exit 0
    ..let the default action
	exit 2
    Notice that when the proactive mirroring is done, the spare won't
    replace the source drive automatically; you should do it by hand
    or via a scheduled task. You get a last chance to re-think it.
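The exit-code convention used in the examples above can be summarised as a small C sketch (the string labels and the treatment of unknown codes are my assumptions; only codes 0, 1 and 2 are defined above):

```c
#include <string.h>

/* maps /sbin/mdevent exit codes to the behaviour described above */
static const char *mdevent_result(int exit_code)
{
	switch (exit_code) {
	case 0:
		return "keep in sync";	/* the script handled everything */
	case 1:
		return "keep in sync and reset badblock cache";
	case 2:
		return "run the default action";
	default:
		/* assumption: treat any undefined code as "default action" */
		return "run the default action";
	}
}
```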


 Well, it's better if you know: it's an ugly hack, I'm not a kernel
guru, but I love the idea and now I can't live without it on my own
servers (so it works for me). I hope somebody will implement this
feature in a much nicer adaptation some day; I'll try to maintain
this patch till then.


 (raid6 could be another solution to this problem, but that's the
 big fat evil in my eyes ;)


Use:

1. Patch the kernel; this one is against 2.6.14.
2. Type:

# make the drives
mdadm -B -n1 -l faulty -c4 /dev/md/1 /dev/rd/0
mdadm -B -n1 -l faulty -c4 /dev/md/2 /dev/rd/1
mdadm -B -n1 -l faulty -c4 /dev/md/3 /dev/rd/2

# make the array
mdadm -C -n3 -l5 /dev/md/0 /dev/md/1 /dev/md/2 /dev/md/3

# .. wait for sync ..

# grow bad blocks as ma*tor does :)
mdadm --grow -l faulty -p rp454 /dev/md/1
mdadm --grow -l faulty -p rp738 /dev/md/2

# add a spare
mdadm -a /dev/md/0 /dev/rd/4

# -> fail a drive, sync begins <-
#  md/1 will not be marked as failed (this is the point), but
#  if you want to, you can issue this command again
mdadm -f /dev/md/0 /dev/md/1

# kernel:
#  resync from md1 to spare ram4
#  added spare for active resync

# .. watch the read errors from md[12] while the sync goes on!
# feel free to stress the md at this time, mkfs, dd, badblocks, etc

# kernel:
#  raid5_spare_active: 3 in_sync 3->0
# /proc/mdstat:
#  md0 : active raid5 ram4[0] md3[2] md2[1] md1[0]
# -> ram4 and md1 have the same id, which means the spare is a complete
#	mirror; if you stop the array you can assemble it with ram4
#	instead of md1, the superblock is the same on both

# check the mirror (stop write stress if any)
mdadm --grow -l faulty -p none /dev/md/1
cmp /dev/md/1 /dev/rd/4

# hot-replace the mirrored (partially failed) device with the active
#  spare (yes, mark it as failed again; if there's a syncing or
#  synced 'active spare', -f really fails the device or replaces
#  it with the synced spare)
mdadm -f /dev/md/0 /dev/md/1

# kernel:
#  replace md1 with in_sync active spare ram4

# and voila!
# /proc/mdstat:
#  md0 : active raid5 ram4[0] md3[2] md2[1]

--- linux/include/linux/sysctl.h.orig	2005-11-08 14:41:06.000000000 +0100
+++ linux/include/linux/sysctl.h	2005-11-09 20:08:51.000000000 +0100
@@ -807,7 +807,9 @@
 /* /proc/sys/dev/raid */
 enum {
 	DEV_RAID_SPEED_LIMIT_MIN=1,
-	DEV_RAID_SPEED_LIMIT_MAX=2
+	DEV_RAID_SPEED_LIMIT_MAX=2,
+	DEV_RAID_BADBLOCK_TOLERANCE=3,
+        DEV_RAID_MAX_REWRITES=4
 };
 
 /* /proc/sys/dev/parport/default */
--- linux/include/linux/raid/md_k.h.orig	2005-10-28 02:02:08.000000000 +0200
+++ linux/include/linux/raid/md_k.h	2005-11-09 20:06:02.000000000 +0100
@@ -173,6 +173,11 @@
 	char				uuid[16];
 
 	struct mdk_thread_s		*thread;	/* management thread */
+	struct eeh_data {
+		int			failed_num;	/* drive #, pass to raid5_add_disk()
+							    from handle_stripe() */
+	} eeh_data;
+
 	struct mdk_thread_s		*sync_thread;	/* doing resync or reconstruct */
 	sector_t			curr_resync;	/* blocks scheduled */
 	unsigned long			resync_mark;	/* a recent timestamp */
--- linux/include/linux/raid/raid5.h.orig	2005-11-08 18:26:48.000000000 +0100
+++ linux/include/linux/raid/raid5.h	2005-11-09 22:27:58.000000000 +0100
@@ -156,6 +156,7 @@
 #define	R5_Overlap	7	/* There is a pending overlapping request on this block */
 #define	R5_ReadError	8	/* seen a read error here recently */
 #define	R5_ReWrite	9	/* have tried to over-write the readerror */
+#define	R5_HardReadErr	10	/* rewrite failed, put into badblocks list */
 
 /*
  * Write method
@@ -200,8 +201,17 @@
  */
  
 
+struct badblock {
+	struct badblock		*hash_next, **hash_pprev; /* hash pointers */
+	sector_t		sector; /* stripe # */
+	int			n;	/* # of faults */
+};
+
 struct disk_info {
 	mdk_rdev_t	*rdev;
+	struct badblock **badblock_hashtbl; /* list of known badblocks */
+	char		cache_name[20];
+	kmem_cache_t	*slab_cache; /* badblock db */
 };
 
 struct raid5_private_data {
@@ -238,6 +248,8 @@
 	int			inactive_blocked;	/* release of inactive stripes blocked,
 							 * waiting for 25% to be free
 							 */        
+	int			mirrorit; /* source for active spare resync */
+
 	spinlock_t		device_lock;
 	struct disk_info	disks[0];
 };
--- linux/drivers/md/md.c.orig	2005-10-28 02:02:08.000000000 +0200
+++ linux/drivers/md/md.c	2005-11-09 20:18:39.000000000 +0100
@@ -84,6 +84,11 @@
  * or /sys/block/mdX/md/sync_speed_{min,max}
  */
 
+/* the drive'll be marked failed over this threshold. measure is block. */
+int sysctl_badblock_tolerance = 10000;
+int sysctl_max_rewrites = 10;
+
+
 static int sysctl_speed_limit_min = 1000;
 static int sysctl_speed_limit_max = 200000;
 static inline int speed_min(mddev_t *mddev)
@@ -117,6 +122,22 @@
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec,
 	},
+	{
+		.ctl_name	= DEV_RAID_BADBLOCK_TOLERANCE,
+		.procname	= "badblock_tolerance",
+		.data		= &sysctl_badblock_tolerance,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
+		.ctl_name	= DEV_RAID_MAX_REWRITES,
+		.procname	= "max_rewrites",
+		.data		= &sysctl_max_rewrites,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
 	{ .ctl_name = 0 }
 };
 
@@ -5042,6 +5063,9 @@
 EXPORT_SYMBOL(md_wakeup_thread);
 EXPORT_SYMBOL(md_print_devices);
 EXPORT_SYMBOL(md_check_recovery);
+EXPORT_SYMBOL(kick_rdev_from_array);	// fixme
+EXPORT_SYMBOL(sysctl_badblock_tolerance);
+EXPORT_SYMBOL(sysctl_max_rewrites);
 MODULE_LICENSE("GPL");
 MODULE_ALIAS("md");
 MODULE_ALIAS_BLOCKDEV_MAJOR(MD_MAJOR);
--- linux/drivers/md/raid5.c.orig	2005-11-08 18:26:48.000000000 +0100
+++ linux/drivers/md/raid5.c	2005-11-10 02:32:52.000000000 +0100
@@ -40,6 +40,19 @@
 
 #define stripe_hash(conf, sect)	(&((conf)->stripe_hashtbl[((sect) >> STRIPE_SHIFT) & HASH_MASK]))
 
+ /*
+ * per-device badblock cache
+ */
+
+#define	HASH_PAGES_ORDER	0
+#define	BB_SHIFT		(PAGE_SHIFT/*12*/ - 9)
+#define	BB_HASH_PAGES		1
+#define	BB_NR_HASH		(BB_HASH_PAGES * PAGE_SIZE / sizeof(struct badblock *))
+#define	BB_HASH_MASK		(BB_NR_HASH - 1)
+
+#define	bb_hash(disk, sect)	((disk)->badblock_hashtbl[((sect) >> BB_SHIFT) & BB_HASH_MASK])
+#define	bb_hashnr(sect)		(((sect) >> BB_SHIFT) & BB_HASH_MASK)
+
 /* bio's attached to a stripe+device for I/O are linked together in bi_sector
  * order without overlap.  There may be several bio's per stripe+device, and
  * a bio could span several devices.
@@ -61,13 +74,165 @@
 # define CHECK_DEVLOCK()
 #endif
 
-#define PRINTK(x...) ((void)(RAID5_DEBUG && printk(x)))
+#define PRINTK(x...) ((void)(RAID5_DEBUG && printk(KERN_DEBUG x)))
 #if RAID5_DEBUG
 #define inline
 #define __inline__
 #endif
 
 static void print_raid5_conf (raid5_conf_t *conf);
+extern int sysctl_badblock_tolerance;
+extern int sysctl_max_rewrites;
+
+
+static void bb_insert_hash(struct disk_info *disk, struct badblock *bb)
+{
+	struct badblock **bbp = &bb_hash(disk, bb->sector);
+
+	/*printk("bb_insert_hash(), sector %llu hashnr %lu\n", (unsigned long long)bb->sector,
+		bb_hashnr(bb->sector));*/
+
+	if ((bb->hash_next = *bbp) != NULL)
+		(*bbp)->hash_pprev = &bb->hash_next;
+	*bbp = bb;	
+	bb->hash_pprev = bbp;
+}
+
+static void bb_remove_hash(struct badblock *bb)
+{
+	/*printk("remove_hash(), sector %llu hashnr %lu\n", (unsigned long long)bb->sector,
+		bb_hashnr(bb->sector));*/
+
+	if (bb->hash_pprev) {
+		if (bb->hash_next)
+			bb->hash_next->hash_pprev = bb->hash_pprev;
+		*bb->hash_pprev = bb->hash_next;
+		bb->hash_pprev = NULL;
+	}
+}
+
+static struct badblock *__find_badblock(struct disk_info *disk, sector_t sector, int n)
+{
+	struct badblock *bb;
+
+	for (bb = bb_hash(disk, sector); bb; bb = bb->hash_next)
+		if (bb->sector == sector && bb->n >= n)
+			return bb;
+	return NULL;
+}
+
+static struct badblock *find_badblock(struct disk_info *disk, sector_t sector, int n)
+{
+	raid5_conf_t *conf = (raid5_conf_t *) disk->rdev->mddev->private;
+	struct badblock *bb;
+
+	spin_lock_irq(&conf->device_lock);
+	bb = __find_badblock(disk, sector, n);
+	spin_unlock_irq(&conf->device_lock);
+	return bb;
+}
+
+static unsigned long count_badblocks (struct disk_info *disk, int n)
+{
+	raid5_conf_t *conf = (raid5_conf_t *) disk->rdev->mddev->private;
+	struct badblock *bb;
+	int j;
+	int i = 0;
+
+	spin_lock_irq(&conf->device_lock);
+	for (j = 0; j < BB_NR_HASH; j++) {
+		bb = disk->badblock_hashtbl[j];
+		for (; bb; bb = bb->hash_next)
+			if (bb->n >= n)
+				i++;
+	}
+	spin_unlock_irq(&conf->device_lock);
+
+	return i;
+}
+
+static int grow_badblocks(struct disk_info *disk)
+{
+	char b[BDEVNAME_SIZE];
+	kmem_cache_t *sc;
+
+	/* hash table */
+	if ((disk->badblock_hashtbl = (struct badblock **) __get_free_pages(GFP_ATOMIC, HASH_PAGES_ORDER)) == NULL) {
+	    printk("grow_badblocks: __get_free_pages failed\n");
+	    return 0;
+	}
+	memset(disk->badblock_hashtbl, 0, BB_HASH_PAGES * PAGE_SIZE);
+
+	/* badblocks db */
+	sprintf(disk->cache_name, "raid5/%s_%s_bbc", mdname(disk->rdev->mddev),
+			bdevname(disk->rdev->bdev, b));
+	sc = kmem_cache_create(disk->cache_name,
+			       sizeof(struct badblock),
+			       0, 0, NULL, NULL);
+	if (!sc) {
+		printk("grow_badblocks: kmem_cache_create failed\n");
+		return 1;
+	}
+	disk->slab_cache = sc;
+
+	return 0;
+}
+
+static void shrink_badblocks(struct disk_info *disk)
+{
+	struct badblock *bb;
+	int j;
+
+	/* badblocks db */
+	for (j = 0; j < BB_NR_HASH; j++) {
+		bb = disk->badblock_hashtbl[j];
+		for (; bb; bb = bb->hash_next)
+		        kmem_cache_free(disk->slab_cache, bb);
+	}
+	kmem_cache_destroy(disk->slab_cache);
+	disk->slab_cache = NULL;
+
+	/* hash table */
+	free_pages((unsigned long) disk->badblock_hashtbl, HASH_PAGES_ORDER);
+}
+
+static struct badblock *store_badblock(struct disk_info *disk, sector_t sector)
+{
+	struct badblock *bb;
+	raid5_conf_t *conf = (raid5_conf_t *) disk->rdev->mddev->private;
+
+	bb = kmem_cache_alloc(disk->slab_cache, GFP_KERNEL);
+	if (!bb) {
+		printk("store_badblock: kmem_cache_alloc failed\n");
+		return NULL;
+	}
+	memset(bb, 0, sizeof(*bb));
+	bb->sector = sector;
+	bb->n = 0;
+
+	spin_lock_irq(&conf->device_lock);
+	bb_insert_hash(disk, bb);
+	spin_unlock_irq(&conf->device_lock);
+
+	return bb;
+}
+
+static void delete_badblock(struct disk_info *disk, sector_t sector)
+{
+	struct badblock *bb;
+	raid5_conf_t *conf = (raid5_conf_t *) disk->rdev->mddev->private;
+
+	/* delete only from the HardReadErr list */
+	bb = find_badblock(disk, sector, sysctl_max_rewrites);
+	if (!bb)
+		/* reset on write'll call us like an idiot :} */
+		return;
+	spin_lock_irq(&conf->device_lock);
+	bb_remove_hash(bb);
+	kmem_cache_free(disk->slab_cache, bb);
+	spin_unlock_irq(&conf->device_lock);
+}
+
 
 static void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh)
 {
@@ -198,7 +363,7 @@
 	sh->pd_idx = pd_idx;
 	sh->state = 0;
 
-	for (i=disks; i--; ) {
+	for (i=disks+1; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
 
 		if (dev->toread || dev->towrite || dev->written ||
@@ -291,12 +456,12 @@
 	sh = kmem_cache_alloc(conf->slab_cache, GFP_KERNEL);
 	if (!sh)
 		return 0;
-	memset(sh, 0, sizeof(*sh) + (conf->raid_disks-1)*sizeof(struct r5dev));
+	memset(sh, 0, sizeof(*sh) + (conf->raid_disks-1+1)*sizeof(struct r5dev));
 	sh->raid_conf = conf;
 	spin_lock_init(&sh->lock);
 
-	if (grow_buffers(sh, conf->raid_disks)) {
-		shrink_buffers(sh, conf->raid_disks);
+	if (grow_buffers(sh, conf->raid_disks+1)) {
+		shrink_buffers(sh, conf->raid_disks+1);
 		kmem_cache_free(conf->slab_cache, sh);
 		return 0;
 	}
@@ -315,8 +480,10 @@
 
 	sprintf(conf->cache_name, "raid5/%s", mdname(conf->mddev));
 
+	/* +1: we need extra space in the *sh->devs for the 'active spare' to keep
+	    handle_stripe() simple */
 	sc = kmem_cache_create(conf->cache_name, 
-			       sizeof(struct stripe_head)+(devs-1)*sizeof(struct r5dev),
+			       sizeof(struct stripe_head)+(devs-1+1)*sizeof(struct r5dev),
 			       0, 0, NULL, NULL);
 	if (!sc)
 		return 1;
@@ -435,7 +602,22 @@
 		else {
 			clear_bit(R5_ReadError, &sh->dev[i].flags);
 			clear_bit(R5_ReWrite, &sh->dev[i].flags);
-			md_error(conf->mddev, conf->disks[i].rdev);
+
+	    		/*
+			    rule 1: try to keep all disks in_sync even if we've got
+			    unfixable read errors, because the 'active spare' may be
+			    able to rebuild a complete column from partially failed drives
+	    		*/
+			if (test_bit(In_sync, &conf->disks[i].rdev->flags)) {
+			    char b[BDEVNAME_SIZE];
+			    printk(KERN_ALERT
+				    "raid5_end_read_request: Read failure %s on sector %llu (%d) in %s mode\n",
+				    bdevname(conf->disks[i].rdev->bdev, b),
+				    (unsigned long long)sh->sector, atomic_read(&sh->count),
+				    conf->working_disks >= conf->raid_disks ? "optimal" : "degraded");
+			    set_bit(R5_HardReadErr, &sh->dev[i].flags);
+			} else
+			    md_error(conf->mddev, conf->disks[i].rdev);
 		}
 	}
 	rdev_dec_pending(conf->disks[i].rdev, conf->mddev);
@@ -472,14 +654,26 @@
 	PRINTK("end_write_request %llu/%d, count %d, uptodate: %d.\n", 
 		(unsigned long long)sh->sector, i, atomic_read(&sh->count),
 		uptodate);
+	/* sorry
 	if (i == disks) {
 		BUG();
 		return 0;
-	}
+	}*/
 
 	spin_lock_irqsave(&conf->device_lock, flags);
-	if (!uptodate)
+	if (!uptodate) {
+	        char b[BDEVNAME_SIZE];
+		/*  we must fail this drive, because it risks the integrity of
+		    the data if this sector is readable. later, we could check
+		    whether it is readable; if not, we could handle it as a
+		    common badblock. */
+	        printk(KERN_ALERT
+			"raid5_end_write_request: Write failure %s on sector %llu (%d) in %s mode\n",
+		        bdevname(conf->disks[i].rdev->bdev, b),
+			(unsigned long long)sh->sector, atomic_read(&sh->count),
+		        conf->working_disks >= conf->raid_disks ? "optimal" : "degraded");
 		md_error(conf->mddev, conf->disks[i].rdev);
+	}
 
 	rdev_dec_pending(conf->disks[i].rdev, conf->mddev);
 	
@@ -509,33 +703,156 @@
 	dev->req.bi_private = sh;
 
 	dev->flags = 0;
-	if (i != sh->pd_idx)
+	if (i != sh->pd_idx && i < sh->raid_conf->raid_disks)	/* active spare? */
 		dev->sector = compute_blocknr(sh, i);
 }
 
+static int raid5_remove_disk(mddev_t *mddev, int number);
+static int raid5_add_disk(mddev_t *mddev, mdk_rdev_t *rdev);
+/*static*/ void kick_rdev_from_array(mdk_rdev_t * rdev);
 static void error(mddev_t *mddev, mdk_rdev_t *rdev)
 {
 	char b[BDEVNAME_SIZE];
+	char b2[BDEVNAME_SIZE];
 	raid5_conf_t *conf = (raid5_conf_t *) mddev->private;
 	PRINTK("raid5: error called\n");
 
 	if (!test_bit(Faulty, &rdev->flags)) {
-		mddev->sb_dirty = 1;
-		if (test_bit(In_sync, &rdev->flags)) {
-			conf->working_disks--;
-			mddev->degraded++;
-			conf->failed_disks++;
+		int mddisks = 0;
+		mdk_rdev_t *rd;
+		mdk_rdev_t *rdevs = NULL;
+		struct list_head *rtmp;
+		int i;
+
+		ITERATE_RDEV(mddev,rd,rtmp)
+		    {
+			printk("mddev%d: %s\n", mddisks, bdevname(rd->bdev,b));
+			mddisks++;
+		    }
+		for (i = 0; i < mddisks && (rd = conf->disks[i].rdev); i++) {
+			printk("r5dev%d: %s\n", i, bdevname(rd->bdev,b));
+		}
+		ITERATE_RDEV(mddev,rd,rtmp)
+		    {
+			rdevs = rd;
+			break;
+		    }
+printk("II %d %d > %d %d ins:%d %p\n",
+	mddev->raid_disks, mddisks, conf->raid_disks, mddev->degraded,
+	test_bit(In_sync, &rdev->flags), rdevs);
+		if (conf->disks[conf->raid_disks].rdev == rdev
+		    && conf->mirrorit != -1) {
+			/* in_sync, but must be handled specially, don't let 'degraded++' */
+		        printk (KERN_ALERT "active spare has failed %s (in_sync %d)\n",
+				    bdevname(rdev->bdev,b), test_bit(In_sync, &rdev->flags));
+			mddev->sb_dirty = 1;
+			if (test_bit(In_sync, &rdev->flags))
+				rdev->raid_disk = conf->raid_disks; /* me as myself, again ;) */
 			clear_bit(In_sync, &rdev->flags);
-			/*
-			 * if recovery was running, make sure it aborts.
-			 */
-			set_bit(MD_RECOVERY_ERR, &mddev->recovery);
-		}
-		set_bit(Faulty, &rdev->flags);
-		printk (KERN_ALERT
-			"raid5: Disk failure on %s, disabling device."
-			" Operation continuing on %d devices\n",
-			bdevname(rdev->bdev,b), conf->working_disks);
+			set_bit(Faulty, &rdev->flags);
+		        conf->mirrorit = -1;
+		} else if (mddisks > conf->raid_disks && !mddev->degraded
+		    && test_bit(In_sync, &rdev->flags)) {
+		    /* have active spare, array is optimal, removed disk member
+			    of it (but not the active spare) */
+		    if (rdev->raid_disk == conf->mirrorit && conf->disks[conf->raid_disks].rdev) {
+			if (!test_bit(In_sync, &conf->disks[conf->raid_disks].rdev->flags)) {
+			    printk(KERN_ALERT "disk %s failed and active spare isn't in_sync yet, readd as normal spare\n",
+					bdevname(rdev->bdev,b));
+			    conf->mirrorit = -1;
+			    goto letitgo;
+			} else {
+			    int ret;
+
+			    /* hot replace the mirrored drive with the 'active spare'.
+				this is really "hot"; I can't clearly see all the
+				things I have to do here. :}
+				pray. */
+
+			    printk(KERN_ALERT "replace %s with in_sync active spare %s\n",
+				    bdevname(rdev->bdev,b),
+				    bdevname(rdevs->bdev,b2));
+			    clear_bit(In_sync, &rdev->flags);
+			    set_bit(Faulty, &rdev->flags);
+
+			    conf->mirrorit = -1;
+
+			    /* my God, am I sane? */
+			    while ((i = atomic_read(&rdev->nr_pending))) {
+				printk("waiting for disk %d .. %d\n",
+					rdev->raid_disk, i);
+			    }
+			    ret = raid5_remove_disk(mddev, rdev->raid_disk);
+			    if (ret) {
+				printk(KERN_ERR "raid5_remove_disk1: busy?!\n");
+				return;	// should nothing to do
+			    }
+
+			    rd = conf->disks[conf->raid_disks].rdev;
+			    while ((i = atomic_read(&rd->nr_pending))) {
+				printk("waiting for disk %d .. %d\n",
+					conf->raid_disks, i);
+			    }
+			    clear_bit(In_sync, &rd->flags);
+			    ret = raid5_remove_disk(mddev, conf->raid_disks);
+			    if (ret) {
+				printk(KERN_ERR "raid5_remove_disk2: busy?!\n");
+				return;	// ..
+			    }
+
+			    ret = raid5_add_disk(mddev, rd);
+			    if (!ret) {
+				printk(KERN_ERR "raid5_add_disk: no free slot?!\n");
+				return;	// ..
+			    }
+			    set_bit(In_sync, &rd->flags);
+
+			    /* borrowed from hot_remove_disk() */
+			    kick_rdev_from_array(rdev);
+			    mddev->sb_dirty = 1;
+			}
+		    } else {
+			/* in_sync disk failed (!degraded), have a spare, starting
+			    proactive mirroring */
+			if (conf->mirrorit == -1) {
+				printk(KERN_ALERT "resync from %s to spare %s (%d)\n",
+					bdevname(rdev->bdev,b),
+			    		bdevname(rdevs->bdev,b2),
+					conf->raid_disks);
+
+				conf->mirrorit = rdev->raid_disk;
+
+				mddev->degraded++;	/* to call raid5_hot_add_disk(), reset there */
+			} else {
+				printk(KERN_ALERT "proactive mirroring is active, let this device go\n");
+				goto letitgo;
+			}
+		    }
+		} else {
+letitgo:
+		    mddev->sb_dirty = 1;
+		    if (test_bit(In_sync, &rdev->flags)) {
+			    conf->working_disks--;
+			    mddev->degraded++;
+			    conf->failed_disks++;
+			    clear_bit(In_sync, &rdev->flags);
+			    /* error() was not called if the syncing was stopped by IO error */
+			    if (conf->mirrorit != -1 &&
+				!test_bit(In_sync, &conf->disks[conf->raid_disks].rdev->flags)) {
+				    printk(KERN_NOTICE "stop proactive mirroring\n");
+				    conf->mirrorit = -1;
+			    }
+			    /*
+			     * if recovery was running, make sure it aborts.
+			     */
+			    set_bit(MD_RECOVERY_ERR, &mddev->recovery);
+		    }
+		    set_bit(Faulty, &rdev->flags);
+		    printk (KERN_ALERT
+			    "raid5: Disk failure on %s, disabling device."
+			    " Operation continuing on %d devices\n",
+			    bdevname(rdev->bdev,b), conf->working_disks);
+		}
 	}
 }	
 
@@ -940,6 +1257,8 @@
 	int locked=0, uptodate=0, to_read=0, to_write=0, failed=0, written=0;
 	int non_overwrite = 0;
 	int failed_num=0;
+	int aspare=0, asparenum=-1;
+	struct disk_info *asparedev;
 	struct r5dev *dev;
 
 	PRINTK("handling stripe %llu, cnt=%d, pd_idx=%d\n",
@@ -952,10 +1271,18 @@
 
 	syncing = test_bit(STRIPE_SYNCING, &sh->state);
 	/* Now to look around and see what can be done */
+	asparedev = &conf->disks[conf->raid_disks];
+	if (!conf->mddev->degraded && asparedev->rdev && !test_bit(Faulty, &asparedev->rdev->flags)
+		&& conf->mirrorit != -1) {
+	    aspare++;
+	    asparenum = sh->raid_conf->mirrorit;
+	    PRINTK("has aspare (%d)\n", asparenum);
+	}
 
 	rcu_read_lock();
-	for (i=disks; i--; ) {
+	for (i=disks+aspare; i--; ) {
 		mdk_rdev_t *rdev;
+		struct badblock *bb = NULL;
 		dev = &sh->dev[i];
 		clear_bit(R5_Insync, &dev->flags);
 
@@ -997,18 +1324,95 @@
 		}
 		if (dev->written) written++;
 		rdev = rcu_dereference(conf->disks[i].rdev);
+		if (rdev && test_bit(In_sync, &rdev->flags) &&
+		    !syncing &&	/* better to try on syncing.. */
+		    !test_bit(R5_UPTODATE, &dev->flags) &&
+		    !test_bit(R5_LOCKED, &dev->flags)) {
+			/* ..potentially deserved to read, we must check it
+			    checkme: it could be a big performance penalty if called
+				without a good reason! it seems ok for now
+			*/
+			PRINTK("find_badblock %d: %llu\n", i, sh->sector);
+			bb = find_badblock(&conf->disks[i], sh->sector, sysctl_max_rewrites);
+		}
 		if (!rdev || !test_bit(In_sync, &rdev->flags)) {
 			/* The ReadError flag will just be confusing now */
 			clear_bit(R5_ReadError, &dev->flags);
 			clear_bit(R5_ReWrite, &dev->flags);
 		}
 		if (!rdev || !test_bit(In_sync, &rdev->flags)
-		    || test_bit(R5_ReadError, &dev->flags)) {
+		    || test_bit(R5_ReadError, &dev->flags)
+		    || test_bit(R5_HardReadErr, &dev->flags)
+		    || bb) {
+			if (rdev && test_bit(In_sync, &rdev->flags)
+			    && !bb && (test_bit(R5_HardReadErr, &dev->flags)
+				       || test_bit(R5_ReadError, &dev->flags))
+			    ) {
+				/* take an action only if it's a _new_ bad block
+				    and not while proactive mirroring is running */
+
+				if (!aspare || (aspare && test_bit(In_sync, &asparedev->rdev->flags)/*asparenum != i*/)) {
+				    struct badblock *_bb;
+				    /* if aspare is syncing we shouldn't register new
+					bad blocks, after the sync this disk will
+					be kicked anyway */
+
+				    if (test_bit(R5_HardReadErr, &dev->flags)) {
+					printk("R5_HardReadErr find_badblock %d,0: %llu\n", i, sh->sector);
+					/* it is possible that it's already on the list
+					    due to a soft error */
+					_bb = find_badblock(&conf->disks[i], sh->sector, 0);
+					if (!_bb)
+					    _bb = store_badblock(&conf->disks[i], sh->sector);
+
+					if (_bb)
+					    _bb->n = 100;	// fixme
+
+					// what if there was another IO meanwhile and we get that one here first??
+					// the same issue exists with rewrites
+					clear_bit(R5_HardReadErr, &dev->flags);
+				    } else
+				    /* FIXME: this shouldn't be here, because it can
+					be called several times for one error and the
+					count gets corrupted; it belongs in _end_read */
+				    if (test_bit(R5_ReadError, &dev->flags) &&
+					!test_bit(R5_ReWrite, &dev->flags)) {
+					/* when it exceeds the limit, the ReRead after a
+					    sector rewrite makes find_badblock find the
+					    given sector, and the HardReadErr flag flips
+					    via 'NOT corrected', which is fine after all */
+					printk("R5_ReadError find_badblock %d,0: %llu\n", i, sh->sector);
+					_bb = find_badblock(&conf->disks[i], sh->sector, 0);
+					if (!_bb)
+					    _bb = store_badblock(&conf->disks[i], sh->sector);
+
+					if (_bb)
+					    _bb->n++;
+				    }
+
+				    if (count_badblocks(&conf->disks[i], sysctl_max_rewrites) >= sysctl_badblock_tolerance) {
+					char b[BDEVNAME_SIZE];
+
+					printk(KERN_ALERT "too many badblocks (%lu) on device %s [%d]\n",
+						    count_badblocks(&conf->disks[i], sysctl_max_rewrites) + 1,
+						    bdevname(conf->disks[i].rdev->bdev, b),
+						    atomic_read(&rdev->nr_pending));
+
+					/* used by raid5_add_disk */
+					conf->mddev->eeh_data.failed_num = i;
+					md_error(conf->mddev, conf->disks[i].rdev);
+				    }
+				}
+			}
 			failed++;
 			failed_num = i;
+			PRINTK("device %d failed for this stripe r%p w%p\n", i, dev->toread, dev->towrite);
 		} else
 			set_bit(R5_Insync, &dev->flags);
 	}
+	if (aspare && failed > 1)
+	    failed--;	/* failed = 1 means "all ok" if we have an aspare; this is
+			    the simplest way to do our work */
 	rcu_read_unlock();
 	PRINTK("locked=%d uptodate=%d to_read=%d"
 		" to_write=%d failed=%d failed_num=%d\n",
@@ -1066,7 +1470,7 @@
 
 			/* fail any reads if this device is non-operational */
 			if (!test_bit(R5_Insync, &sh->dev[i].flags) ||
-			    test_bit(R5_ReadError, &sh->dev[i].flags)) {
+			    test_bit(R5_ReadError, &sh->dev[i].flags)) { // is this still meaningful??
 				bi = sh->dev[i].toread;
 				sh->dev[i].toread = NULL;
 				if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
@@ -1089,6 +1493,13 @@
 		}
 	}
 	if (failed > 1 && syncing) {
+		printk(KERN_ALERT "rebuild stopped by IO error, marking the spare failed\n");
+		//set_bit(Faulty, &conf->disks[failed_num].rdev->flags);
+		/* md_done_sync, wakeup():
+		   -> mdX_sync::md_do_sync() (flags & MD_RECOVERY_ERR)
+		    -> raid5::md_check_recovery()
+		     -> raid5_spare_active
+		*/
 		md_done_sync(conf->mddev, STRIPE_SECTORS,0);
 		clear_bit(STRIPE_SYNCING, &sh->state);
 		syncing = 0;
@@ -1265,6 +1676,28 @@
 					PRINTK("Writing block %d\n", i);
 					locked++;
 					set_bit(R5_Wantwrite, &sh->dev[i].flags);
+					if (aspare && i == asparenum) {
+					    char *ps, *pd;
+
+					    /* mirroring this new block */
+					    PRINTK("Writing to aspare too %d->%d\n",
+							i, conf->raid_disks);
+					    /*if (test_bit(R5_LOCKED, &sh->dev[conf->raid_disks].flags)) {
+						printk("bazmeg, ez lokkolt1!!!\n");
+						printk("damn, this one is locked1!!!\n");
+					    ps = page_address(sh->dev[i].page);
+					    pd = page_address(sh->dev[conf->raid_disks].page);
+					    /* better idea? */
+					    memcpy(pd, ps, STRIPE_SIZE);
+					    set_bit(R5_LOCKED, &sh->dev[conf->raid_disks].flags);
+					    set_bit(R5_Wantwrite, &sh->dev[conf->raid_disks].flags);
+					}
+					/* it already had its chance on rewrites:
+					if (conf->disks[i].rdev
+					    && test_bit(In_sync, &conf->disks[i].rdev->flags)) {
+					    PRINTK("reset badblock on %d: %llu\n", i, sh->sector);
+					    delete_badblock(&conf->disks[i], sh->sector);
+					}*/
 					if (!test_bit(R5_Insync, &sh->dev[i].flags)
 					    || (i==sh->pd_idx && failed == 0))
 						set_bit(STRIPE_INSYNC, &sh->state);
@@ -1310,15 +1743,40 @@
 			/* either failed parity check, or recovery is happening */
 			if (failed==0)
 				failed_num = sh->pd_idx;
+			if (aspare)
+			    failed_num = asparenum;	// source of mirror
 			dev = &sh->dev[failed_num];
+			if (aspare && !test_bit(R5_UPTODATE, &dev->flags)) {
+			    printk("aspare: compute block\n");
+  			    compute_block(sh, failed_num);
+			    uptodate++;		/* compute_block made it uptodate; is this right? */
+			}
 			BUG_ON(!test_bit(R5_UPTODATE, &dev->flags));
 			BUG_ON(uptodate != disks);
 
+			if (aspare) {
+			    char *ps, *pd;
+
+			    ps = page_address(sh->dev[failed_num].page);
+			    pd = page_address(sh->dev[conf->raid_disks].page);
+			    memcpy(pd, ps, STRIPE_SIZE);
+			    PRINTK("R5_Wantwrite to aspare, uptodate: %d %p->%p\n",
+					uptodate, ps, pd);
+			    /*if (test_bit(R5_LOCKED, &sh->dev[conf->raid_disks].flags)) {
+				printk("damn, this one is locked (2)!\n");
+			    }*/
+
+			    failed_num = conf->raid_disks;
+			    dev = &sh->dev[failed_num];
+			}
 			set_bit(R5_LOCKED, &dev->flags);
 			set_bit(R5_Wantwrite, &dev->flags);
 			clear_bit(STRIPE_DEGRADED, &sh->state);
 			locked++;
 			set_bit(STRIPE_INSYNC, &sh->state);
+			/* !in_sync..
+			printk("reset badblock on %d: %llu\n", failed_num, sh->sector);
+			delete_badblock(&conf->disks[failed_num], sh->sector);*/
 		}
 	}
 	if (syncing && locked == 0 && test_bit(STRIPE_INSYNC, &sh->state)) {
@@ -1356,7 +1814,7 @@
 		bi->bi_size = 0;
 		bi->bi_end_io(bi, bytes, 0);
 	}
-	for (i=disks; i-- ;) {
+	for (i=disks+aspare; i-- ;) {
 		int rw;
 		struct bio *bi;
 		mdk_rdev_t *rdev;
@@ -1740,6 +2198,102 @@
 	PRINTK("--- raid5d inactive\n");
 }
 
+
+static ssize_t
+raid5_show_badblocks(mddev_t *mddev, char *page)
+{
+	raid5_conf_t *conf = mddev_to_conf(mddev);
+	int i, j;
+	struct badblock *bb;
+	int k = 0;
+	int l = 0;
+	if (!conf)
+		return 0;
+
+	spin_lock_irq(&conf->device_lock);	/* coarse, but fine for a debug-only interface */
+	//seq_printf (seq, "\n      known bad sectors on active devices:");
+	for (i = conf->raid_disks; i--; ) {
+	    if (conf->disks[i].rdev) {
+		//seq_printf (seq, "\n      %s", bdevname(conf->disks[i].rdev->bdev, b));
+		for (j = 0; j < BB_NR_HASH; j++) {
+		    bb = conf->disks[i].badblock_hashtbl[j];
+		    for (; bb; bb = bb->hash_next)
+			/*seq_printf (seq, " %llu-%llu(%d)", bb->sector, bb->sector + (unsigned long long)(STRIPE_SIZE / 512) - 1,
+					bb->n);*/
+			if (bb->n >= 100)
+			    k++;
+			else
+			    l++;
+		}
+	    }
+	}
+	spin_unlock_irq(&conf->device_lock);
+
+	return sprintf(page, "%d %d\n", k, l);
+}
+
+static ssize_t
+raid5_reset_badblocks(mddev_t *mddev, const char *page, size_t len)
+{
+	raid5_conf_t *conf = mddev_to_conf(mddev);
+	char *end;
+	int new;
+	int i, j;
+	struct badblock *bb;
+	if (len >= PAGE_SIZE)
+		return -EINVAL;
+	if (!conf)
+		return -ENODEV;
+
+	new = simple_strtoul(page, &end, 10);
+	if (!*page || (*end && *end != '\n') )
+		return -EINVAL;
+
+	if (new) {
+	    spin_lock_irq(&conf->device_lock);	/* coarse, but fine for a debug-only interface */
+	    for (i = conf->raid_disks; i--; ) {
+		if (conf->disks[i].rdev) {
+		    for (j = 0; j < BB_NR_HASH; j++) {
+			struct badblock *bbprev = NULL;
+
+			bb = conf->disks[i].badblock_hashtbl[j];
+		        for (; bb; bb = bb->hash_next) {
+			    if (bbprev)
+				kmem_cache_free(conf->disks[i].slab_cache, bbprev);
+			    bb_remove_hash(bb);
+			    bbprev = bb;
+			}
+
+			if (bbprev)
+			    kmem_cache_free(conf->disks[i].slab_cache, bbprev);
+		    }
+		}
+	    }
+	    spin_unlock_irq(&conf->device_lock);
+	}
+
+	return len;
+}
+
+static struct md_sysfs_entry
+raid5_badblocks = __ATTR(badblocks, S_IRUGO | S_IWUSR,
+			 raid5_show_badblocks,
+			 raid5_reset_badblocks);
+
+
+static ssize_t
+stripe_cache_active_show(mddev_t *mddev, char *page)
+{
+	raid5_conf_t *conf = mddev_to_conf(mddev);
+	if (conf)
+		return sprintf(page, "%d\n", atomic_read(&conf->active_stripes));
+	else
+		return 0;
+}
+
+static struct md_sysfs_entry
+raid5_stripecache_active = __ATTR_RO(stripe_cache_active);
+
 static ssize_t
 raid5_show_stripe_cache_size(mddev_t *mddev, char *page)
 {
@@ -1785,22 +2339,10 @@
 				raid5_show_stripe_cache_size,
 				raid5_store_stripe_cache_size);
 
-static ssize_t
-stripe_cache_active_show(mddev_t *mddev, char *page)
-{
-	raid5_conf_t *conf = mddev_to_conf(mddev);
-	if (conf)
-		return sprintf(page, "%d\n", atomic_read(&conf->active_stripes));
-	else
-		return 0;
-}
-
-static struct md_sysfs_entry
-raid5_stripecache_active = __ATTR_RO(stripe_cache_active);
-
 static struct attribute *raid5_attrs[] =  {
 	&raid5_stripecache_size.attr,
 	&raid5_stripecache_active.attr,
+	&raid5_badblocks.attr,
 	NULL,
 };
 static struct attribute_group raid5_attrs_group = {
@@ -1823,7 +2365,7 @@
 	}
 
 	mddev->private = kzalloc(sizeof (raid5_conf_t)
-				 + mddev->raid_disks * sizeof(struct disk_info),
+				 + (mddev->raid_disks + 1) * sizeof(struct disk_info),
 				 GFP_KERNEL);
 	if ((conf = mddev->private) == NULL)
 		goto abort;
@@ -1854,6 +2396,8 @@
 
 		disk->rdev = rdev;
 
+		grow_badblocks(disk);
+
 		if (test_bit(In_sync, &rdev->flags)) {
 			char b[BDEVNAME_SIZE];
 			printk(KERN_INFO "raid5: device %s operational as raid"
@@ -1864,6 +2408,7 @@
 	}
 
 	conf->raid_disks = mddev->raid_disks;
+	conf->mirrorit = -1;
 	/*
 	 * 0 for a fully functional array, 1 for a degraded array.
 	 */
@@ -1921,7 +2466,7 @@
 		}
 	}
 	memory = conf->max_nr_stripes * (sizeof(struct stripe_head) +
-		 conf->raid_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;
+		 (conf->raid_disks+1) * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;
 	if (grow_stripes(conf, conf->max_nr_stripes)) {
 		printk(KERN_ERR 
 			"raid5: couldn't allocate %dkB for buffers\n", memory);
@@ -1979,10 +2524,14 @@
 static int stop(mddev_t *mddev)
 {
 	raid5_conf_t *conf = (raid5_conf_t *) mddev->private;
+	int i;
 
 	md_unregister_thread(mddev->thread);
 	mddev->thread = NULL;
 	shrink_stripes(conf);
+	for (i = conf->raid_disks; i--; )
+		if (conf->disks[i].rdev && test_bit(In_sync, &conf->disks[i].rdev->flags))
+			shrink_badblocks(&conf->disks[i]);
 	kfree(conf->stripe_hashtbl);
 	blk_sync_queue(mddev->queue); /* the unplug fn references 'conf'*/
 	sysfs_remove_group(&mddev->kobj, &raid5_attrs_group);
@@ -2030,6 +2579,11 @@
 {
 	raid5_conf_t *conf = (raid5_conf_t *) mddev->private;
 	int i;
+#if RAID5_DEBUG || 1
+	int j;
+	char b[BDEVNAME_SIZE];
+	struct badblock *bb;
+#endif
 
 	seq_printf (seq, " level %d, %dk chunk, algorithm %d", mddev->level, mddev->chunk_size >> 10, mddev->layout);
 	seq_printf (seq, " [%d/%d] [", conf->raid_disks, conf->working_disks);
@@ -2043,6 +2597,23 @@
 	seq_printf (seq, "<"#x":%d>", atomic_read(&conf->x))
 	printall(conf);
 #endif
+
+#if RAID5_DEBUG || 1
+	spin_lock_irq(&conf->device_lock);	/* coarse, but fine for a debug-only interface */
+	seq_printf (seq, "\n      known bad sectors on active devices:");
+	for (i = conf->raid_disks; i--; ) {
+	    if (conf->disks[i].rdev) {
+		seq_printf (seq, "\n      %s", bdevname(conf->disks[i].rdev->bdev, b));
+		for (j = 0; j < BB_NR_HASH; j++) {
+		    bb = conf->disks[i].badblock_hashtbl[j];
+		    for (; bb; bb = bb->hash_next)
+			seq_printf (seq, " %llu-%llu(%d)", bb->sector, bb->sector + (unsigned long long)(STRIPE_SIZE / 512) - 1,
+					bb->n);
+		}
+	    }
+	}
+	spin_unlock_irq(&conf->device_lock);
+#endif
 }
 
 static void print_raid5_conf (raid5_conf_t *conf)
@@ -2085,6 +2656,17 @@
 			set_bit(In_sync, &tmp->rdev->flags);
 		}
 	}
+	tmp = conf->disks + i;
+	if (tmp->rdev && !test_bit(Faulty, &tmp->rdev->flags)
+			&& !test_bit(In_sync, &tmp->rdev->flags)) {
+	    set_bit(In_sync, &tmp->rdev->flags);
+
+	    printk(KERN_NOTICE "raid5_spare_active: %d in_sync %d->%d\n",
+			i, tmp->rdev->raid_disk, conf->mirrorit);
+
+	    /* scary..? :} */
+	    tmp->rdev->raid_disk = conf->mirrorit;
+	}
 	print_raid5_conf(conf);
 	return 0;
 }
@@ -2098,6 +2680,7 @@
 
 	print_raid5_conf(conf);
 	rdev = p->rdev;
+printk("raid5_remove_disk %d\n", number);
 	if (rdev) {
 		if (test_bit(In_sync, &rdev->flags) ||
 		    atomic_read(&rdev->nr_pending)) {
@@ -2111,6 +2694,17 @@
 			err = -EBUSY;
 			p->rdev = rdev;
 		}
+		if (!err) {
+			shrink_badblocks(p);
+
+			/* stopped by IO error.. */
+			if (conf->mirrorit != -1
+			    && conf->disks[conf->raid_disks].rdev == NULL) {
+				printk(KERN_INFO "raid5_remove_disk: IO error on proactive mirroring of %d!\n",
+					    conf->mirrorit);
+				conf->mirrorit = -1;
+			}
+		}
 	}
 abort:
 
@@ -2142,6 +2736,30 @@
 			rcu_assign_pointer(p->rdev, rdev);
 			break;
 		}
+
+	if (!found && conf->disks[disk].rdev == NULL) {
+	    char b[BDEVNAME_SIZE];
+
+	    /* the array is optimal, so this should be the 'active spare' added by eeh_thread/error() */
+	    conf->disks[disk].rdev = rdev;
+	    clear_bit(In_sync, &rdev->flags);
+	    rdev->raid_disk = conf->raid_disks;
+	    conf->fullsync = 1;
+
+	    if (mddev->degraded) /* if we're here and it's true, we're called after error() */
+		    mddev->degraded--;
+	    else
+		    conf->mirrorit = mddev->eeh_data.failed_num;
+	    found = 1;
+
+	    printk(KERN_NOTICE "added spare for proactive replacement of %s\n",
+		    bdevname(conf->disks[conf->mirrorit].rdev->bdev, b));
+	}
+	if (found)
+		grow_badblocks(&conf->disks[disk]);
+	printk(KERN_INFO "raid5_add_disk: %d (%d) in_sync: %d\n", disk, found,
+		    found ? test_bit(In_sync, &rdev->flags) : -1);
+
 	print_raid5_conf(conf);
 	return found;
 }
