*biiiiig* thanks to all developers of bitmap-based raid5 resyncing and bad block rewriting, both of them are really great features! :) I ported my "proactive thing" to the new kernel; it lives nicely with the new features now. Some bugs were hunted down in the last month, and it's quite stable for me. If you want to give it a try you must apply Neil's badblock rewriting patch first, attached also.

*readme*

This is a feature patch that implements 'proactive raid5 disk replacement' (http://www.arctic.org/~dean/raid-wishlist.html), which can help a lot on large raid5 arrays built from cheap SATA drives where the IO traffic is so heavy that a daily media scan of the disks isn't possible.

A typical breakdown situation: a drive gets kicked from the array due to a bad block, I replace it, but the resync fails because another 2-3 disks have hidden bad blocks too. In that situation I have to save the disks with dd and rebuild the bad blocks with a userspace tool (by hand), and meanwhile the site is down for hours. This patch tries to give a solution to this problem. Its two main features are:

1. Don't kick a drive on a read error, because it is possible that 99.99% of it is still usable and it will help (to serve and to save data) if another drive shows bad sectors in the same array. Neil's new (experimental) sector rewrite feature is included; the first step is always an attempt to rewrite the bad sector.

2. Allow mirroring a partially failed drive to a spare _online_, and replace the source of the mirror with the spare when it's done. Bad blocks aren't a problem unless the same stripe is damaged on two disks, which is a rare case. This way it is possible to fix an array with partially failed drives without data loss and without downtime. In other words, you never have to degrade the array for a disk change; you can do it in optimal state.
A per-device bad block cache is implemented to speed up arrays with partially failed drives (replies from those are often slow). It also helps to identify badly damaged drives by their number of bad blocks, and an action can be taken if that count steps over a user-defined threshold (see /proc/sys/dev/raid/badblock_tolerance). A successful rewrite of a bad block deletes its entry from the cache. Performance is affected only a little while there are no (or few) registered bad blocks, but over a million entries it could currently become a problem.

Some words about error handling. The first big change is that you can now use an external error handler, meaning a user-space script is called by the kernel to handle the situation. The common method in this script is to call 'mdadm' and choose a return value (see below). This is good, for example, if you have one spare drive shared between two arrays; a script can handle that nicely. If the script failed to run (or does not exist), a default algorithm applies, with these main guidelines:

- a "disk fail" means the disk stepped over the badblock threshold or failed on a write
- if a drive fails in an optimal array and there is no spare, the disk is kicked from the array
- if a drive fails in a degraded array, the drive is _not_ kicked; processes get read/write errors if data is needed from the damaged sectors. If you want the old behavior, use an external error handler
- if a drive fails and there is a spare, proactive mirroring to the spare begins; the failing drive won't be kicked until the mirror is done

Well, better if you know: it's an ugly hack, I'm not a kernel guru, but I love the idea and now I can't live without it on my own servers (so it works for me). I hope somebody will implement this feature in a much nicer adaptation one day; I'll try to maintain this patch till then..
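For example, the threshold can be inspected and tuned through the sysctl this patch adds (the values below are illustrative, and the /proc path only exists on a patched kernel):

```shell
# read the current threshold (the patch defaults it to 10000 blocks)
cat /proc/sys/dev/raid/badblock_tolerance
# fail drives earlier, after 2000 cached bad blocks
echo 2000 > /proc/sys/dev/raid/badblock_tolerance
```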
You should put your external error handler script at "/sbin/mdevent"; it gets the following arguments:

1st: name of the md array (e.g. "md0")
2nd: kind of the fail event as a string, currently always "drivefail"
3rd: name of the drive (maybe major/minor numbers would be better; currently you can translate to those via /proc/partitions)

Let's see how you can handle some situations from the script. The array is optimal and a disk fails; you want to..

..fail that drive and add a spare for normal rebuilding:
	mdadm -f /dev/$1 /dev/$3
	mdadm -a /dev/$1 /dev/my_spare1
	exit 0

..start proactive mirroring of that disk:
	mdadm -a /dev/$1 /dev/my_spare1
	exit 0

..keep it on and reset the badblock cache:
	exit 1

..just keep it in sync:
	exit 0

..let the default action run:
	exit 2

Notice that when the proactive mirroring is done, the spare won't replace the source drive automatically; you should do that by hand or by a scheduled task. You get a last chance to re-think it. (raid6 could be another solution for this problem, but that's the big far evil in my eyes ;)

use:

1. patch the kernel; this one is against 2.6.14
2. type:

# make the drives
mdadm -B -n1 -l faulty -c4 /dev/md/1 /dev/rd/0
mdadm -B -n1 -l faulty -c4 /dev/md/2 /dev/rd/1
mdadm -B -n1 -l faulty -c4 /dev/md/3 /dev/rd/2
# make the array
mdadm -C -n3 -l5 /dev/md/0 /dev/md/1 /dev/md/2 /dev/md/3
# .. wait for sync ..
# grow bad blocks as ma*tor does :)
mdadm --grow -l faulty -p rp454 /dev/md/1
mdadm --grow -l faulty -p rp738 /dev/md/2
# add a spare
mdadm -a /dev/md/0 /dev/rd/4
# -> fail a drive, sync begins <-
# md/1 will not be marked as failed, this is the point, but
# if you want to, you can issue this command again
mdadm -f /dev/md/0 /dev/md/1
# kernel:
#  resync from md1 to spare ram4
#  added spare for active resync
# .. wonder at the read errors from md[12] while the sync goes on!
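The fragments above can be assembled into one complete handler. A minimal sketch follows, written as a shell function so it can be dry-run; the spare device path /dev/my_spare1 and the MDADM override variable are assumptions, and the return codes follow the convention described above:

```shell
# Sketch of the body of a hypothetical /sbin/mdevent. The kernel passes:
#   $1 = md array name, $2 = event kind ("drivefail"), $3 = drive name
mdevent_handler() {
    md="$1"; event="$2"; drive="$3"
    mdadm_cmd="${MDADM:-mdadm}"    # assumption: set MDADM=echo for a dry run
    spare="/dev/my_spare1"         # assumption: site-specific spare device

    case "$event" in
    drivefail)
        # start proactive mirroring of the failing drive onto the spare;
        # returning 0 tells the kernel to keep the drive in sync
        "$mdadm_cmd" -a "/dev/$md" "$spare" || return 2
        return 0
        ;;
    *)
        return 2    # unknown event: fall back to the kernel's default action
        ;;
    esac
}
```

To install it, put the function in /sbin/mdevent with a final line `mdevent_handler "$@"; exit $?` and make it executable.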
# feel free to stress the md at this time: mkfs, dd, badblocks, etc
# kernel:
#  raid5_spare_active: 3 in_sync 3->0
# /proc/mdstat:
#  md0 : active raid5 ram4[0] md3[2] md2[1] md1[0]
# -> ram4 and md1 have the same id; this means the spare is a complete mirror.
# If you stop the array you can assemble it with ram4 instead of md1;
# the superblock is the same on both.

# check the mirror (stop any write stress first)
mdadm --grow -l faulty -p none /dev/md/1
cmp /dev/md/1 /dev/rd/4

# hot-replace the mirrored (partially failed) device with the active spare
# (yes, mark it as failed again; if there is a syncing or synced 'active spare',
# -f really fails the device or replaces it with the synced spare)
mdadm -f /dev/md/0 /dev/md/1
# kernel:
#  replace md1 with in_sync active spare ram4
# and voila!
# /proc/mdstat:
#  md0 : active raid5 ram4[0] md3[2] md2[1]

-- dap
diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~	2005-09-16 12:21:24.000000000 +1000
+++ ./drivers/md/raid5.c	2005-09-16 12:57:12.000000000 +1000
@@ -349,7 +349,7 @@ static void shrink_stripes(raid5_conf_t
 	conf->slab_cache = NULL;
 }

-static int raid5_end_read_request (struct bio * bi, unsigned int bytes_done,
+static int raid5_end_read_request(struct bio * bi, unsigned int bytes_done,
 				   int error)
 {
 	struct stripe_head *sh = bi->bi_private;
@@ -401,10 +401,27 @@ static int raid5_end_read_request (struc
 	}
 #else
 	set_bit(R5_UPTODATE, &sh->dev[i].flags);
-#endif
+#endif
+		if (test_bit(R5_ReadError, &sh->dev[i].flags)) {
+			printk("R5: read error corrected!!\n");
+			clear_bit(R5_ReadError, &sh->dev[i].flags);
+			clear_bit(R5_ReWrite, &sh->dev[i].flags);
+		}
 	} else {
-		md_error(conf->mddev, conf->disks[i].rdev);
 		clear_bit(R5_UPTODATE, &sh->dev[i].flags);
+		if (conf->mddev->degraded) {
+			printk("R5: read error not correctable.\n");
+			clear_bit(R5_ReadError, &sh->dev[i].flags);
+			clear_bit(R5_ReWrite, &sh->dev[i].flags);
+			md_error(conf->mddev, conf->disks[i].rdev);
+		} else if (test_bit(R5_ReWrite, &sh->dev[i].flags)) {
+			/* Oh, no!!! */
+			printk("R5: read error NOT corrected!!\n");
+			clear_bit(R5_ReadError, &sh->dev[i].flags);
+			clear_bit(R5_ReWrite, &sh->dev[i].flags);
+			md_error(conf->mddev, conf->disks[i].rdev);
+		} else
+			set_bit(R5_ReadError, &sh->dev[i].flags);
 	}
 	rdev_dec_pending(conf->disks[i].rdev, conf->mddev);
 #if 0
@@ -966,6 +983,12 @@ static void handle_stripe(struct stripe_
 		if (dev->written) written++;
 		rdev = conf->disks[i].rdev; /* FIXME, should I be looking rdev */
 		if (!rdev || !rdev->in_sync) {
+			/* The ReadError flag wil just be confusing now */
+			clear_bit(R5_ReadError, &dev->flags);
+			clear_bit(R5_ReWrite, &dev->flags);
+		}
+		if (!rdev || !rdev->in_sync
+		    || test_bit(R5_ReadError, &dev->flags)) {
 			failed++;
 			failed_num = i;
 		} else
@@ -980,6 +1003,14 @@ static void handle_stripe(struct stripe_
 	if (failed > 1 && to_read+to_write+written) {
 		for (i=disks; i--; ) {
 			int bitmap_end = 0;
+
+			if (test_bit(R5_ReadError, &sh->dev[i].flags)) {
+				mdk_rdev_t *rdev = conf->disks[i].rdev;
+				if (rdev && rdev->in_sync)
+					/* multiple read failures in one stripe */
+					md_error(conf->mddev, rdev);
+			}
+
 			spin_lock_irq(&conf->device_lock);
 			/* fail all writes first */
 			bi = sh->dev[i].towrite;
@@ -1015,7 +1046,8 @@ static void handle_stripe(struct stripe_
 			}

 			/* fail any reads if this device is non-operational */
-			if (!test_bit(R5_Insync, &sh->dev[i].flags)) {
+			if (!test_bit(R5_Insync, &sh->dev[i].flags) ||
+			    test_bit(R5_ReadError, &sh->dev[i].flags)) {
 				bi = sh->dev[i].toread;
 				sh->dev[i].toread = NULL;
 				if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
@@ -1274,7 +1306,26 @@ static void handle_stripe(struct stripe_
 			md_done_sync(conf->mddev, STRIPE_SECTORS,1);
 			clear_bit(STRIPE_SYNCING, &sh->state);
 		}
-
+
+	/* If the failed drive is just a ReadError, then we might need to progress
+	 * the repair/check process
+	 */
+	if (failed == 1 && test_bit(R5_ReadError, &sh->dev[failed_num].flags)
+	    && !test_bit(R5_LOCKED, &sh->dev[failed_num].flags)
+	    && test_bit(R5_UPTODATE, &sh->dev[failed_num].flags)
+	    ) {
+		dev = &sh->dev[failed_num];
+		if (!test_bit(R5_ReWrite, &dev->flags)) {
+			set_bit(R5_Wantwrite, &dev->flags);
+			set_bit(R5_ReWrite, &dev->flags);
+			set_bit(R5_LOCKED, &dev->flags);
+		} else {
+			/* let's read it back */
+			set_bit(R5_Wantread, &dev->flags);
+			set_bit(R5_LOCKED, &dev->flags);
+		}
+	}
+
 	spin_unlock(&sh->lock);

 	while ((bi=return_bi)) {

diff ./include/linux/raid/raid5.h~current~ ./include/linux/raid/raid5.h
--- ./include/linux/raid/raid5.h~current~	2005-09-16 12:21:24.000000000 +1000
+++ ./include/linux/raid/raid5.h	2005-09-16 12:55:51.000000000 +1000
@@ -154,6 +154,8 @@ struct stripe_head {
 #define	R5_Wantwrite	5
 #define	R5_Syncio	6	/* this io need to be accounted as resync io */
 #define	R5_Overlap	7	/* There is a pending overlapping request on this block */
+#define	R5_ReadError	8	/* seen a read error here recently */
+#define	R5_ReWrite	9	/* have tried to over-write the readerror */

 /*
  * Write method
--- linux/include/linux/sysctl.h.orig	2005-11-08 14:41:06.000000000 +0100
+++ linux/include/linux/sysctl.h	2005-11-09 20:08:51.000000000 +0100
@@ -758,7 +758,8 @@
 /* /proc/sys/dev/raid */
 enum {
 	DEV_RAID_SPEED_LIMIT_MIN=1,
-	DEV_RAID_SPEED_LIMIT_MAX=2
+	DEV_RAID_SPEED_LIMIT_MAX=2,
+	DEV_RAID_BADBLOCK_TOLERANCE=3
 };

 /* /proc/sys/dev/parport/default */
--- linux/include/linux/raid/md_k.h.orig	2005-10-28 02:02:08.000000000 +0200
+++ linux/include/linux/raid/md_k.h	2005-11-09 20:06:02.000000000 +0100
@@ -165,6 +165,11 @@
 	char				uuid[16];

 	struct mdk_thread_s		*thread;	/* management thread */
+	struct mdk_thread_s		*eeh_thread;	/* external error handler */
+	struct eeh_data {
+		int	failed_num;			/* drive # */
+	} eeh_data;
+
 	struct mdk_thread_s		*sync_thread;	/* doing resync or reconstruct */
 	sector_t			curr_resync;	/* blocks scheduled */
 	unsigned long			resync_mark;	/* a recent timestamp */
--- linux/include/linux/raid/raid5.h.orig	2005-11-08 18:26:48.000000000 +0100
+++ linux/include/linux/raid/raid5.h	2005-11-09 22:27:58.000000000 +0100
@@ -156,6 +156,7 @@
 #define	R5_Overlap	7	/* There is a pending overlapping request on this block */
 #define	R5_ReadError	8	/* seen a read error here recently */
 #define	R5_ReWrite	9	/* have tried to over-write the readerror */
+#define	R5_HardReadErr	10	/* rewrite failed, put into badblocks list */

 /*
  * Write method
@@ -200,8 +201,16 @@
  */

+struct badblock {
+	struct badblock	*hash_next, **hash_pprev;	/* hash pointers */
+	sector_t	sector;				/* stripe # */
+};
+
 struct disk_info {
 	mdk_rdev_t	*rdev;
+	struct badblock	**badblock_hashtbl;	/* list of known badblocks */
+	char		cache_name[20];
+	kmem_cache_t	*slab_cache;		/* badblock db */
 };

 struct raid5_private_data {
@@ -238,6 +247,8 @@
 	int			inactive_blocked;	/* release of inactive stripes blocked,
 							 * waiting for 25% to be free
 							 */
+	int			mirrorit;	/* source for active spare resync */
+
 	spinlock_t		device_lock;
 	struct disk_info	disks[0];
 };
--- linux/drivers/md/md.c.orig	2005-10-28 02:02:08.000000000 +0200
+++ linux/drivers/md/md.c	2005-11-09 20:18:39.000000000 +0100
@@ -85,6 +85,10 @@
 static int sysctl_speed_limit_min = 1000;
 static int sysctl_speed_limit_max = 200000;

+/* the drive'll be marked failed over this threshold. measure is block. */
+int sysctl_badblock_tolerance = 10000;
+
+
 static struct ctl_table_header *raid_table_header;

 static ctl_table raid_table[] = {
@@ -104,6 +108,14 @@
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec,
 	},
+	{
+		.ctl_name	= DEV_RAID_BADBLOCK_TOLERANCE,
+		.procname	= "badblock_tolerance",
+		.data		= &sysctl_badblock_tolerance,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
 	{ .ctl_name = 0 }
 };

@@ -4097,6 +4109,8 @@
 EXPORT_SYMBOL(md_wakeup_thread);
 EXPORT_SYMBOL(md_print_devices);
 EXPORT_SYMBOL(md_check_recovery);
+EXPORT_SYMBOL(kick_rdev_from_array); // fixme
+EXPORT_SYMBOL(sysctl_badblock_tolerance);
 MODULE_LICENSE("GPL");
 MODULE_ALIAS("md");
 MODULE_ALIAS_BLOCKDEV_MAJOR(MD_MAJOR);
--- linux/drivers/md/raid5.c.orig	2005-11-08 18:26:48.000000000 +0100
+++ linux/drivers/md/raid5.c	2005-11-10 02:32:52.000000000 +0100
@@ -42,6 +42,18 @@

 #define stripe_hash(conf, sect)	((conf)->stripe_hashtbl[((sect) >> STRIPE_SHIFT) & HASH_MASK])

+/*
+ * per-device badblock cache
+ */
+
+#define BB_SHIFT	(PAGE_SHIFT/*12*/ - 9)
+#define BB_HASH_PAGES	1
+#define BB_NR_HASH	(HASH_PAGES * PAGE_SIZE / sizeof(struct badblock *))
+#define BB_HASH_MASK	(BB_NR_HASH - 1)
+
+#define bb_hash(disk, sect)	((disk)->badblock_hashtbl[((sect) >> BB_SHIFT) & BB_HASH_MASK])
+#define bb_hashnr(sect)		(((sect) >> BB_SHIFT) & BB_HASH_MASK)
+
 /* bio's attached to a stripe+device for I/O are linked together in bi_sector
  * order without overlap.  There may be several bio's per stripe+device, and
  * a bio could span several devices.
@@ -55,7 +67,7 @@ /* * The following can be used to debug the driver */ -#define RAID5_DEBUG 0 +#define RAID5_DEBUG 1 #define RAID5_PARANOIA 1 #if RAID5_PARANOIA && defined(CONFIG_SMP) # define CHECK_DEVLOCK() assert_spin_locked(&conf->device_lock) @@ -63,13 +75,162 @@ # define CHECK_DEVLOCK() #endif -#define PRINTK(x...) ((void)(RAID5_DEBUG && printk(x))) +/* use External Error Handler? */ +#define USEREH 1 + +#define PRINTK(x...) ((void)(RAID5_DEBUG && printk(KERN_DEBUG x))) #if RAID5_DEBUG #define inline #define __inline__ #endif static void print_raid5_conf (raid5_conf_t *conf); +extern int sysctl_badblock_tolerance; + + +static void bb_insert_hash(struct disk_info *disk, struct badblock *bb) +{ + struct badblock **bbp = &bb_hash(disk, bb->sector); + + /*printk("bb_insert_hash(), sector %llu hashnr %lu\n", (unsigned long long)bb->sector, + bb_hashnr(bb->sector));*/ + + if ((bb->hash_next = *bbp) != NULL) + (*bbp)->hash_pprev = &bb->hash_next; + *bbp = bb; + bb->hash_pprev = bbp; +} + +static void bb_remove_hash(struct badblock *bb) +{ + /*printk("remove_hash(), sector %llu hashnr %lu\n", (unsigned long long)bb->sector, + bb_hashnr(bb->sector));*/ + + if (bb->hash_pprev) { + if (bb->hash_next) + bb->hash_next->hash_pprev = bb->hash_pprev; + *bb->hash_pprev = bb->hash_next; + bb->hash_pprev = NULL; + } +} + +static struct badblock *__find_badblock(struct disk_info *disk, sector_t sector) +{ + struct badblock *bb; + + for (bb = bb_hash(disk, sector); bb; bb = bb->hash_next) + if (bb->sector == sector) + return bb; + return NULL; +} + +static struct badblock *find_badblock(struct disk_info *disk, sector_t sector) +{ + raid5_conf_t *conf = (raid5_conf_t *) disk->rdev->mddev->private; + struct badblock *bb; + + spin_lock_irq(&conf->device_lock); + bb = __find_badblock(disk, sector); + spin_unlock_irq(&conf->device_lock); + return bb; +} + +static unsigned long count_badblocks (struct disk_info *disk) +{ + raid5_conf_t *conf = (raid5_conf_t *) 
disk->rdev->mddev->private; + struct badblock *bb; + int j; + int n = 0; + + spin_lock_irq(&conf->device_lock); + for (j = 0; j < BB_NR_HASH; j++) { + bb = disk->badblock_hashtbl[j]; + for (; bb; bb = bb->hash_next) + n++; + } + spin_unlock_irq(&conf->device_lock); + + return n; +} + +static int grow_badblocks(struct disk_info *disk) +{ + char b[BDEVNAME_SIZE]; + kmem_cache_t *sc; + + /* hash table */ + if ((disk->badblock_hashtbl = (struct badblock **) __get_free_pages(GFP_ATOMIC, HASH_PAGES_ORDER)) == NULL) { + printk("grow_badblocks: __get_free_pages failed\n"); + return 0; + } + memset(disk->badblock_hashtbl, 0, BB_HASH_PAGES * PAGE_SIZE); + + /* badblocks db */ + sprintf(disk->cache_name, "raid5/%s_%s_bbc", mdname(disk->rdev->mddev), + bdevname(disk->rdev->bdev, b)); + sc = kmem_cache_create(disk->cache_name, + sizeof(struct badblock), + 0, 0, NULL, NULL); + if (!sc) { + printk("grow_badblocks: kmem_cache_create failed\n"); + return 1; + } + disk->slab_cache = sc; + + return 0; +} + +static void shrink_badblocks(struct disk_info *disk) +{ + struct badblock *bb; + int j; + + /* badblocks db */ + for (j = 0; j < BB_NR_HASH; j++) { + bb = disk->badblock_hashtbl[j]; + for (; bb; bb = bb->hash_next) + kmem_cache_free(disk->slab_cache, bb); + } + kmem_cache_destroy(disk->slab_cache); + disk->slab_cache = NULL; + + /* hash table */ + free_pages((unsigned long) disk->badblock_hashtbl, HASH_PAGES_ORDER); +} + +static void store_badblock(struct disk_info *disk, sector_t sector) +{ + struct badblock *bb; + raid5_conf_t *conf = (raid5_conf_t *) disk->rdev->mddev->private; + + bb = kmem_cache_alloc(disk->slab_cache, GFP_KERNEL); + if (!bb) { + printk("store_badblock: kmem_cache_alloc failed\n"); + return; + } + memset(bb, 0, sizeof(*bb)); + bb->sector = sector; + + spin_lock_irq(&conf->device_lock); + bb_insert_hash(disk, bb); + spin_unlock_irq(&conf->device_lock); +} + +static void delete_badblock(struct disk_info *disk, sector_t sector) +{ + struct badblock *bb; + 
raid5_conf_t *conf = (raid5_conf_t *) disk->rdev->mddev->private; + + bb = find_badblock(disk, sector); + if (!bb) + /* reset on write'll call us like an idiot :} */ + return; + spin_lock_irq(&conf->device_lock); + bb_remove_hash(bb); + kmem_cache_free(disk->slab_cache, bb); + spin_unlock_irq(&conf->device_lock); +} + static inline void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh) { @@ -208,7 +369,7 @@ sh->pd_idx = pd_idx; sh->state = 0; - for (i=disks; i--; ) { + for (i=disks+1; i--; ) { struct r5dev *dev = &sh->dev[i]; if (dev->toread || dev->towrite || dev->written || @@ -301,8 +462,10 @@ sprintf(conf->cache_name, "raid5/%s", mdname(conf->mddev)); + /* +1: we need extra space in the *sh->devs for the 'active spare' to keep + handle_stripe() simple */ sc = kmem_cache_create(conf->cache_name, - sizeof(struct stripe_head)+(devs-1)*sizeof(struct r5dev), + sizeof(struct stripe_head)+(devs-1+1)*sizeof(struct r5dev), 0, 0, NULL, NULL); if (!sc) return 1; @@ -311,12 +474,12 @@ sh = kmem_cache_alloc(sc, GFP_KERNEL); if (!sh) return 1; - memset(sh, 0, sizeof(*sh) + (devs-1)*sizeof(struct r5dev)); + memset(sh, 0, sizeof(*sh) + (devs-1+1)*sizeof(struct r5dev)); sh->raid_conf = conf; spin_lock_init(&sh->lock); - if (grow_buffers(sh, conf->raid_disks)) { - shrink_buffers(sh, conf->raid_disks); + if (grow_buffers(sh, conf->raid_disks+1)) { + shrink_buffers(sh, conf->raid_disks+1); kmem_cache_free(sc, sh); return 1; } @@ -408,18 +571,40 @@ clear_bit(R5_ReWrite, &sh->dev[i].flags); } } else { + int keepon = 0; + clear_bit(R5_UPTODATE, &sh->dev[i].flags); + /* + rule 1.,: try to keep all disk in_sync even if we've got + unfixable read errors, cause the 'active spare' may can + rebuild a complete column from partially failed drives + */ + if (conf->disks[i].rdev->in_sync) { + char b[BDEVNAME_SIZE]; + printk(KERN_ALERT + "raid5_end_read_request: Read failure %s on sector %llu (%d) in %s mode\n", + bdevname(conf->disks[i].rdev->bdev, b), + (unsigned long 
long)sh->sector, atomic_read(&sh->count), + conf->working_disks >= conf->raid_disks ? "optimal" : "degraded"); + keepon++; + } if (conf->mddev->degraded) { printk("R5: read error not correctable.\n"); clear_bit(R5_ReadError, &sh->dev[i].flags); clear_bit(R5_ReWrite, &sh->dev[i].flags); - md_error(conf->mddev, conf->disks[i].rdev); + if (!keepon) + md_error(conf->mddev, conf->disks[i].rdev); + else + set_bit(R5_HardReadErr, &sh->dev[i].flags); } else if (test_bit(R5_ReWrite, &sh->dev[i].flags)) { /* Oh, no!!! */ printk("R5: read error NOT corrected!!\n"); clear_bit(R5_ReadError, &sh->dev[i].flags); clear_bit(R5_ReWrite, &sh->dev[i].flags); - md_error(conf->mddev, conf->disks[i].rdev); + if (!keepon) + md_error(conf->mddev, conf->disks[i].rdev); + else + set_bit(R5_HardReadErr, &sh->dev[i].flags); } else set_bit(R5_ReadError, &sh->dev[i].flags); } @@ -457,13 +642,18 @@ PRINTK("end_write_request %llu/%d, count %d, uptodate: %d.\n", (unsigned long long)sh->sector, i, atomic_read(&sh->count), uptodate); + /* sorry if (i == disks) { BUG(); return 0; - } + }*/ spin_lock_irqsave(&conf->device_lock, flags); if (!uptodate) + /* we must fail this drive, cause risks the integrity of data + if this sector is readable. later, we could check + is it this readable, if not, then we can handle it as a + common badblock. */ md_error(conf->mddev, conf->disks[i].rdev); rdev_dec_pending(conf->disks[i].rdev, conf->mddev); @@ -494,33 +684,154 @@ dev->req.bi_private = sh; dev->flags = 0; - if (i != sh->pd_idx) + if (i != sh->pd_idx && i < sh->raid_conf->raid_disks) /* active spare? 
*/ dev->sector = compute_blocknr(sh, i); } +static int raid5_remove_disk(mddev_t *mddev, int number); +static int raid5_add_disk(mddev_t *mddev, mdk_rdev_t *rdev); +/*static*/ void kick_rdev_from_array(mdk_rdev_t * rdev); static void error(mddev_t *mddev, mdk_rdev_t *rdev) { char b[BDEVNAME_SIZE]; + char b2[BDEVNAME_SIZE]; raid5_conf_t *conf = (raid5_conf_t *) mddev->private; PRINTK("raid5: error called\n"); if (!rdev->faulty) { - mddev->sb_dirty = 1; - if (rdev->in_sync) { - conf->working_disks--; - mddev->degraded++; - conf->failed_disks++; - rdev->in_sync = 0; - /* - * if recovery was running, make sure it aborts. - */ - set_bit(MD_RECOVERY_ERR, &mddev->recovery); - } - rdev->faulty = 1; - printk (KERN_ALERT - "raid5: Disk failure on %s, disabling device." - " Operation continuing on %d devices\n", - bdevname(rdev->bdev,b), conf->working_disks); + int mddisks = 0; + mdk_rdev_t *rd; + mdk_rdev_t *rdevs = NULL; + struct list_head *rtmp; + int i; + + ITERATE_RDEV(mddev,rd,rtmp) + { + printk("mddev%d: %s\n", mddisks, bdevname(rd->bdev,b)); + mddisks++; + } + for (i = 0; i < mddisks && (rd = conf->disks[i].rdev); i++) { + printk("r5dev%d: %s\n", i, bdevname(rd->bdev,b)); + } + ITERATE_RDEV(mddev,rd,rtmp) + { + rdevs = rd; + break; + } +printk("II %d %d > %d %d ins:%d %p\n", + mddev->raid_disks, mddisks, conf->raid_disks, mddev->degraded, rdev->in_sync, rdevs); + if (conf->disks[conf->raid_disks].rdev == rdev + && conf->mirrorit != -1) { + /* in_sync, but must be handled specially, don't let 'degraded++' */ + printk (KERN_ALERT "active spare has failed %s (in_sync %d)\n", + bdevname(rdev->bdev,b), rdev->in_sync); + mddev->sb_dirty = 1; + if (rdev->in_sync) + rdev->raid_disk = conf->raid_disks; /* me as myself, again ;) */ + rdev->in_sync = 0; + rdev->faulty = 1; + conf->mirrorit = -1; + } else if (mddisks > conf->raid_disks && !mddev->degraded && rdev->in_sync) { + /* have active spare, array is optimal, removed disk member + of it (but not the active spare) */ + if 
(rdev->raid_disk == conf->mirrorit && conf->disks[conf->raid_disks].rdev) { + if (!conf->disks[conf->raid_disks].rdev->in_sync) { + printk(KERN_ALERT "disk %s failed and active spare isn't in_sync yet, readd as normal spare\n", + bdevname(rdev->bdev,b)); + conf->mirrorit = -1; + goto letitgo; + } else { + int ret; + + /* hot replace the mirrored drive with the 'active spare' + this is really "hot", I can't see clearly the things + what I have to do here. :} + pray. */ + + printk(KERN_ALERT "replace %s with in_sync active spare %s\n", + bdevname(rdev->bdev,b), + bdevname(rdevs->bdev,b2)); + rdev->in_sync = 0; + rdev->faulty = 1; + + conf->mirrorit = -1; + + /* my God, am I sane? */ + while ((i = atomic_read(&rdev->nr_pending))) { + printk("waiting for disk %d .. %d\n", + rdev->raid_disk, i); + } + ret = raid5_remove_disk(mddev, rdev->raid_disk); + if (ret) { + printk(KERN_ERR "raid5_remove_disk1: busy?!\n"); + return; // should nothing to do + } + + rd = conf->disks[conf->raid_disks].rdev; + while ((i = atomic_read(&rd->nr_pending))) { + printk("waiting for disk %d .. %d\n", + conf->raid_disks, i); + } + rd->in_sync = 0; + ret = raid5_remove_disk(mddev, conf->raid_disks); + if (ret) { + printk(KERN_ERR "raid5_remove_disk2: busy?!\n"); + return; // .. + } + + ret = raid5_add_disk(mddev, rd); + if (!ret) { + printk(KERN_ERR "raid5_add_disk: no free slot?!\n"); + return; // .. 
+ } + rd->in_sync = 1; + + /* borrowed from hot_remove_disk() */ + kick_rdev_from_array(rdev); + mddev->sb_dirty = 1; + } + } else { + /* in_sync disk failed (!degraded), have a spare, starting + proactive mirroring */ + if (conf->mirrorit == -1) { + printk(KERN_ALERT "resync from %s to spare %s (%d)\n", + bdevname(rdev->bdev,b), + bdevname(rdevs->bdev,b2), + conf->raid_disks); + + conf->mirrorit = rdev->raid_disk; + + mddev->degraded++; /* to call raid5_hot_add_disk(), reset there */ + } else { + printk(KERN_ALERT "proactive mirroring is active, let this device go\n"); + goto letitgo; + } + } + } else { +letitgo: + mddev->sb_dirty = 1; + if (rdev->in_sync) { + conf->working_disks--; + mddev->degraded++; + conf->failed_disks++; + rdev->in_sync = 0; + /* error() was not called if the syncing was stopped by IO error */ + if (conf->mirrorit != -1 && + !conf->disks[conf->raid_disks].rdev->in_sync) { + printk(KERN_NOTICE "stop proactive mirroring\n"); + conf->mirrorit = -1; + } + /* + * if recovery was running, make sure it aborts. + */ + set_bit(MD_RECOVERY_ERR, &mddev->recovery); + } + rdev->faulty = 1; + printk (KERN_ALERT + "raid5: Disk failure on %s, disabling device." + " Operation continuing on %d devices\n", + bdevname(rdev->bdev,b), conf->working_disks); + } } } @@ -896,6 +1207,74 @@ } +static int raid5_spare_active(mddev_t *mddev); + +static void raid5_eeh (mddev_t *mddev) +{ + raid5_conf_t *conf = mddev_to_conf(mddev); + int i = conf->mddev->eeh_data.failed_num; + struct disk_info *disk = &conf->disks[i]; + char b[BDEVNAME_SIZE]; + static char *envp[] = { "HOME=/", + "TERM=linux", + "PATH=/sbin:/usr/sbin:/bin:/usr/bin", + NULL }; + int ret; + int j; + + /* suspend IO; todo: well, we should walk over on disks and waiting till + (nr_pending > 0) */ + printk("raid5_usereh active [%d, %x]\n", i, disk->rdev); + + if (i < 0 || !disk->rdev) { + // fixme: why called on md_unregister? 
+ printk(KERN_ALERT "ERROR: !disk->rdev [%d]\n", i); + goto eeh_out; + } + + if (mddev->degraded) { + printk(KERN_ALERT "array is already degraded, don't kick this device\n"); + goto eeh_out; + } + + { + char *argv[] = { "/sbin/mdevent", mdname(mddev), "drivefail", + bdevname(disk->rdev->bdev, b), NULL }; + ret = call_usermodehelper("/sbin/mdevent", argv, envp, 1/*wait*/); + ret = ret >> 8; + if (ret < 0 || ret > 1) { + printk(KERN_ALERT "/sbin/mdevent failed: %d\n", ret); + md_error(mddev, disk->rdev); + /* (the raid5_remove_disk and raid5_add_disk wasn't called yet) */ + } + } + + switch (ret) { + case 1: /* reset badblock cache (later: rewrite bad blocks?) */ + printk(KERN_INFO "resetting badblocks cache\n"); + for (j = 0; j < BB_NR_HASH; j++) { + struct badblock *bb, *bbprev = NULL; + bb = disk->badblock_hashtbl[j]; + for (; bb; bb = bb->hash_next) { + if (bbprev) + kmem_cache_free(disk->slab_cache, bbprev); + bb_remove_hash(bb); + bbprev = bb; + } + if (bbprev) + kmem_cache_free(disk->slab_cache, bbprev); + } + break; + default: + break; + } + +eeh_out: + mddev->eeh_data.failed_num = -1; /* unregister me */ + md_wakeup_thread(mddev->thread); + printk("raid5_usereh exited\n"); +} + /* * handle_stripe - do things to a stripe. 
* @@ -925,21 +1304,37 @@ int locked=0, uptodate=0, to_read=0, to_write=0, failed=0, written=0; int non_overwrite = 0; int failed_num=0; + int aspare=0, asparenum=-1; + struct disk_info *asparedev; struct r5dev *dev; PRINTK("handling stripe %llu, cnt=%d, pd_idx=%d\n", (unsigned long long)sh->sector, atomic_read(&sh->count), sh->pd_idx); + if (conf->mddev->eeh_thread) { + PRINTK("pass the stripe, eeh is active\n"); + set_bit(STRIPE_HANDLE, &sh->state); + return; + } + spin_lock(&sh->lock); clear_bit(STRIPE_HANDLE, &sh->state); clear_bit(STRIPE_DELAYED, &sh->state); syncing = test_bit(STRIPE_SYNCING, &sh->state); + asparedev = &conf->disks[conf->raid_disks]; + if (!conf->mddev->degraded && asparedev->rdev && !asparedev->rdev->faulty && + conf->mirrorit != -1) { + aspare++; + asparenum = sh->raid_conf->mirrorit; + PRINTK("has aspare (%d)\n", asparenum); + } /* Now to look around and see what can be done */ - for (i=disks; i--; ) { + for (i=disks+aspare; i--; ) { mdk_rdev_t *rdev; + struct badblock *bb = NULL; dev = &sh->dev[i]; clear_bit(R5_Insync, &dev->flags); clear_bit(R5_Syncio, &dev->flags); @@ -982,18 +1377,79 @@ } if (dev->written) written++; rdev = conf->disks[i].rdev; /* FIXME, should I be looking rdev */ + if (rdev && rdev->in_sync && + !test_bit(R5_UPTODATE, &dev->flags) && + !test_bit(R5_LOCKED, &dev->flags)) { + /* ..potentially deserved to read, we must check it + checkme, it could be a big performance penalty if called + without a good reason! 
it's seems ok for now + */ + PRINTK("find_badblock %d: %llu\n", i, sh->sector); + bb = find_badblock(&conf->disks[i], sh->sector); + } if (!rdev || !rdev->in_sync) { /* The ReadError flag wil just be confusing now */ clear_bit(R5_ReadError, &dev->flags); clear_bit(R5_ReWrite, &dev->flags); } if (!rdev || !rdev->in_sync - || test_bit(R5_ReadError, &dev->flags)) { + || test_bit(R5_ReadError, &dev->flags) /*&& !test_bit(R5_UPTODATE, &dev->flags))*/ + || test_bit(R5_HardReadErr, &dev->flags) + || bb) { + if (rdev && rdev->in_sync + && !bb && test_bit(R5_HardReadErr, &dev->flags)) { + /* take an action only if it's a _new_ bad block + and not while proactive mirroring is running */ + + if (!aspare || (aspare && asparedev->rdev->in_sync/*asparenum != i*/)) { + /* if aspare is syncing we shouldn't register new + bad blocks, after the sync this disk will + be kicked anyway */ + + if (test_bit(R5_HardReadErr, &dev->flags)) { + PRINTK("store_badblock %d: %llu\n", i, sh->sector); + store_badblock(&conf->disks[i], sh->sector); + } + + if (count_badblocks(&conf->disks[i]) >= sysctl_badblock_tolerance) { + char b[BDEVNAME_SIZE]; + + printk(KERN_ALERT "too many badblocks (%lu) on device %s [%d]\n", + count_badblocks(&conf->disks[i]) + 1, bdevname(conf->disks[i].rdev->bdev, b), + atomic_read(&rdev->nr_pending)); +#ifndef USEREH + md_error(conf->mddev, conf->disks[i].rdev); +#else + if (!conf->mddev->eeh_thread) { + conf->mddev->eeh_thread = md_register_thread(raid5_eeh, conf->mddev, "%s_eeh"); + if (!conf->mddev->eeh_thread) { + printk(KERN_ERR + "raid5: couldn't allocate external error handler thread for %s\n", + mdname(conf->mddev)); + md_error(conf->mddev, conf->disks[i].rdev); + } else { + conf->mddev->eeh_data.failed_num = i; + md_wakeup_thread(conf->mddev->eeh_thread); + } + } +#endif + } + } + + // ha kozben volt masik IO es azt kapjuk elobb ide?? 
+ // hasonlo dolog van a rewrite-nal is, egy cipo + clear_bit(R5_HardReadErr, &dev->flags); + } failed++; failed_num = i; - } else + PRINTK("device %d failed for this stripe r%p w%p\n", i, dev->toread, dev->towrite); + } else { set_bit(R5_Insync, &dev->flags); + } } + if (aspare && failed > 1) + failed--; /* failed = 1 means "all ok" if we've aspare, this is simplest + method to do our work */ PRINTK("locked=%d uptodate=%d to_read=%d" " to_write=%d failed=%d failed_num=%d\n", locked, uptodate, to_read, to_write, failed, failed_num); @@ -1047,7 +1503,7 @@ /* fail any reads if this device is non-operational */ if (!test_bit(R5_Insync, &sh->dev[i].flags) || - test_bit(R5_ReadError, &sh->dev[i].flags)) { + test_bit(R5_ReadError, &sh->dev[i].flags)) { // have meaning of this?? bi = sh->dev[i].toread; sh->dev[i].toread = NULL; if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags)) @@ -1070,6 +1526,8 @@ } } if (failed > 1 && syncing) { + printk(KERN_ALERT "sync stopped by IO error, marking the spare failed\n"); + conf->disks[failed_num].rdev->faulty = 1; md_done_sync(conf->mddev, STRIPE_SECTORS,0); clear_bit(STRIPE_SYNCING, &sh->state); syncing = 0; @@ -1249,6 +1707,26 @@ PRINTK("Writing block %d\n", i); locked++; set_bit(R5_Wantwrite, &sh->dev[i].flags); + if (aspare && i == asparenum) { + char *ps, *pd; + + /* mirroring this new block */ + PRINTK("Writing to aspare too %d->%d\n", + i, conf->raid_disks); + /*if (test_bit(R5_LOCKED, &sh->dev[conf->raid_disks].flags)) { + printk("bazmeg, ez lokkolt1!!!\n"); + }*/ + ps = page_address(sh->dev[i].page); + pd = page_address(sh->dev[conf->raid_disks].page); + /* better idea? 
*/
+						memcpy(pd, ps, STRIPE_SIZE);
+						set_bit(R5_LOCKED, &sh->dev[conf->raid_disks].flags);
+						set_bit(R5_Wantwrite, &sh->dev[conf->raid_disks].flags);
+					}
+					if (conf->disks[i].rdev && conf->disks[i].rdev->in_sync) {
+						PRINTK("reset badblock on %d: %llu\n", i, sh->sector);
+						delete_badblock(&conf->disks[i], sh->sector);
+					}
 					if (!test_bit(R5_Insync, &sh->dev[i].flags)
 					    || (i==sh->pd_idx && failed == 0))
 						set_bit(STRIPE_INSYNC, &sh->state);
@@ -1285,14 +1763,30 @@
 			if (failed==0)
 				failed_num = sh->pd_idx;
 			/* should be able to compute the missing block and write it to spare */
+			if (aspare)
+				failed_num = asparenum;
 			if (!test_bit(R5_UPTODATE, &sh->dev[failed_num].flags)) {
 				if (uptodate+1 != disks)
 					BUG();
 				compute_block(sh, failed_num);
 				uptodate++;
 			}
+			if (aspare) {
+				char *ps, *pd;
+
+				ps = page_address(sh->dev[failed_num].page);
+				pd = page_address(sh->dev[conf->raid_disks].page);
+				memcpy(pd, ps, STRIPE_SIZE);
+				PRINTK("R5_Wantwrite to aspare, uptodate: %d %p->%p\n",
+					uptodate, ps, pd);
+				/* if (test_bit(R5_LOCKED, &sh->dev[conf->raid_disks].flags))
+					printk("damn, this one is locked (2)!\n"); */
+			}
 			if (uptodate != disks)
 				BUG();
+			if (aspare)
+				failed_num = conf->raid_disks;
 			dev = &sh->dev[failed_num];
 			set_bit(R5_LOCKED, &dev->flags);
 			set_bit(R5_Wantwrite, &dev->flags);
@@ -1300,6 +1794,9 @@
 			locked++;
 			set_bit(STRIPE_INSYNC, &sh->state);
 			set_bit(R5_Syncio, &dev->flags);
+			/* !in_sync..
+			printk("reset badblock on %d: %llu\n", failed_num, sh->sector);
+			delete_badblock(&conf->disks[failed_num], sh->sector); */
 		}
 	}
 	if (syncing && locked == 0 && test_bit(STRIPE_INSYNC, &sh->state)) {
@@ -1336,7 +1833,7 @@
 		bi->bi_size = 0;
 		bi->bi_end_io(bi, bytes, 0);
 	}
-	for (i=disks; i-- ;) {
+	for (i=disks+aspare; i-- ;) {
 		int rw;
 		struct bio *bi;
 		mdk_rdev_t *rdev;
@@ -1674,11 +2171,20 @@
 	md_check_recovery(mddev);
 
+	if (mddev->eeh_thread && mddev->eeh_data.failed_num == -1) {
+		printk(KERN_INFO "eeh_thread is done, unregistering\n");
+		md_unregister_thread(mddev->eeh_thread);
+		mddev->eeh_thread = NULL;
+	}
+
 	handled = 0;
 	spin_lock_irq(&conf->device_lock);
 	while (1) {
 		struct list_head *first;
 
+		if (mddev->eeh_thread)
+			break;
+
 		if (conf->seq_flush - conf->seq_write > 0) {
 			int seq = conf->seq_flush;
 			bitmap_unplug(mddev->bitmap);
@@ -1733,11 +2239,11 @@
 	}
 
 	mddev->private = kmalloc (sizeof (raid5_conf_t)
-				  + mddev->raid_disks * sizeof(struct disk_info),
+				  + (mddev->raid_disks + 1) * sizeof(struct disk_info),
 				  GFP_KERNEL);
 	if ((conf = mddev->private) == NULL)
 		goto abort;
-	memset (conf, 0, sizeof (*conf) + mddev->raid_disks * sizeof(struct disk_info) );
+	memset (conf, 0, sizeof (*conf) + (mddev->raid_disks + 1) * sizeof(struct disk_info) );
 	conf->mddev = mddev;
 
 	if ((conf->stripe_hashtbl = (struct stripe_head **) __get_free_pages(GFP_ATOMIC, HASH_PAGES_ORDER)) == NULL)
@@ -1765,6 +2271,8 @@
 		disk->rdev = rdev;
 
+		grow_badblocks(disk);
+
 		if (rdev->in_sync) {
 			char b[BDEVNAME_SIZE];
 			printk(KERN_INFO "raid5: device %s operational as raid"
@@ -1775,6 +2283,8 @@
 	}
 
 	conf->raid_disks = mddev->raid_disks;
+	conf->mirrorit = -1;
+	mddev->eeh_thread = NULL; /* just to be sure */
 	/*
 	 * 0 for a fully functional array, 1 for a degraded array.
*/
@@ -1825,7 +2335,7 @@
 	}
 	}
 	memory = conf->max_nr_stripes * (sizeof(struct stripe_head) +
-		 conf->raid_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;
+		 (conf->raid_disks+1) * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;
 	if (grow_stripes(conf, conf->max_nr_stripes)) {
 		printk(KERN_ERR
 			"raid5: couldn't allocate %dkB for buffers\n", memory);
@@ -1887,10 +2397,19 @@
 static int stop (mddev_t *mddev)
 {
 	raid5_conf_t *conf = (raid5_conf_t *) mddev->private;
+	int i;
 
+	/* may be blocked in user-space, kill it */
+	if (mddev->eeh_thread) {
+		md_unregister_thread(mddev->eeh_thread);
+		mddev->eeh_thread = NULL;
+	}
 	md_unregister_thread(mddev->thread);
 	mddev->thread = NULL;
 	shrink_stripes(conf);
+	for (i = conf->raid_disks; i--; )
+		if (conf->disks[i].rdev && conf->disks[i].rdev->in_sync)
+			shrink_badblocks(&conf->disks[i]);
 	free_pages((unsigned long) conf->stripe_hashtbl, HASH_PAGES_ORDER);
 	blk_sync_queue(mddev->queue); /* the unplug fn references 'conf' */
 	kfree(conf);
@@ -1936,7 +2455,9 @@
 static void status (struct seq_file *seq, mddev_t *mddev)
 {
 	raid5_conf_t *conf = (raid5_conf_t *) mddev->private;
-	int i;
+	int i, j;
+	char b[BDEVNAME_SIZE];
+	struct badblock *bb;
 
 	seq_printf (seq, " level %d, %dk chunk, algorithm %d", mddev->level, mddev->chunk_size >> 10, mddev->layout);
 	seq_printf (seq, " [%d/%d] [", conf->raid_disks, conf->working_disks);
@@ -1949,6 +2470,20 @@
 #define D(x) \
 	seq_printf (seq, "<"#x":%d>", atomic_read(&conf->x))
 	printall(conf);
+
+	spin_lock_irq(&conf->device_lock); /* acceptable for a debug path */
+	seq_printf (seq, "\n known bad sectors on active devices:");
+	for (i = conf->raid_disks; i--; ) {
+		if (conf->disks[i].rdev) {
+			seq_printf (seq, "\n %s", bdevname(conf->disks[i].rdev->bdev, b));
+			for (j = 0; j < BB_NR_HASH; j++) {
+				bb = conf->disks[i].badblock_hashtbl[j];
+				for (; bb; bb = bb->hash_next)
+					seq_printf (seq, " %llu-%llu", bb->sector,
						bb->sector + (unsigned long long)(STRIPE_SIZE / 512) - 1);
+			}
+		}
+	}
+	spin_unlock_irq(&conf->device_lock);
 #endif
 }
@@ -1992,6 +2527,16 @@
 			tmp->rdev->in_sync = 1;
 		}
 	}
+	tmp = conf->disks + i;
+	if (tmp->rdev && !tmp->rdev->faulty && !tmp->rdev->in_sync) {
+		tmp->rdev->in_sync = 1;
+
+		printk(KERN_NOTICE "raid5_spare_active: %d in_sync %d->%d\n",
+			i, tmp->rdev->raid_disk, conf->mirrorit);
+
+		/* scary..? :} */
+		tmp->rdev->raid_disk = conf->mirrorit;
+	}
 	print_raid5_conf(conf);
 	return 0;
 }
@@ -2005,6 +2550,7 @@
 	print_raid5_conf(conf);
 	rdev = p->rdev;
+	printk("raid5_remove_disk %d\n", number);
 	if (rdev) {
 		if (rdev->in_sync ||
 		    atomic_read(&rdev->nr_pending)) {
@@ -2018,6 +2564,17 @@
 			err = -EBUSY;
 			p->rdev = rdev;
 		}
+		if (!err) {
+			shrink_badblocks(p);
+
+			/* stopped by IO error.. */
+			if (conf->mirrorit != -1
+			    && conf->disks[conf->raid_disks].rdev == NULL) {
+				printk(KERN_INFO "raid5_remove_disk: IO error on proactive mirroring of %d!\n",
+					conf->mirrorit);
+				conf->mirrorit = -1;
+			}
+		}
 	}
 abort:
@@ -2049,6 +2606,29 @@
 			p->rdev = rdev;
 			break;
 		}
+
+	if (!found && conf->disks[disk].rdev == NULL) {
+		char b[BDEVNAME_SIZE];
+
+		/* array optimal, this should be the 'active spare' added by eeh_thread/error() */
+		conf->disks[disk].rdev = rdev;
+		rdev->in_sync = 0;
+		rdev->raid_disk = conf->raid_disks;
+		conf->fullsync = 1;
+
+		if (mddev->degraded) /* if we're here and it's true, we were called after error() */
+			mddev->degraded--;
+		else
+			conf->mirrorit = mddev->eeh_data.failed_num;
+		found = 1;
+
+		printk(KERN_NOTICE "added spare for proactive replacement of %s\n",
+			bdevname(conf->disks[conf->mirrorit].rdev->bdev, b));
+	}
+	if (found)
+		grow_badblocks(&conf->disks[disk]);
+	printk(KERN_INFO "raid5_add_disk: %d (%d) in_sync: %d\n",
+		disk, found, found ? rdev->in_sync : -1);
+
 	print_raid5_conf(conf);
 	return found;
 }