*biiiiig* thanks to all developers of bitmap-based raid5 resyncing and bad block rewriting, both of them are really great features! :) I ported my "proactive thing" to the new kernel; it lives nicely with the new features now. Some bugs were hunted down in the last month, and it's quite stable for me. If you want to give it a try you must apply Neil's badblock rewriting patch first, attached also.

*readme*

This is a feature patch that implements 'proactive raid5 disk replacement' (http://www.arctic.org/~dean/raid-wishlist.html), which can help a lot on large raid5 arrays built from cheap SATA drives where the IO traffic is so heavy that a daily media scan of the disks isn't possible.

A typical breakdown situation: a drive gets kicked from the array due to a bad block, I replace it, but the resync fails because another 2-3 disks have hidden bad blocks too. In that situation I have to save the disks with dd and rebuild the bad blocks with a userspace tool (by hand), and meanwhile the site is down for hours. This patch tries to give a solution to this problem. Its two main features are:

1. Don't kick a drive on a read error, because it is possible that 99.99% of it is still usable and it will help (to serve and to save data) if another drive shows bad sectors in the same array. Neil's new (experimental) sector rewrite feature is included; the first step is always an attempt to rewrite the bad sector.

2. Allow mirroring a partially failed drive to a spare _online_, and replace the source of the mirror with the spare when it's done. Bad blocks aren't a problem unless the same stripe is damaged on two disks, which is a rare case. This way it is possible to fix an array with partially failed drives without data loss and without downtime. In other words, you never have to degrade the array for a disk change; you can do it in optimal state.
A per-device bad block cache is implemented to speed up arrays with partially failed drives (replies from those are often slow). It also helps to identify badly damaged drives by their number of bad blocks, and an action can be taken if that count steps over a user-defined threshold (see /proc/sys/dev/raid/badblock_tolerance). A successful rewrite of a bad block deletes its entry from the cache. Performance is affected only a little while there are no (or few) registered bad blocks, but over a million entries it could currently become a problem.

Some words about error handling. The first big change is that you can now use an external error handler, meaning a user-space script is called by the kernel to handle the situation. The common method in this script is to call 'mdadm' and choose a return value (see below). This is good, for example, if you have one spare drive shared between two arrays; a script can handle that nicely. If the script failed to run (or does not exist), a default algorithm applies, with these main guidelines:

- a "disk fail" means the disk stepped over the badblock threshold or failed on a write
- if a drive fails in an optimal array and there is no spare, the disk is kicked from the array
- if a drive fails in a degraded array, the drive is _not_ kicked; processes get read/write errors if data is needed from the damaged sectors. If you want the old behavior, use an external error handler
- if a drive fails and there is a spare, proactive mirroring to the spare begins; the failing drive won't be kicked until the mirror is done

Well, better if you know: it's an ugly hack, I'm not a kernel guru, but I love the idea and now I can't live without it on my own servers (so it works for me). I hope somebody will implement this feature in a much nicer adaptation one day; I'll try to maintain this patch till then..
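For example, the threshold can be inspected and tuned through the sysctl this patch adds (the values below are illustrative, and the /proc path only exists on a patched kernel):

```shell
# read the current threshold (the patch defaults it to 10000 blocks)
cat /proc/sys/dev/raid/badblock_tolerance
# fail drives earlier, after 2000 cached bad blocks
echo 2000 > /proc/sys/dev/raid/badblock_tolerance
```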
You should put your external error handler script at "/sbin/mdevent"; it gets the following arguments:

1st: name of the md array (e.g. "md0")
2nd: kind of the fail event as a string, currently always "drivefail"
3rd: name of the drive (maybe major/minor numbers would be better; currently you can translate to those via /proc/partitions)

Let's see how you can handle some situations from the script. The array is optimal and a disk fails; you want to..

..fail that drive and add a spare for normal rebuilding:
	mdadm -f /dev/$1 /dev/$3
	mdadm -a /dev/$1 /dev/my_spare1
	exit 0

..start proactive mirroring of that disk:
	mdadm -a /dev/$1 /dev/my_spare1
	exit 0

..keep it on and reset the badblock cache:
	exit 1

..just keep it in sync:
	exit 0

..let the default action run:
	exit 2

Notice that when the proactive mirroring is done, the spare won't replace the source drive automatically; you should do that by hand or by a scheduled task. You get a last chance to re-think it. (raid6 could be another solution for this problem, but that's the big far evil in my eyes ;)

use:

1. patch the kernel; this one is against 2.6.14
2. type:

# make the drives
mdadm -B -n1 -l faulty -c4 /dev/md/1 /dev/rd/0
mdadm -B -n1 -l faulty -c4 /dev/md/2 /dev/rd/1
mdadm -B -n1 -l faulty -c4 /dev/md/3 /dev/rd/2
# make the array
mdadm -C -n3 -l5 /dev/md/0 /dev/md/1 /dev/md/2 /dev/md/3
# .. wait for sync ..
# grow bad blocks as ma*tor does :)
mdadm --grow -l faulty -p rp454 /dev/md/1
mdadm --grow -l faulty -p rp738 /dev/md/2
# add a spare
mdadm -a /dev/md/0 /dev/rd/4
# -> fail a drive, sync begins <-
# md/1 will not be marked as failed, this is the point, but
# if you want to, you can issue this command again
mdadm -f /dev/md/0 /dev/md/1
# kernel:
#  resync from md1 to spare ram4
#  added spare for active resync
# .. wonder at the read errors from md[12] while the sync goes on!
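The fragments above can be assembled into one complete handler. A minimal sketch follows, written as a shell function so it can be dry-run; the spare device path /dev/my_spare1 and the MDADM override variable are assumptions, and the return codes follow the convention described above:

```shell
# Sketch of the body of a hypothetical /sbin/mdevent. The kernel passes:
#   $1 = md array name, $2 = event kind ("drivefail"), $3 = drive name
mdevent_handler() {
    md="$1"; event="$2"; drive="$3"
    mdadm_cmd="${MDADM:-mdadm}"    # assumption: set MDADM=echo for a dry run
    spare="/dev/my_spare1"         # assumption: site-specific spare device

    case "$event" in
    drivefail)
        # start proactive mirroring of the failing drive onto the spare;
        # returning 0 tells the kernel to keep the drive in sync
        "$mdadm_cmd" -a "/dev/$md" "$spare" || return 2
        return 0
        ;;
    *)
        return 2    # unknown event: fall back to the kernel's default action
        ;;
    esac
}
```

To install it, put the function in /sbin/mdevent with a final line `mdevent_handler "$@"; exit $?` and make it executable.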
# feel free to stress the md at this time: mkfs, dd, badblocks, etc
# kernel:
#  raid5_spare_active: 3 in_sync 3->0
# /proc/mdstat:
#  md0 : active raid5 ram4[0] md3[2] md2[1] md1[0]
# -> ram4 and md1 have the same id; this means the spare is a complete mirror.
# If you stop the array you can assemble it with ram4 instead of md1;
# the superblock is the same on both.

# check the mirror (stop any write stress first)
mdadm --grow -l faulty -p none /dev/md/1
cmp /dev/md/1 /dev/rd/4

# hot-replace the mirrored (partially failed) device with the active spare
# (yes, mark it as failed again; if there is a syncing or synced 'active spare',
# -f really fails the device or replaces it with the synced spare)
mdadm -f /dev/md/0 /dev/md/1
# kernel:
#  replace md1 with in_sync active spare ram4
# and voila!
# /proc/mdstat:
#  md0 : active raid5 ram4[0] md3[2] md2[1]

-- dap
diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~	2005-09-16 12:21:24.000000000 +1000
+++ ./drivers/md/raid5.c	2005-09-16 12:57:12.000000000 +1000
@@ -349,7 +349,7 @@ static void shrink_stripes(raid5_conf_t
 	conf->slab_cache = NULL;
 }

-static int raid5_end_read_request (struct bio * bi, unsigned int bytes_done,
+static int raid5_end_read_request(struct bio * bi, unsigned int bytes_done,
 				   int error)
 {
 	struct stripe_head *sh = bi->bi_private;
@@ -401,10 +401,27 @@ static int raid5_end_read_request (struc
 	}
 #else
 	set_bit(R5_UPTODATE, &sh->dev[i].flags);
-#endif
+#endif
+		if (test_bit(R5_ReadError, &sh->dev[i].flags)) {
+			printk("R5: read error corrected!!\n");
+			clear_bit(R5_ReadError, &sh->dev[i].flags);
+			clear_bit(R5_ReWrite, &sh->dev[i].flags);
+		}
 	} else {
-		md_error(conf->mddev, conf->disks[i].rdev);
 		clear_bit(R5_UPTODATE, &sh->dev[i].flags);
+		if (conf->mddev->degraded) {
+			printk("R5: read error not correctable.\n");
+			clear_bit(R5_ReadError, &sh->dev[i].flags);
+			clear_bit(R5_ReWrite, &sh->dev[i].flags);
+			md_error(conf->mddev, conf->disks[i].rdev);
+		} else if (test_bit(R5_ReWrite, &sh->dev[i].flags)) {
+			/* Oh, no!!! */
+			printk("R5: read error NOT corrected!!\n");
+			clear_bit(R5_ReadError, &sh->dev[i].flags);
+			clear_bit(R5_ReWrite, &sh->dev[i].flags);
+			md_error(conf->mddev, conf->disks[i].rdev);
+		} else
+			set_bit(R5_ReadError, &sh->dev[i].flags);
 	}
 	rdev_dec_pending(conf->disks[i].rdev, conf->mddev);
 #if 0
@@ -966,6 +983,12 @@ static void handle_stripe(struct stripe_
 		if (dev->written) written++;
 		rdev = conf->disks[i].rdev; /* FIXME, should I be looking rdev */
 		if (!rdev || !rdev->in_sync) {
+			/* The ReadError flag wil just be confusing now */
+			clear_bit(R5_ReadError, &dev->flags);
+			clear_bit(R5_ReWrite, &dev->flags);
+		}
+		if (!rdev || !rdev->in_sync
+		    || test_bit(R5_ReadError, &dev->flags)) {
 			failed++;
 			failed_num = i;
 		} else
@@ -980,6 +1003,14 @@ static void handle_stripe(struct stripe_
 	if (failed > 1 && to_read+to_write+written) {
 		for (i=disks; i--; ) {
 			int bitmap_end = 0;
+
+			if (test_bit(R5_ReadError, &sh->dev[i].flags)) {
+				mdk_rdev_t *rdev = conf->disks[i].rdev;
+				if (rdev && rdev->in_sync)
+					/* multiple read failures in one stripe */
+					md_error(conf->mddev, rdev);
+			}
+
 			spin_lock_irq(&conf->device_lock);
 			/* fail all writes first */
 			bi = sh->dev[i].towrite;
@@ -1015,7 +1046,8 @@ static void handle_stripe(struct stripe_
 			}

 			/* fail any reads if this device is non-operational */
-			if (!test_bit(R5_Insync, &sh->dev[i].flags)) {
+			if (!test_bit(R5_Insync, &sh->dev[i].flags) ||
+			    test_bit(R5_ReadError, &sh->dev[i].flags)) {
 				bi = sh->dev[i].toread;
 				sh->dev[i].toread = NULL;
 				if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
@@ -1274,7 +1306,26 @@ static void handle_stripe(struct stripe_
 			md_done_sync(conf->mddev, STRIPE_SECTORS,1);
 			clear_bit(STRIPE_SYNCING, &sh->state);
 		}
-
+
+	/* If the failed drive is just a ReadError, then we might need to progress
+	 * the repair/check process
+	 */
+	if (failed == 1 && test_bit(R5_ReadError, &sh->dev[failed_num].flags)
+	    && !test_bit(R5_LOCKED, &sh->dev[failed_num].flags)
+	    && test_bit(R5_UPTODATE, &sh->dev[failed_num].flags)
+	    ) {
+		dev = &sh->dev[failed_num];
+		if (!test_bit(R5_ReWrite, &dev->flags)) {
+			set_bit(R5_Wantwrite, &dev->flags);
+			set_bit(R5_ReWrite, &dev->flags);
+			set_bit(R5_LOCKED, &dev->flags);
+		} else {
+			/* let's read it back */
+			set_bit(R5_Wantread, &dev->flags);
+			set_bit(R5_LOCKED, &dev->flags);
+		}
+	}
+
 	spin_unlock(&sh->lock);

 	while ((bi=return_bi)) {

diff ./include/linux/raid/raid5.h~current~ ./include/linux/raid/raid5.h
--- ./include/linux/raid/raid5.h~current~	2005-09-16 12:21:24.000000000 +1000
+++ ./include/linux/raid/raid5.h	2005-09-16 12:55:51.000000000 +1000
@@ -154,6 +154,8 @@ struct stripe_head {
 #define	R5_Wantwrite	5
 #define	R5_Syncio	6	/* this io need to be accounted as resync io */
 #define	R5_Overlap	7	/* There is a pending overlapping request on this block */
+#define	R5_ReadError	8	/* seen a read error here recently */
+#define	R5_ReWrite	9	/* have tried to over-write the readerror */

 /*
  * Write method
--- linux/include/linux/sysctl.h.orig	2005-11-08 14:41:06.000000000 +0100
+++ linux/include/linux/sysctl.h	2005-11-09 20:08:51.000000000 +0100
@@ -758,7 +758,8 @@
 /* /proc/sys/dev/raid */
 enum {
 	DEV_RAID_SPEED_LIMIT_MIN=1,
-	DEV_RAID_SPEED_LIMIT_MAX=2
+	DEV_RAID_SPEED_LIMIT_MAX=2,
+	DEV_RAID_BADBLOCK_TOLERANCE=3
 };

 /* /proc/sys/dev/parport/default */
--- linux/include/linux/raid/md_k.h.orig	2005-10-28 02:02:08.000000000 +0200
+++ linux/include/linux/raid/md_k.h	2005-11-09 20:06:02.000000000 +0100
@@ -165,6 +165,11 @@
 	char				uuid[16];

 	struct mdk_thread_s		*thread;	/* management thread */
+	struct mdk_thread_s		*eeh_thread;	/* external error handler */
+	struct eeh_data {
+		int	failed_num;			/* drive # */
+	} eeh_data;
+
 	struct mdk_thread_s		*sync_thread;	/* doing resync or reconstruct */
 	sector_t			curr_resync;	/* blocks scheduled */
 	unsigned long			resync_mark;	/* a recent timestamp */
--- linux/include/linux/raid/raid5.h.orig	2005-11-08 18:26:48.000000000 +0100
+++ linux/include/linux/raid/raid5.h	2005-11-09 22:27:58.000000000 +0100
@@ -156,6 +156,7 @@
 #define	R5_Overlap	7	/* There is a pending overlapping request on this block */
 #define	R5_ReadError	8	/* seen a read error here recently */
 #define	R5_ReWrite	9	/* have tried to over-write the readerror */
+#define	R5_HardReadErr	10	/* rewrite failed, put into badblocks list */

 /*
  * Write method
@@ -200,8 +201,16 @@
  */

+struct badblock {
+	struct badblock	*hash_next, **hash_pprev;	/* hash pointers */
+	sector_t	sector;				/* stripe # */
+};
+
 struct disk_info {
 	mdk_rdev_t	*rdev;
+	struct badblock	**badblock_hashtbl;	/* list of known badblocks */
+	char		cache_name[20];
+	kmem_cache_t	*slab_cache;		/* badblock db */
 };

 struct raid5_private_data {
@@ -238,6 +247,8 @@
 	int			inactive_blocked;	/* release of inactive stripes blocked,
 							 * waiting for 25% to be free
 							 */
+	int			mirrorit;	/* source for active spare resync */
+
 	spinlock_t		device_lock;
 	struct disk_info	disks[0];
 };
--- linux/drivers/md/md.c.orig	2005-10-28 02:02:08.000000000 +0200
+++ linux/drivers/md/md.c	2005-11-09 20:18:39.000000000 +0100
@@ -85,6 +85,10 @@
 static int sysctl_speed_limit_min = 1000;
 static int sysctl_speed_limit_max = 200000;

+/* the drive'll be marked failed over this threshold. measure is block. */
+int sysctl_badblock_tolerance = 10000;
+
+
 static struct ctl_table_header *raid_table_header;

 static ctl_table raid_table[] = {
@@ -104,6 +108,14 @@
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec,
 	},
+	{
+		.ctl_name	= DEV_RAID_BADBLOCK_TOLERANCE,
+		.procname	= "badblock_tolerance",
+		.data		= &sysctl_badblock_tolerance,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
 	{ .ctl_name = 0 }
 };

@@ -4097,6 +4109,8 @@
 EXPORT_SYMBOL(md_wakeup_thread);
 EXPORT_SYMBOL(md_print_devices);
 EXPORT_SYMBOL(md_check_recovery);
+EXPORT_SYMBOL(kick_rdev_from_array); // fixme
+EXPORT_SYMBOL(sysctl_badblock_tolerance);
 MODULE_LICENSE("GPL");
 MODULE_ALIAS("md");
 MODULE_ALIAS_BLOCKDEV_MAJOR(MD_MAJOR);
--- linux/drivers/md/raid5.c.orig	2005-11-08 18:26:48.000000000 +0100
+++ linux/drivers/md/raid5.c	2005-11-10 02:32:52.000000000 +0100
@@ -42,6 +42,18 @@

 #define stripe_hash(conf, sect)	((conf)->stripe_hashtbl[((sect) >> STRIPE_SHIFT) & HASH_MASK])

+/*
+ * per-device badblock cache
+ */
+
+#define BB_SHIFT	(PAGE_SHIFT/*12*/ - 9)
+#define BB_HASH_PAGES	1
+#define BB_NR_HASH	(HASH_PAGES * PAGE_SIZE / sizeof(struct badblock *))
+#define BB_HASH_MASK	(BB_NR_HASH - 1)
+
+#define bb_hash(disk, sect)	((disk)->badblock_hashtbl[((sect) >> BB_SHIFT) & BB_HASH_MASK])
+#define bb_hashnr(sect)		(((sect) >> BB_SHIFT) & BB_HASH_MASK)
+
 /* bio's attached to a stripe+device for I/O are linked together in bi_sector
  * order without overlap.  There may be several bio's per stripe+device, and
  * a bio could span several devices.
@@ -55,7 +67,7 @@ /* * The following can be used to debug the driver */ -#define RAID5_DEBUG 0 +#define RAID5_DEBUG 1 #define RAID5_PARANOIA 1 #if RAID5_PARANOIA && defined(CONFIG_SMP) # define CHECK_DEVLOCK() assert_spin_locked(&conf->device_lock) @@ -63,13 +75,162 @@ # define CHECK_DEVLOCK() #endif -#define PRINTK(x...) ((void)(RAID5_DEBUG && printk(x))) +/* use External Error Handler? */ +#define USEREH 1 + +#define PRINTK(x...) ((void)(RAID5_DEBUG && printk(KERN_DEBUG x))) #if RAID5_DEBUG #define inline #define __inline__ #endif static void print_raid5_conf (raid5_conf_t *conf); +extern int sysctl_badblock_tolerance; + + +static void bb_insert_hash(struct disk_info *disk, struct badblock *bb) +{ + struct badblock **bbp = &bb_hash(disk, bb->sector); + + /*printk("bb_insert_hash(), sector %llu hashnr %lu\n", (unsigned long long)bb->sector, + bb_hashnr(bb->sector));*/ + + if ((bb->hash_next = *bbp) != NULL) + (*bbp)->hash_pprev = &bb->hash_next; + *bbp = bb; + bb->hash_pprev = bbp; +} + +static void bb_remove_hash(struct badblock *bb) +{ + /*printk("remove_hash(), sector %llu hashnr %lu\n", (unsigned long long)bb->sector, + bb_hashnr(bb->sector));*/ + + if (bb->hash_pprev) { + if (bb->hash_next) + bb->hash_next->hash_pprev = bb->hash_pprev; + *bb->hash_pprev = bb->hash_next; + bb->hash_pprev = NULL; + } +} + +static struct badblock *__find_badblock(struct disk_info *disk, sector_t sector) +{ + struct badblock *bb; + + for (bb = bb_hash(disk, sector); bb; bb = bb->hash_next) + if (bb->sector == sector) + return bb; + return NULL; +} + +static struct badblock *find_badblock(struct disk_info *disk, sector_t sector) +{ + raid5_conf_t *conf = (raid5_conf_t *) disk->rdev->mddev->private; + struct badblock *bb; + + spin_lock_irq(&conf->device_lock); + bb = __find_badblock(disk, sector); + spin_unlock_irq(&conf->device_lock); + return bb; +} + +static unsigned long count_badblocks (struct disk_info *disk) +{ + raid5_conf_t *conf = (raid5_conf_t *) 
disk->rdev->mddev->private; + struct badblock *bb; + int j; + int n = 0; + + spin_lock_irq(&conf->device_lock); + for (j = 0; j < BB_NR_HASH; j++) { + bb = disk->badblock_hashtbl[j]; + for (; bb; bb = bb->hash_next) + n++; + } + spin_unlock_irq(&conf->device_lock); + + return n; +} + +static int grow_badblocks(struct disk_info *disk) +{ + char b[BDEVNAME_SIZE]; + kmem_cache_t *sc; + + /* hash table */ + if ((disk->badblock_hashtbl = (struct badblock **) __get_free_pages(GFP_ATOMIC, HASH_PAGES_ORDER)) == NULL) { + printk("grow_badblocks: __get_free_pages failed\n"); + return 0; + } + memset(disk->badblock_hashtbl, 0, BB_HASH_PAGES * PAGE_SIZE); + + /* badblocks db */ + sprintf(disk->cache_name, "raid5/%s_%s_bbc", mdname(disk->rdev->mddev), + bdevname(disk->rdev->bdev, b)); + sc = kmem_cache_create(disk->cache_name, + sizeof(struct badblock), + 0, 0, NULL, NULL); + if (!sc) { + printk("grow_badblocks: kmem_cache_create failed\n"); + return 1; + } + disk->slab_cache = sc; + + return 0; +} + +static void shrink_badblocks(struct disk_info *disk) +{ + struct badblock *bb; + int j; + + /* badblocks db */ + for (j = 0; j < BB_NR_HASH; j++) { + bb = disk->badblock_hashtbl[j]; + for (; bb; bb = bb->hash_next) + kmem_cache_free(disk->slab_cache, bb); + } + kmem_cache_destroy(disk->slab_cache); + disk->slab_cache = NULL; + + /* hash table */ + free_pages((unsigned long) disk->badblock_hashtbl, HASH_PAGES_ORDER); +} + +static void store_badblock(struct disk_info *disk, sector_t sector) +{ + struct badblock *bb; + raid5_conf_t *conf = (raid5_conf_t *) disk->rdev->mddev->private; + + bb = kmem_cache_alloc(disk->slab_cache, GFP_KERNEL); + if (!bb) { + printk("store_badblock: kmem_cache_alloc failed\n"); + return; + } + memset(bb, 0, sizeof(*bb)); + bb->sector = sector; + + spin_lock_irq(&conf->device_lock); + bb_insert_hash(disk, bb); + spin_unlock_irq(&conf->device_lock); +} + +static void delete_badblock(struct disk_info *disk, sector_t sector) +{ + struct badblock *bb; + 
raid5_conf_t *conf = (raid5_conf_t *) disk->rdev->mddev->private; + + bb = find_badblock(disk, sector); + if (!bb) + /* reset on write'll call us like an idiot :} */ + return; + spin_lock_irq(&conf->device_lock); + bb_remove_hash(bb); + kmem_cache_free(disk->slab_cache, bb); + spin_unlock_irq(&conf->device_lock); +} + static inline void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh) { @@ -208,7 +369,7 @@ sh->pd_idx = pd_idx; sh->state = 0; - for (i=disks; i--; ) { + for (i=disks+1; i--; ) { struct r5dev *dev = &sh->dev[i]; if (dev->toread || dev->towrite || dev->written || @@ -301,8 +462,10 @@ sprintf(conf->cache_name, "raid5/%s", mdname(conf->mddev)); + /* +1: we need extra space in the *sh->devs for the 'active spare' to keep + handle_stripe() simple */ sc = kmem_cache_create(conf->cache_name, - sizeof(struct stripe_head)+(devs-1)*sizeof(struct r5dev), + sizeof(struct stripe_head)+(devs-1+1)*sizeof(struct r5dev), 0, 0, NULL, NULL); if (!sc) return 1; @@ -311,12 +474,12 @@ sh = kmem_cache_alloc(sc, GFP_KERNEL); if (!sh) return 1; - memset(sh, 0, sizeof(*sh) + (devs-1)*sizeof(struct r5dev)); + memset(sh, 0, sizeof(*sh) + (devs-1+1)*sizeof(struct r5dev)); sh->raid_conf = conf; spin_lock_init(&sh->lock); - if (grow_buffers(sh, conf->raid_disks)) { - shrink_buffers(sh, conf->raid_disks); + if (grow_buffers(sh, conf->raid_disks+1)) { + shrink_buffers(sh, conf->raid_disks+1); kmem_cache_free(sc, sh); return 1; } @@ -408,18 +571,40 @@ clear_bit(R5_ReWrite, &sh->dev[i].flags); } } else { + int keepon = 0; + clear_bit(R5_UPTODATE, &sh->dev[i].flags); + /* + rule 1.,: try to keep all disk in_sync even if we've got + unfixable read errors, cause the 'active spare' may can + rebuild a complete column from partially failed drives + */ + if (conf->disks[i].rdev->in_sync) { + char b[BDEVNAME_SIZE]; + printk(KERN_ALERT + "raid5_end_read_request: Read failure %s on sector %llu (%d) in %s mode\n", + bdevname(conf->disks[i].rdev->bdev, b), + (unsigned long 
long)sh->sector, atomic_read(&sh->count), + conf->working_disks >= conf->raid_disks ? "optimal" : "degraded"); + keepon++; + } if (conf->mddev->degraded) { printk("R5: read error not correctable.\n"); clear_bit(R5_ReadError, &sh->dev[i].flags); clear_bit(R5_ReWrite, &sh->dev[i].flags); - md_error(conf->mddev, conf->disks[i].rdev); + if (!keepon) + md_error(conf->mddev, conf->disks[i].rdev); + else + set_bit(R5_HardReadErr, &sh->dev[i].flags); } else if (test_bit(R5_ReWrite, &sh->dev[i].flags)) { /* Oh, no!!! */ printk("R5: read error NOT corrected!!\n"); clear_bit(R5_ReadError, &sh->dev[i].flags); clear_bit(R5_ReWrite, &sh->dev[i].flags); - md_error(conf->mddev, conf->disks[i].rdev); + if (!keepon) + md_error(conf->mddev, conf->disks[i].rdev); + else + set_bit(R5_HardReadErr, &sh->dev[i].flags); } else set_bit(R5_ReadError, &sh->dev[i].flags); } @@ -457,13 +642,18 @@ PRINTK("end_write_request %llu/%d, count %d, uptodate: %d.\n", (unsigned long long)sh->sector, i, atomic_read(&sh->count), uptodate); + /* sorry if (i == disks) { BUG(); return 0; - } + }*/ spin_lock_irqsave(&conf->device_lock, flags); if (!uptodate) + /* we must fail this drive, cause risks the integrity of data + if this sector is readable. later, we could check + is it this readable, if not, then we can handle it as a + common badblock. */ md_error(conf->mddev, conf->disks[i].rdev); rdev_dec_pending(conf->disks[i].rdev, conf->mddev); @@ -494,33 +684,154 @@ dev->req.bi_private = sh; dev->flags = 0; - if (i != sh->pd_idx) + if (i != sh->pd_idx && i < sh->raid_conf->raid_disks) /* active spare? 
*/ dev->sector = compute_blocknr(sh, i); } +static int raid5_remove_disk(mddev_t *mddev, int number); +static int raid5_add_disk(mddev_t *mddev, mdk_rdev_t *rdev); +/*static*/ void kick_rdev_from_array(mdk_rdev_t * rdev); static void error(mddev_t *mddev, mdk_rdev_t *rdev) { char b[BDEVNAME_SIZE]; + char b2[BDEVNAME_SIZE]; raid5_conf_t *conf = (raid5_conf_t *) mddev->private; PRINTK("raid5: error called\n"); if (!rdev->faulty) { - mddev->sb_dirty = 1; - if (rdev->in_sync) { - conf->working_disks--; - mddev->degraded++; - conf->failed_disks++; - rdev->in_sync = 0; - /* - * if recovery was running, make sure it aborts. - */ - set_bit(MD_RECOVERY_ERR, &mddev->recovery); - } - rdev->faulty = 1; - printk (KERN_ALERT - "raid5: Disk failure on %s, disabling device." - " Operation continuing on %d devices\n", - bdevname(rdev->bdev,b), conf->working_disks); + int mddisks = 0; + mdk_rdev_t *rd; + mdk_rdev_t *rdevs = NULL; + struct list_head *rtmp; + int i; + + ITERATE_RDEV(mddev,rd,rtmp) + { + printk("mddev%d: %s\n", mddisks, bdevname(rd->bdev,b)); + mddisks++; + } + for (i = 0; i < mddisks && (rd = conf->disks[i].rdev); i++) { + printk("r5dev%d: %s\n", i, bdevname(rd->bdev,b)); + } + ITERATE_RDEV(mddev,rd,rtmp) + { + rdevs = rd; + break; + } +printk("II %d %d > %d %d ins:%d %p\n", + mddev->raid_disks, mddisks, conf->raid_disks, mddev->degraded, rdev->in_sync, rdevs); + if (conf->disks[conf->raid_disks].rdev == rdev + && conf->mirrorit != -1) { + /* in_sync, but must be handled specially, don't let 'degraded++' */ + printk (KERN_ALERT "active spare has failed %s (in_sync %d)\n", + bdevname(rdev->bdev,b), rdev->in_sync); + mddev->sb_dirty = 1; + if (rdev->in_sync) + rdev->raid_disk = conf->raid_disks; /* me as myself, again ;) */ + rdev->in_sync = 0; + rdev->faulty = 1; + conf->mirrorit = -1; + } else if (mddisks > conf->raid_disks && !mddev->degraded && rdev->in_sync) { + /* have active spare, array is optimal, removed disk member + of it (but not the active spare) */ + if 
(rdev->raid_disk == conf->mirrorit && conf->disks[conf->raid_disks].rdev) { + if (!conf->disks[conf->raid_disks].rdev->in_sync) { + printk(KERN_ALERT "disk %s failed and active spare isn't in_sync yet, readd as normal spare\n", + bdevname(rdev->bdev,b)); + conf->mirrorit = -1; + goto letitgo; + } else { + int ret; + + /* hot replace the mirrored drive with the 'active spare' + this is really "hot", I can't see clearly the things + what I have to do here. :} + pray. */ + + printk(KERN_ALERT "replace %s with in_sync active spare %s\n", + bdevname(rdev->bdev,b), + bdevname(rdevs->bdev,b2)); + rdev->in_sync = 0; + rdev->faulty = 1; + + conf->mirrorit = -1; + + /* my God, am I sane? */ + while ((i = atomic_read(&rdev->nr_pending))) { + printk("waiting for disk %d .. %d\n", + rdev->raid_disk, i); + } + ret = raid5_remove_disk(mddev, rdev->raid_disk); + if (ret) { + printk(KERN_ERR "raid5_remove_disk1: busy?!\n"); + return; // should nothing to do + } + + rd = conf->disks[conf->raid_disks].rdev; + while ((i = atomic_read(&rd->nr_pending))) { + printk("waiting for disk %d .. %d\n", + conf->raid_disks, i); + } + rd->in_sync = 0; + ret = raid5_remove_disk(mddev, conf->raid_disks); + if (ret) { + printk(KERN_ERR "raid5_remove_disk2: busy?!\n"); + return; // .. + } + + ret = raid5_add_disk(mddev, rd); + if (!ret) { + printk(KERN_ERR "raid5_add_disk: no free slot?!\n"); + return; // .. 
+ } + rd->in_sync = 1; + + /* borrowed from hot_remove_disk() */ + kick_rdev_from_array(rdev); + mddev->sb_dirty = 1; + } + } else { + /* in_sync disk failed (!degraded), have a spare, starting + proactive mirroring */ + if (conf->mirrorit == -1) { + printk(KERN_ALERT "resync from %s to spare %s (%d)\n", + bdevname(rdev->bdev,b), + bdevname(rdevs->bdev,b2), + conf->raid_disks); + + conf->mirrorit = rdev->raid_disk; + + mddev->degraded++; /* to call raid5_hot_add_disk(), reset there */ + } else { + printk(KERN_ALERT "proactive mirroring is active, let this device go\n"); + goto letitgo; + } + } + } else { +letitgo: + mddev->sb_dirty = 1; + if (rdev->in_sync) { + conf->working_disks--; + mddev->degraded++; + conf->failed_disks++; + rdev->in_sync = 0; + /* error() was not called if the syncing was stopped by IO error */ + if (conf->mirrorit != -1 && + !conf->disks[conf->raid_disks].rdev->in_sync) { + printk(KERN_NOTICE "stop proactive mirroring\n"); + conf->mirrorit = -1; + } + /* + * if recovery was running, make sure it aborts. + */ + set_bit(MD_RECOVERY_ERR, &mddev->recovery); + } + rdev->faulty = 1; + printk (KERN_ALERT + "raid5: Disk failure on %s, disabling device." + " Operation continuing on %d devices\n", + bdevname(rdev->bdev,b), conf->working_disks); + } } } @@ -896,6 +1207,74 @@ } +static int raid5_spare_active(mddev_t *mddev); + +static void raid5_eeh (mddev_t *mddev) +{ + raid5_conf_t *conf = mddev_to_conf(mddev); + int i = conf->mddev->eeh_data.failed_num; + struct disk_info *disk = &conf->disks[i]; + char b[BDEVNAME_SIZE]; + static char *envp[] = { "HOME=/", + "TERM=linux", + "PATH=/sbin:/usr/sbin:/bin:/usr/bin", + NULL }; + int ret; + int j; + + /* suspend IO; todo: well, we should walk over on disks and waiting till + (nr_pending > 0) */ + printk("raid5_usereh active [%d, %x]\n", i, disk->rdev); + + if (i < 0 || !disk->rdev) { + // fixme: why called on md_unregister? 
+ printk(KERN_ALERT "ERROR: !disk->rdev [%d]\n", i); + goto eeh_out; + } + + if (mddev->degraded) { + printk(KERN_ALERT "array is already degraded, don't kick this device\n"); + goto eeh_out; + } + + { + char *argv[] = { "/sbin/mdevent", mdname(mddev), "drivefail", + bdevname(disk->rdev->bdev, b), NULL }; + ret = call_usermodehelper("/sbin/mdevent", argv, envp, 1/*wait*/); + ret = ret >> 8; + if (ret < 0 || ret > 1) { + printk(KERN_ALERT "/sbin/mdevent failed: %d\n", ret); + md_error(mddev, disk->rdev); + /* (the raid5_remove_disk and raid5_add_disk wasn't called yet) */ + } + } + + switch (ret) { + case 1: /* reset badblock cache (later: rewrite bad blocks?) */ + printk(KERN_INFO "resetting badblocks cache\n"); + for (j = 0; j < BB_NR_HASH; j++) { + struct badblock *bb, *bbprev = NULL; + bb = disk->badblock_hashtbl[j]; + for (; bb; bb = bb->hash_next) { + if (bbprev) + kmem_cache_free(disk->slab_cache, bbprev); + bb_remove_hash(bb); + bbprev = bb; + } + if (bbprev) + kmem_cache_free(disk->slab_cache, bbprev); + } + break; + default: + break; + } + +eeh_out: + mddev->eeh_data.failed_num = -1; /* unregister me */ + md_wakeup_thread(mddev->thread); + printk("raid5_usereh exited\n"); +} + /* * handle_stripe - do things to a stripe. 
* @@ -925,21 +1304,37 @@ int locked=0, uptodate=0, to_read=0, to_write=0, failed=0, written=0; int non_overwrite = 0; int failed_num=0; + int aspare=0, asparenum=-1; + struct disk_info *asparedev; struct r5dev *dev; PRINTK("handling stripe %llu, cnt=%d, pd_idx=%d\n", (unsigned long long)sh->sector, atomic_read(&sh->count), sh->pd_idx); + if (conf->mddev->eeh_thread) { + PRINTK("pass the stripe, eeh is active\n"); + set_bit(STRIPE_HANDLE, &sh->state); + return; + } + spin_lock(&sh->lock); clear_bit(STRIPE_HANDLE, &sh->state); clear_bit(STRIPE_DELAYED, &sh->state); syncing = test_bit(STRIPE_SYNCING, &sh->state); + asparedev = &conf->disks[conf->raid_disks]; + if (!conf->mddev->degraded && asparedev->rdev && !asparedev->rdev->faulty && + conf->mirrorit != -1) { + aspare++; + asparenum = sh->raid_conf->mirrorit; + PRINTK("has aspare (%d)\n", asparenum); + } /* Now to look around and see what can be done */ - for (i=disks; i--; ) { + for (i=disks+aspare; i--; ) { mdk_rdev_t *rdev; + struct badblock *bb = NULL; dev = &sh->dev[i]; clear_bit(R5_Insync, &dev->flags); clear_bit(R5_Syncio, &dev->flags); @@ -982,18 +1377,79 @@ } if (dev->written) written++; rdev = conf->disks[i].rdev; /* FIXME, should I be looking rdev */ + if (rdev && rdev->in_sync && + !test_bit(R5_UPTODATE, &dev->flags) && + !test_bit(R5_LOCKED, &dev->flags)) { + /* ..potentially deserved to read, we must check it + checkme, it could be a big performance penalty if called + without a good reason! 
it's seems ok for now + */ + PRINTK("find_badblock %d: %llu\n", i, sh->sector); + bb = find_badblock(&conf->disks[i], sh->sector); + } if (!rdev || !rdev->in_sync) { /* The ReadError flag wil just be confusing now */ clear_bit(R5_ReadError, &dev->flags); clear_bit(R5_ReWrite, &dev->flags); } if (!rdev || !rdev->in_sync - || test_bit(R5_ReadError, &dev->flags)) { + || test_bit(R5_ReadError, &dev->flags) /*&& !test_bit(R5_UPTODATE, &dev->flags))*/ + || test_bit(R5_HardReadErr, &dev->flags) + || bb) { + if (rdev && rdev->in_sync + && !bb && test_bit(R5_HardReadErr, &dev->flags)) { + /* take an action only if it's a _new_ bad block + and not while proactive mirroring is running */ + + if (!aspare || (aspare && asparedev->rdev->in_sync/*asparenum != i*/)) { + /* if aspare is syncing we shouldn't register new + bad blocks, after the sync this disk will + be kicked anyway */ + + if (test_bit(R5_HardReadErr, &dev->flags)) { + PRINTK("store_badblock %d: %llu\n", i, sh->sector); + store_badblock(&conf->disks[i], sh->sector); + } + + if (count_badblocks(&conf->disks[i]) >= sysctl_badblock_tolerance) { + char b[BDEVNAME_SIZE]; + + printk(KERN_ALERT "too many badblocks (%lu) on device %s [%d]\n", + count_badblocks(&conf->disks[i]) + 1, bdevname(conf->disks[i].rdev->bdev, b), + atomic_read(&rdev->nr_pending)); +#ifndef USEREH + md_error(conf->mddev, conf->disks[i].rdev); +#else + if (!conf->mddev->eeh_thread) { + conf->mddev->eeh_thread = md_register_thread(raid5_eeh, conf->mddev, "%s_eeh"); + if (!conf->mddev->eeh_thread) { + printk(KERN_ERR + "raid5: couldn't allocate external error handler thread for %s\n", + mdname(conf->mddev)); + md_error(conf->mddev, conf->disks[i].rdev); + } else { + conf->mddev->eeh_data.failed_num = i; + md_wakeup_thread(conf->mddev->eeh_thread); + } + } +#endif + } + } + + // ha kozben volt masik IO es azt kapjuk elobb ide?? 
+ // hasonlo dolog van a rewrite-nal is, egy cipo + clear_bit(R5_HardReadErr, &dev->flags); + } failed++; failed_num = i; - } else + PRINTK("device %d failed for this stripe r%p w%p\n", i, dev->toread, dev->towrite); + } else { set_bit(R5_Insync, &dev->flags); + } } + if (aspare && failed > 1) + failed--; /* failed = 1 means "all ok" if we've aspare, this is simplest + method to do our work */ PRINTK("locked=%d uptodate=%d to_read=%d" " to_write=%d failed=%d failed_num=%d\n", locked, uptodate, to_read, to_write, failed, failed_num); @@ -1047,7 +1503,7 @@ /* fail any reads if this device is non-operational */ if (!test_bit(R5_Insync, &sh->dev[i].flags) || - test_bit(R5_ReadError, &sh->dev[i].flags)) { + test_bit(R5_ReadError, &sh->dev[i].flags)) { // have meaning of this?? bi = sh->dev[i].toread; sh->dev[i].toread = NULL; if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags)) @@ -1070,6 +1526,8 @@ } } if (failed > 1 && syncing) { + printk(KERN_ALERT "sync stopped by IO error, marking the spare failed\n"); + conf->disks[failed_num].rdev->faulty = 1; md_done_sync(conf->mddev, STRIPE_SECTORS,0); clear_bit(STRIPE_SYNCING, &sh->state); syncing = 0; @@ -1249,6 +1707,26 @@ PRINTK("Writing block %d\n", i); locked++; set_bit(R5_Wantwrite, &sh->dev[i].flags); + if (aspare && i == asparenum) { + char *ps, *pd; + + /* mirroring this new block */ + PRINTK("Writing to aspare too %d->%d\n", + i, conf->raid_disks); + /*if (test_bit(R5_LOCKED, &sh->dev[conf->raid_disks].flags)) { + printk("bazmeg, ez lokkolt1!!!\n"); + }*/ + ps = page_address(sh->dev[i].page); + pd = page_address(sh->dev[conf->raid_disks].page); + /* better idea? 
*/
+						memcpy(pd, ps, STRIPE_SIZE);
+						set_bit(R5_LOCKED, &sh->dev[conf->raid_disks].flags);
+						set_bit(R5_Wantwrite, &sh->dev[conf->raid_disks].flags);
+					}
+					if (conf->disks[i].rdev && conf->disks[i].rdev->in_sync) {
+						PRINTK("reset badblock on %d: %llu\n", i, sh->sector);
+						delete_badblock(&conf->disks[i], sh->sector);
+					}
 					if (!test_bit(R5_Insync, &sh->dev[i].flags)
 					    || (i==sh->pd_idx && failed == 0))
 						set_bit(STRIPE_INSYNC, &sh->state);
@@ -1285,14 +1763,30 @@
 			if (failed==0)
 				failed_num = sh->pd_idx;
 			/* should be able to compute the missing block and write it to spare */
+			if (aspare)
+				failed_num = asparenum;
 			if (!test_bit(R5_UPTODATE, &sh->dev[failed_num].flags)) {
 				if (uptodate+1 != disks)
 					BUG();
 				compute_block(sh, failed_num);
 				uptodate++;
 			}
+			if (aspare) {
+				char *ps, *pd;
+
+				ps = page_address(sh->dev[failed_num].page);
+				pd = page_address(sh->dev[conf->raid_disks].page);
+				memcpy(pd, ps, STRIPE_SIZE);
+				PRINTK("R5_Wantwrite to aspare, uptodate: %d %p->%p\n",
+					uptodate, ps, pd);
+				/* if (test_bit(R5_LOCKED, &sh->dev[conf->raid_disks].flags))
+					printk("damn, this one is locked (2)!\n"); */
+			}
 			if (uptodate != disks)
 				BUG();
+			if (aspare)
+				failed_num = conf->raid_disks;
 			dev = &sh->dev[failed_num];
 			set_bit(R5_LOCKED, &dev->flags);
 			set_bit(R5_Wantwrite, &dev->flags);
@@ -1300,6 +1794,9 @@
 			locked++;
 			set_bit(STRIPE_INSYNC, &sh->state);
 			set_bit(R5_Syncio, &dev->flags);
+			/* !in_sync..
+			printk("reset badblock on %d: %llu\n", failed_num, sh->sector);
+			delete_badblock(&conf->disks[failed_num], sh->sector); */
 		}
 	}
 	if (syncing && locked == 0 && test_bit(STRIPE_INSYNC, &sh->state)) {
@@ -1336,7 +1833,7 @@
 		bi->bi_size = 0;
 		bi->bi_end_io(bi, bytes, 0);
 	}
-	for (i=disks; i-- ;) {
+	for (i=disks+aspare; i-- ;) {
 		int rw;
 		struct bio *bi;
 		mdk_rdev_t *rdev;
@@ -1674,11 +2171,20 @@
 	md_check_recovery(mddev);
 
+	if (mddev->eeh_thread && mddev->eeh_data.failed_num == -1) {
+		printk(KERN_INFO "eeh_thread is done, unregistering\n");
+		md_unregister_thread(mddev->eeh_thread);
+		mddev->eeh_thread = NULL;
+	}
+
 	handled = 0;
 	spin_lock_irq(&conf->device_lock);
 	while (1) {
 		struct list_head *first;
 
+		if (mddev->eeh_thread)
+			break;
+
 		if (conf->seq_flush - conf->seq_write > 0) {
 			int seq = conf->seq_flush;
 			bitmap_unplug(mddev->bitmap);
@@ -1733,11 +2239,11 @@
 	}
 
 	mddev->private = kmalloc (sizeof (raid5_conf_t)
-				  + mddev->raid_disks * sizeof(struct disk_info),
+				  + (mddev->raid_disks + 1) * sizeof(struct disk_info),
 				  GFP_KERNEL);
 	if ((conf = mddev->private) == NULL)
 		goto abort;
-	memset (conf, 0, sizeof (*conf) + mddev->raid_disks * sizeof(struct disk_info) );
+	memset (conf, 0, sizeof (*conf) + (mddev->raid_disks + 1) * sizeof(struct disk_info) );
 	conf->mddev = mddev;
 
 	if ((conf->stripe_hashtbl = (struct stripe_head **) __get_free_pages(GFP_ATOMIC, HASH_PAGES_ORDER)) == NULL)
@@ -1765,6 +2271,8 @@
 		disk->rdev = rdev;
 
+		grow_badblocks(disk);
+
 		if (rdev->in_sync) {
 			char b[BDEVNAME_SIZE];
 			printk(KERN_INFO "raid5: device %s operational as raid"
@@ -1775,6 +2283,8 @@
 	}
 
 	conf->raid_disks = mddev->raid_disks;
+	conf->mirrorit = -1;
+	mddev->eeh_thread = NULL; /* just to be sure */
 	/*
 	 * 0 for a fully functional array, 1 for a degraded array.
*/
@@ -1825,7 +2335,7 @@
 	}
 	}
 	memory = conf->max_nr_stripes * (sizeof(struct stripe_head) +
-		 conf->raid_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;
+		 (conf->raid_disks+1) * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;
 	if (grow_stripes(conf, conf->max_nr_stripes)) {
 		printk(KERN_ERR
 			"raid5: couldn't allocate %dkB for buffers\n", memory);
@@ -1887,10 +2397,19 @@
 static int stop (mddev_t *mddev)
 {
 	raid5_conf_t *conf = (raid5_conf_t *) mddev->private;
+	int i;
 
+	/* may be blocked in user-space, kill it */
+	if (mddev->eeh_thread) {
+		md_unregister_thread(mddev->eeh_thread);
+		mddev->eeh_thread = NULL;
+	}
 	md_unregister_thread(mddev->thread);
 	mddev->thread = NULL;
 	shrink_stripes(conf);
+	for (i = conf->raid_disks; i--; )
+		if (conf->disks[i].rdev && conf->disks[i].rdev->in_sync)
+			shrink_badblocks(&conf->disks[i]);
 	free_pages((unsigned long) conf->stripe_hashtbl, HASH_PAGES_ORDER);
 	blk_sync_queue(mddev->queue); /* the unplug fn references 'conf' */
 	kfree(conf);
@@ -1936,7 +2455,9 @@
 static void status (struct seq_file *seq, mddev_t *mddev)
 {
 	raid5_conf_t *conf = (raid5_conf_t *) mddev->private;
-	int i;
+	int i, j;
+	char b[BDEVNAME_SIZE];
+	struct badblock *bb;
 
 	seq_printf (seq, " level %d, %dk chunk, algorithm %d", mddev->level, mddev->chunk_size >> 10, mddev->layout);
 	seq_printf (seq, " [%d/%d] [", conf->raid_disks, conf->working_disks);
@@ -1949,6 +2470,20 @@
 #define D(x) \
 	seq_printf (seq, "<"#x":%d>", atomic_read(&conf->x))
 	printall(conf);
+
+	spin_lock_irq(&conf->device_lock); /* acceptable for a debug path */
+	seq_printf (seq, "\n known bad sectors on active devices:");
+	for (i = conf->raid_disks; i--; ) {
+		if (conf->disks[i].rdev) {
+			seq_printf (seq, "\n %s", bdevname(conf->disks[i].rdev->bdev, b));
+			for (j = 0; j < BB_NR_HASH; j++) {
+				bb = conf->disks[i].badblock_hashtbl[j];
+				for (; bb; bb = bb->hash_next)
+					seq_printf (seq, " %llu-%llu", bb->sector,
						bb->sector + (unsigned long long)(STRIPE_SIZE / 512) - 1);
+			}
+		}
+	}
+	spin_unlock_irq(&conf->device_lock);
 #endif
 }
@@ -1992,6 +2527,16 @@
 			tmp->rdev->in_sync = 1;
 		}
 	}
+	tmp = conf->disks + i;
+	if (tmp->rdev && !tmp->rdev->faulty && !tmp->rdev->in_sync) {
+		tmp->rdev->in_sync = 1;
+
+		printk(KERN_NOTICE "raid5_spare_active: %d in_sync %d->%d\n",
+			i, tmp->rdev->raid_disk, conf->mirrorit);
+
+		/* scary..? :} */
+		tmp->rdev->raid_disk = conf->mirrorit;
+	}
 	print_raid5_conf(conf);
 	return 0;
 }
@@ -2005,6 +2550,7 @@
 	print_raid5_conf(conf);
 	rdev = p->rdev;
+	printk("raid5_remove_disk %d\n", number);
 	if (rdev) {
 		if (rdev->in_sync ||
 		    atomic_read(&rdev->nr_pending)) {
@@ -2018,6 +2564,17 @@
 			err = -EBUSY;
 			p->rdev = rdev;
 		}
+		if (!err) {
+			shrink_badblocks(p);
+
+			/* stopped by IO error.. */
+			if (conf->mirrorit != -1
+			    && conf->disks[conf->raid_disks].rdev == NULL) {
+				printk(KERN_INFO "raid5_remove_disk: IO error on proactive mirroring of %d!\n",
+					conf->mirrorit);
+				conf->mirrorit = -1;
+			}
+		}
 	}
 abort:
@@ -2049,6 +2606,29 @@
 			p->rdev = rdev;
 			break;
 		}
+
+	if (!found && conf->disks[disk].rdev == NULL) {
+		char b[BDEVNAME_SIZE];
+
+		/* array optimal, this should be the 'active spare' added by eeh_thread/error() */
+		conf->disks[disk].rdev = rdev;
+		rdev->in_sync = 0;
+		rdev->raid_disk = conf->raid_disks;
+		conf->fullsync = 1;
+
+		if (mddev->degraded) /* if we're here and it's true, we were called after error() */
+			mddev->degraded--;
+		else
+			conf->mirrorit = mddev->eeh_data.failed_num;
+		found = 1;
+
+		printk(KERN_NOTICE "added spare for proactive replacement of %s\n",
+			bdevname(conf->disks[conf->mirrorit].rdev->bdev, b));
+	}
+	if (found)
+		grow_badblocks(&conf->disks[disk]);
+	printk(KERN_INFO "raid5_add_disk: %d (%d) in_sync: %d\n",
+		disk, found, found ? rdev->in_sync : -1);
+
 	print_raid5_conf(conf);
 	return found;
 }