Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.") introduces a sliding resync window for the raid1 I/O barrier. The idea is to limit I/O barriers to happen only inside the sliding resync window; regular I/Os outside this resync window no longer need to wait for the barrier. On a large raid1 device this helps a lot to improve parallel write I/O throughput when background resync I/O is running at the same time.

The idea of the sliding resync window is great, but there are several challenges which are very difficult to solve,
- code complexity
  The sliding resync window requires several variables to work collectively, which is complex and very hard to get correct. Just grep "Fixes: 79ef3a8aa1" in the kernel git log: there are 8 more patches fixing the original resync window patch, and this is not the end; any further related modification may easily introduce more regressions.
- multiple sliding resync windows
  Currently the raid1 code has only a single sliding resync window; we cannot do parallel resync with the current I/O barrier implementation. Implementing multiple resync windows would be much more complex and very hard to get correct.

Therefore I decided to implement a much simpler raid1 I/O barrier by removing the resync window code; I believe life will be much easier.

The brief idea of the simpler barrier is,
- Do not maintain a global, unique resync window.
- Use multiple hash buckets to reduce I/O barrier conflicts: a regular I/O only has to wait for a resync I/O when both of them hit the same barrier bucket index, and vice versa.
- I/O barrier conflicts can be reduced to an acceptable number if there are enough barrier buckets.

Here is how the barrier buckets are designed,
- BARRIER_UNIT_SECTOR_SIZE
  The whole LBA address space of a raid1 device is divided into multiple barrier units, each of BARRIER_UNIT_SECTOR_SIZE sectors. A bio request won't go across the border of a barrier unit, which means the maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 bytes.
- BARRIER_BUCKETS_NR
  There are BARRIER_BUCKETS_NR buckets in total; if multiple I/O requests hit different barrier units, they only need to compete for the I/O barrier with other I/Os which hit the same barrier bucket index. The index of the barrier bucket a bio should look up is calculated by get_barrier_bucket_idx(),
	(sector >> BARRIER_UNIT_SECTOR_BITS) % BARRIER_BUCKETS_NR
  where sector is the start sector number of the bio. align_to_barrier_unit_end() makes sure the final bio sent into generic_make_request() won't exceed the border of the barrier unit it starts in. The number of barrier buckets is defined by,
	#define BARRIER_BUCKETS_NR	(PAGE_SIZE/sizeof(long))
  For a 4KB page size there are 512 buckets for each raid1 device, so the probability of a fully random I/O barrier conflict may be reduced down to 1/512.

Comparing to the single sliding resync window,
- Currently resync I/O grows linearly, so regular and resync I/O will still conflict within a single barrier unit; in that respect it is similar to the single sliding resync window.
- But a barrier bucket is shared by all barrier units with an identical barrier bucket index, so the probability of a conflict might be higher than with the single sliding resync window, in the case that write I/Os always hit barrier units which have the same barrier bucket index as the resync I/Os. This is a very rare condition in real I/O workloads; I cannot imagine how it could happen in practice.
- Therefore we can achieve a low enough conflict rate with a much simpler barrier algorithm and implementation.

If a user has a (really) large raid1 device, for example 10PB in size, we may simply increase the number of buckets, BARRIER_BUCKETS_NR. Currently this is a macro; it could become a variable defined at raid1 creation time in the future.
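
For illustration only (not part of the patch): a minimal user-space sketch of the bucket math described above, assuming a 4KB page size and an 8-byte long (512 buckets). get_barrier_bucket_idx() mirrors the helper added by this patch; sectors_to_unit_end() and main() are invented here just to demonstrate the clamping that align_to_barrier_unit_end() performs.

#include <stdio.h>
#include <stdint.h>

typedef uint64_t sector_t;

/* 64MB barrier units, counted in 512-byte sectors */
#define BARRIER_UNIT_SECTOR_BITS	17
#define BARRIER_UNIT_SECTOR_SIZE	(1UL << BARRIER_UNIT_SECTOR_BITS)
/* PAGE_SIZE/sizeof(long) in the patch; 4096/8 = 512 assumed here */
#define BARRIER_BUCKETS_NR		(4096 / sizeof(long))

/* Same mapping as the patch: barrier unit number modulo the bucket count */
static long get_barrier_bucket_idx(sector_t sector)
{
	return (long)((sector >> BARRIER_UNIT_SECTOR_BITS) % BARRIER_BUCKETS_NR);
}

/*
 * Clamp an I/O of 'sectors' sectors starting at 'start' so it does not
 * cross the end of the barrier unit containing 'start' -- the guarantee
 * align_to_barrier_unit_end() provides for bios sent down the stack.
 */
static sector_t sectors_to_unit_end(sector_t start, sector_t sectors)
{
	sector_t unit_end = (start | (BARRIER_UNIT_SECTOR_SIZE - 1)) + 1;
	sector_t len = unit_end - start;

	return len < sectors ? len : sectors;
}

int main(void)
{
	/* unit 3 maps to bucket 3; unit 515 wraps around to bucket 3 as well */
	sector_t a = 3UL << BARRIER_UNIT_SECTOR_BITS;
	sector_t b = 515UL << BARRIER_UNIT_SECTOR_BITS;

	printf("bucket(a)=%ld bucket(b)=%ld\n",
	       get_barrier_bucket_idx(a), get_barrier_bucket_idx(b));

	/* a 2048-sector write starting 100 sectors before a unit border is
	 * clamped to 100 sectors; the remainder is issued as a new bio */
	printf("clamped=%llu\n",
	       (unsigned long long)sectors_to_unit_end(a - 100, 2048));
	return 0;
}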
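
And a tiny single-threaded model of the per-bucket accounting, only to show why a regular write in one bucket is not blocked by a resync holding the barrier on another bucket. In the real code these counters are conf->barrier[] and conf->nr_pending[], updated under conf->resync_lock, and waiters sleep on conf->wait_barrier instead of getting a boolean back; the function names below are invented for the sketch.

#include <stdbool.h>
#include <stdio.h>

#define BARRIER_BUCKETS_NR 512

static int barrier[BARRIER_BUCKETS_NR];		/* resync barrier raised per bucket */
static int nr_pending[BARRIER_BUCKETS_NR];	/* in-flight regular I/O per bucket */
static bool array_frozen;

/* resync may raise the barrier on a bucket once no regular I/O is pending there */
static bool try_raise_barrier(long idx)
{
	if (array_frozen || nr_pending[idx])
		return false;		/* the real code waits here */
	barrier[idx]++;
	return true;
}

/* a regular write only has to wait when its own bucket's barrier is raised */
static bool write_may_proceed(long idx)
{
	if (array_frozen || barrier[idx])
		return false;		/* the real code waits here */
	nr_pending[idx]++;
	return true;
}

int main(void)
{
	try_raise_barrier(3);		/* resync active in bucket 3 */
	printf("write to bucket 3: %d\n", write_may_proceed(3));	/* 0, blocked */
	printf("write to bucket 7: %d\n", write_may_proceed(7));	/* 1, unaffected */
	return 0;
}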
There are several changes that should be noticed,
- In raid1d(), I change the code to decrement conf->nr_queued[idx] inside a single loop; it looks like this,
	spin_lock_irqsave(&conf->device_lock, flags);
	conf->nr_queued[idx]--;
	spin_unlock_irqrestore(&conf->device_lock, flags);
  This change generates more spin lock operations, but in the next patch of this patch set it will be replaced by a single line of code,
	atomic_dec(&conf->nr_queued[idx]);
  so we don't need to worry about the spin lock cost here.
- In raid1_make_request(), wait_barrier() is replaced by,
  a) wait_read_barrier(): wait for the barrier in the regular read I/O code path
  b) wait_barrier(): wait for the barrier in the regular write I/O code path
  The difference is that wait_read_barrier() only waits if the array is frozen. I am not able to combine them into one function, because they must receive different data types in their argument lists.
- align_to_barrier_unit_end() is called to make sure both regular and resync I/O won't go across the border of a barrier unit.

Open question:
- This needs review from the md clustering developers; I don't touch the related code for now.

Signed-off-by: Coly Li <colyli@xxxxxxx>
Cc: Shaohua Li <shli@xxxxxx>
Cc: Neil Brown <neilb@xxxxxxx>
Cc: Johannes Thumshirn <jthumshirn@xxxxxxx>
Cc: Guoqing Jiang <gqjiang@xxxxxxxx>
---
 drivers/md/raid1.c | 389 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------------
 drivers/md/raid1.h |  42 +++++-------
 2 files changed, 242 insertions(+), 189 deletions(-)
Index: linux-raid1/drivers/md/raid1.c
===================================================================
--- linux-raid1.orig/drivers/md/raid1.c
+++ linux-raid1/drivers/md/raid1.c
@@ -66,9 +66,8 @@ */ static int max_queued_requests = 1024; -static void allow_barrier(struct r1conf *conf, sector_t start_next_window, - sector_t bi_sector); -static void lower_barrier(struct r1conf *conf); +static void allow_barrier(struct r1conf *conf, sector_t sector_nr); +static void lower_barrier(struct r1conf *conf, sector_t sector_nr); static void * r1bio_pool_alloc(gfp_t gfp_flags, void *data) { @@ -92,7 +91,6 @@ static void r1bio_pool_free(void *r1_bio #define RESYNC_WINDOW_SECTORS (RESYNC_WINDOW >> 9) #define CLUSTER_RESYNC_WINDOW (16 * RESYNC_WINDOW) #define CLUSTER_RESYNC_WINDOW_SECTORS (CLUSTER_RESYNC_WINDOW >> 9) -#define NEXT_NORMALIO_DISTANCE (3 * RESYNC_WINDOW_SECTORS) static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data) { @@ -198,6 +196,7 @@ static void put_buf(struct r1bio *r1_bio { struct r1conf *conf = r1_bio->mddev->private; int i; + sector_t sector_nr = r1_bio->sector; for (i = 0; i < conf->raid_disks * 2; i++) { struct bio *bio = r1_bio->bios[i]; @@ -207,7 +206,7 @@ static void put_buf(struct r1bio *r1_bio mempool_free(r1_bio, conf->r1buf_pool); - lower_barrier(conf); + lower_barrier(conf, sector_nr); } static void reschedule_retry(struct r1bio *r1_bio) @@ -215,10 +214,15 @@ static void reschedule_retry(struct r1bi unsigned long flags; struct mddev *mddev = r1_bio->mddev; struct r1conf *conf = mddev->private; + sector_t sector_nr; + long idx; + + sector_nr = r1_bio->sector; + idx = get_barrier_bucket_idx(sector_nr); spin_lock_irqsave(&conf->device_lock, flags); list_add(&r1_bio->retry_list, 
&conf->retry_list); - conf->nr_queued ++; + conf->nr_queued[idx]++; spin_unlock_irqrestore(&conf->device_lock, flags); wake_up(&conf->wait_barrier); @@ -235,8 +239,6 @@ static void call_bio_endio(struct r1bio struct bio *bio = r1_bio->master_bio; int done; struct r1conf *conf = r1_bio->mddev->private; - sector_t start_next_window = r1_bio->start_next_window; - sector_t bi_sector = bio->bi_iter.bi_sector; if (bio->bi_phys_segments) { unsigned long flags; @@ -255,19 +257,14 @@ static void call_bio_endio(struct r1bio if (!test_bit(R1BIO_Uptodate, &r1_bio->state)) bio->bi_error = -EIO; - if (done) { + if (done) bio_endio(bio); - /* - * Wake up any possible resync thread that waits for the device - * to go idle. - */ - allow_barrier(conf, start_next_window, bi_sector); - } } static void raid_end_bio_io(struct r1bio *r1_bio) { struct bio *bio = r1_bio->master_bio; + struct r1conf *conf = r1_bio->mddev->private; /* if nobody has done the final endio yet, do it now */ if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) { @@ -278,6 +275,12 @@ static void raid_end_bio_io(struct r1bio call_bio_endio(r1_bio); } + + /* + * Wake up any possible resync thread that waits for the device + * to go idle. + */ + allow_barrier(conf, r1_bio->sector); free_r1bio(r1_bio); } @@ -311,6 +314,7 @@ static int find_bio_disk(struct r1bio *r return mirror; } +/* bi_end_io callback for regular READ bio */ static void raid1_end_read_request(struct bio *bio) { int uptodate = !bio->bi_error; @@ -490,6 +494,25 @@ static void raid1_end_write_request(stru bio_put(to_put); } +static sector_t align_to_barrier_unit_end(sector_t start_sector, + sector_t sectors) +{ + sector_t len; + + WARN_ON(sectors == 0); + /* len is the number of sectors from start_sector to end of the + * barrier unit which start_sector belongs to. + */ + len = ((start_sector + sectors + (1<<BARRIER_UNIT_SECTOR_BITS) - 1) & + (~(BARRIER_UNIT_SECTOR_SIZE - 1))) - + start_sector; + + if (len > sectors) + len = sectors; + + return len; +} + /* * This routine returns the disk from which the requested read should * be done. There is a per-array 'next expected sequential IO' sector @@ -691,6 +714,7 @@ static int read_balance(struct r1conf *c conf->mirrors[best_disk].next_seq_sect = this_sector + sectors; } rcu_read_unlock(); + sectors = align_to_barrier_unit_end(this_sector, sectors); *max_sectors = sectors; return best_disk; @@ -779,168 +803,174 @@ static void flush_pending_writes(struct * there is no normal IO happeing. It must arrange to call * lower_barrier when the particular background IO completes. */ + static void raise_barrier(struct r1conf *conf, sector_t sector_nr) { + long idx = get_barrier_bucket_idx(sector_nr); + spin_lock_irq(&conf->resync_lock); /* Wait until no block IO is waiting */ - wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting, + wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting[idx], conf->resync_lock); /* block any new IO from starting */ - conf->barrier++; - conf->next_resync = sector_nr; + conf->barrier[idx]++; /* For these conditions we must wait: * A: while the array is in frozen state - * B: while barrier >= RESYNC_DEPTH, meaning resync reach - * the max count which allowed. - * C: next_resync + RESYNC_SECTORS > start_next_window, meaning - * next resync will reach to the window which normal bios are - * handling. - * D: while there are any active requests in the current window. + * B: while conf->nr_pending[idx] is not 0, meaning regular I/O + * existing in sector number ranges corresponding to idx. 
+ * C: while conf->barrier[idx] >= RESYNC_DEPTH, meaning resync reach + * the max count which allowed in sector number ranges + * conrresponding to idx. */ wait_event_lock_irq(conf->wait_barrier, - !conf->array_frozen && - conf->barrier < RESYNC_DEPTH && - conf->current_window_requests == 0 && - (conf->start_next_window >= - conf->next_resync + RESYNC_SECTORS), + !conf->array_frozen && !conf->nr_pending[idx] && + conf->barrier[idx] < RESYNC_DEPTH, conf->resync_lock); - - conf->nr_pending++; + conf->nr_pending[idx]++; spin_unlock_irq(&conf->resync_lock); } -static void lower_barrier(struct r1conf *conf) +static void lower_barrier(struct r1conf *conf, sector_t sector_nr) { unsigned long flags; - BUG_ON(conf->barrier <= 0); + long idx = get_barrier_bucket_idx(sector_nr); + + BUG_ON(conf->barrier[idx] <= 0); spin_lock_irqsave(&conf->resync_lock, flags); - conf->barrier--; - conf->nr_pending--; + conf->barrier[idx]--; + conf->nr_pending[idx]--; spin_unlock_irqrestore(&conf->resync_lock, flags); wake_up(&conf->wait_barrier); } -static bool need_to_wait_for_sync(struct r1conf *conf, struct bio *bio) +/* A regular I/O should wait when, + * - The whole array is frozen (both READ and WRITE) + * - bio is WRITE and in same barrier bucket conf->barrier[idx] raised + */ +static void _wait_barrier(struct r1conf *conf, long idx) { - bool wait = false; - - if (conf->array_frozen || !bio) - wait = true; - else if (conf->barrier && bio_data_dir(bio) == WRITE) { - if ((conf->mddev->curr_resync_completed - >= bio_end_sector(bio)) || - (conf->next_resync + NEXT_NORMALIO_DISTANCE - <= bio->bi_iter.bi_sector)) - wait = false; - else - wait = true; + spin_lock_irq(&conf->resync_lock); + if (conf->array_frozen || conf->barrier[idx]) { + conf->nr_waiting[idx]++; + /* Wait for the barrier to drop. */ + wait_event_lock_irq( + conf->wait_barrier, + !conf->array_frozen && !conf->barrier[idx], + conf->resync_lock); + conf->nr_waiting[idx]--; } - return wait; + conf->nr_pending[idx]++; + spin_unlock_irq(&conf->resync_lock); } -static sector_t wait_barrier(struct r1conf *conf, struct bio *bio) +static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr) { - sector_t sector = 0; + long idx = get_barrier_bucket_idx(sector_nr); spin_lock_irq(&conf->resync_lock); - if (need_to_wait_for_sync(conf, bio)) { - conf->nr_waiting++; - /* Wait for the barrier to drop. - * However if there are already pending - * requests (preventing the barrier from - * rising completely), and the - * per-process bio queue isn't empty, - * then don't wait, as we need to empty - * that queue to allow conf->start_next_window - * to increase. 
- */ - wait_event_lock_irq(conf->wait_barrier, - !conf->array_frozen && - (!conf->barrier || - ((conf->start_next_window < - conf->next_resync + RESYNC_SECTORS) && - current->bio_list && - !bio_list_empty(current->bio_list))), - conf->resync_lock); - conf->nr_waiting--; - } - - if (bio && bio_data_dir(bio) == WRITE) { - if (bio->bi_iter.bi_sector >= conf->next_resync) { - if (conf->start_next_window == MaxSector) - conf->start_next_window = - conf->next_resync + - NEXT_NORMALIO_DISTANCE; - - if ((conf->start_next_window + NEXT_NORMALIO_DISTANCE) - <= bio->bi_iter.bi_sector) - conf->next_window_requests++; - else - conf->current_window_requests++; - sector = conf->start_next_window; - } + if (conf->array_frozen) { + conf->nr_waiting[idx]++; + /* Wait for array to unfreeze */ + wait_event_lock_irq( + conf->wait_barrier, + !conf->array_frozen, + conf->resync_lock); + conf->nr_waiting[idx]--; } - - conf->nr_pending++; + conf->nr_pending[idx]++; spin_unlock_irq(&conf->resync_lock); - return sector; } -static void allow_barrier(struct r1conf *conf, sector_t start_next_window, - sector_t bi_sector) +static void wait_barrier(struct r1conf *conf, sector_t sector_nr) +{ + long idx = get_barrier_bucket_idx(sector_nr); + + _wait_barrier(conf, idx); +} + +static void wait_all_barriers(struct r1conf *conf) +{ + long idx; + + for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++) + _wait_barrier(conf, idx); +} + +static void _allow_barrier(struct r1conf *conf, long idx) { unsigned long flags; spin_lock_irqsave(&conf->resync_lock, flags); - conf->nr_pending--; - if (start_next_window) { - if (start_next_window == conf->start_next_window) { - if (conf->start_next_window + NEXT_NORMALIO_DISTANCE - <= bi_sector) - conf->next_window_requests--; - else - conf->current_window_requests--; - } else - conf->current_window_requests--; - - if (!conf->current_window_requests) { - if (conf->next_window_requests) { - conf->current_window_requests = - conf->next_window_requests; - conf->next_window_requests = 0; - conf->start_next_window += - NEXT_NORMALIO_DISTANCE; - } else - conf->start_next_window = MaxSector; - } - } + conf->nr_pending[idx]--; spin_unlock_irqrestore(&conf->resync_lock, flags); wake_up(&conf->wait_barrier); } +static void allow_barrier(struct r1conf *conf, sector_t sector_nr) +{ + long idx = get_barrier_bucket_idx(sector_nr); + + _allow_barrier(conf, idx); +} + +static void allow_all_barriers(struct r1conf *conf) +{ + long idx; + + for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++) + _allow_barrier(conf, idx); +} + + +/* conf->resync_lock should be held */ +static int get_all_pendings(struct r1conf *conf) +{ + long idx; + int ret; + + for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++) + ret += conf->nr_pending[idx]; + return ret; +} + +/* conf->resync_lock should be held */ +static int get_all_queued(struct r1conf *conf) +{ + long idx; + int ret; + + for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++) + ret += conf->nr_queued[idx]; + return ret; +} + static void freeze_array(struct r1conf *conf, int extra) { /* stop syncio and normal IO and wait for everything to * go quite. - * We wait until nr_pending match nr_queued+extra + * We wait until get_all_pending() matches get_all_queued()+extra, + * which means sum of conf->nr_pending[] matches sum of + * conf->nr_queued[] plus extra (which might be 0 or 1). * This is called in the context of one normal IO request * that has failed. 
Thus any sync request that might be pending - * will be blocked by nr_pending, and we need to wait for + * will be blocked by a conf->nr_pending[idx] which the idx depends + * on the request's sector number, and we need to wait for * pending IO requests to complete or be queued for re-try. - * Thus the number queued (nr_queued) plus this request (extra) - * must match the number of pending IOs (nr_pending) before - * we continue. + * Thus the number queued (sum of conf->nr_queued[]) plus this + * request (extra) must match the number of pending IOs (sum + * of conf->nr_pending[]) before we continue. */ spin_lock_irq(&conf->resync_lock); conf->array_frozen = 1; - wait_event_lock_irq_cmd(conf->wait_barrier, - conf->nr_pending == conf->nr_queued+extra, - conf->resync_lock, - flush_pending_writes(conf)); + wait_event_lock_irq_cmd( + conf->wait_barrier, + get_all_pendings(conf) == get_all_queued(conf)+extra, + conf->resync_lock, + flush_pending_writes(conf)); spin_unlock_irq(&conf->resync_lock); } static void unfreeze_array(struct r1conf *conf) @@ -1031,6 +1061,7 @@ static void raid1_unplug(struct blk_plug kfree(plug); } + static void raid1_make_request(struct mddev *mddev, struct bio * bio) { struct r1conf *conf = mddev->private; @@ -1051,7 +1082,6 @@ static void raid1_make_request(struct md int first_clone; int sectors_handled; int max_sectors; - sector_t start_next_window; /* * Register the new request and wait if the reconstruction @@ -1087,8 +1117,6 @@ static void raid1_make_request(struct md finish_wait(&conf->wait_barrier, &w); } - start_next_window = wait_barrier(conf, bio); - bitmap = mddev->bitmap; /* @@ -1121,6 +1149,14 @@ static void raid1_make_request(struct md int rdisk; read_again: + /* Still need barrier for READ in case that whole + * array is frozen. + */ + wait_read_barrier(conf, r1_bio->sector); + /* max_sectors from read_balance is modified to no + * go across border of the barrier unit which + * r1_bio->sector is in. + */ rdisk = read_balance(conf, r1_bio, &max_sectors); if (rdisk < 0) { @@ -1140,7 +1176,6 @@ read_again: atomic_read(&bitmap->behind_writes) == 0); } r1_bio->read_disk = rdisk; - r1_bio->start_next_window = 0; read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev); bio_trim(read_bio, r1_bio->sector - bio->bi_iter.bi_sector, @@ -1211,7 +1246,7 @@ read_again: disks = conf->raid_disks * 2; retry_write: - r1_bio->start_next_window = start_next_window; + wait_barrier(conf, r1_bio->sector); blocked_rdev = NULL; rcu_read_lock(); max_sectors = r1_bio->sectors; @@ -1279,27 +1314,17 @@ read_again: if (unlikely(blocked_rdev)) { /* Wait for this device to become unblocked */ int j; - sector_t old = start_next_window; for (j = 0; j < i; j++) if (r1_bio->bios[j]) rdev_dec_pending(conf->mirrors[j].rdev, mddev); r1_bio->state = 0; - allow_barrier(conf, start_next_window, bio->bi_iter.bi_sector); + allow_barrier(conf, r1_bio->sector); md_wait_for_blocked_rdev(blocked_rdev, mddev); - start_next_window = wait_barrier(conf, bio); - /* - * We must make sure the multi r1bios of bio have - * the same value of bi_phys_segments - */ - if (bio->bi_phys_segments && old && - old != start_next_window) - /* Wait for the former r1bio(s) to complete */ - wait_event(conf->wait_barrier, - bio->bi_phys_segments == 1); goto retry_write; } + max_sectors = align_to_barrier_unit_end(r1_bio->sector, max_sectors); if (max_sectors < r1_bio->sectors) { /* We are splitting this write into multiple parts, so * we need to prepare for allocating another r1_bio. 
@@ -1495,19 +1520,11 @@ static void print_conf(struct r1conf *co static void close_sync(struct r1conf *conf) { - wait_barrier(conf, NULL); - allow_barrier(conf, 0, 0); + wait_all_barriers(conf); + allow_all_barriers(conf); mempool_destroy(conf->r1buf_pool); conf->r1buf_pool = NULL; - - spin_lock_irq(&conf->resync_lock); - conf->next_resync = MaxSector - 2 * NEXT_NORMALIO_DISTANCE; - conf->start_next_window = MaxSector; - conf->current_window_requests += - conf->next_window_requests; - conf->next_window_requests = 0; - spin_unlock_irq(&conf->resync_lock); } static int raid1_spare_active(struct mddev *mddev) @@ -1787,7 +1804,7 @@ static int fix_sync_read_error(struct r1 struct bio *bio = r1_bio->bios[r1_bio->read_disk]; sector_t sect = r1_bio->sector; int sectors = r1_bio->sectors; - int idx = 0; + long idx = 0; while(sectors) { int s = sectors; @@ -1983,6 +2000,14 @@ static void process_checks(struct r1bio } } +/* If there is no error encountered during sync writing out, there are two + * places to destroy r1_bio: + * - sync_request_write(): If all wbio completed even before returning + * back to its caller. + * - end_sync_write(): when all remaining sync writes are done + * When there are error encountered from the above functions, r1_bio will + * be handled to handle_sync_write_finish() by reschedule_retry(). + */ static void sync_request_write(struct mddev *mddev, struct r1bio *r1_bio) { struct r1conf *conf = mddev->private; @@ -2244,6 +2269,9 @@ static void handle_write_finished(struct { int m; bool fail = false; + sector_t sector_nr; + long idx; + for (m = 0; m < conf->raid_disks * 2 ; m++) if (r1_bio->bios[m] == IO_MADE_GOOD) { struct md_rdev *rdev = conf->mirrors[m].rdev; @@ -2269,7 +2297,9 @@ static void handle_write_finished(struct if (fail) { spin_lock_irq(&conf->device_lock); list_add(&r1_bio->retry_list, &conf->bio_end_io_list); - conf->nr_queued++; + sector_nr = r1_bio->sector; + idx = get_barrier_bucket_idx(sector_nr); + conf->nr_queued[idx]++; spin_unlock_irq(&conf->device_lock); md_wakeup_thread(conf->mddev->thread); } else { @@ -2380,6 +2410,8 @@ static void raid1d(struct md_thread *thr struct r1conf *conf = mddev->private; struct list_head *head = &conf->retry_list; struct blk_plug plug; + sector_t sector_nr; + long idx; md_check_recovery(mddev); @@ -2387,17 +2419,19 @@ static void raid1d(struct md_thread *thr !test_bit(MD_CHANGE_PENDING, &mddev->flags)) { LIST_HEAD(tmp); spin_lock_irqsave(&conf->device_lock, flags); - if (!test_bit(MD_CHANGE_PENDING, &mddev->flags)) { - while (!list_empty(&conf->bio_end_io_list)) { - list_move(conf->bio_end_io_list.prev, &tmp); - conf->nr_queued--; - } - } + if (!test_bit(MD_CHANGE_PENDING, &mddev->flags)) + list_splice_init(&conf->bio_end_io_list, &tmp); spin_unlock_irqrestore(&conf->device_lock, flags); + while (!list_empty(&tmp)) { r1_bio = list_first_entry(&tmp, struct r1bio, retry_list); list_del(&r1_bio->retry_list); + sector_nr = r1_bio->sector; + idx = get_barrier_bucket_idx(sector_nr); + spin_lock_irqsave(&conf->device_lock, flags); + conf->nr_queued[idx]--; + spin_unlock_irqrestore(&conf->device_lock, flags); if (mddev->degraded) set_bit(R1BIO_Degraded, &r1_bio->state); if (test_bit(R1BIO_WriteError, &r1_bio->state)) @@ -2418,7 +2452,9 @@ static void raid1d(struct md_thread *thr } r1_bio = list_entry(head->prev, struct r1bio, retry_list); list_del(head->prev); - conf->nr_queued--; + sector_nr = r1_bio->sector; + idx = get_barrier_bucket_idx(sector_nr); + conf->nr_queued[idx]--; spin_unlock_irqrestore(&conf->device_lock, flags); 
mddev = r1_bio->mddev; @@ -2457,7 +2493,6 @@ static int init_resync(struct r1conf *co conf->poolinfo); if (!conf->r1buf_pool) return -ENOMEM; - conf->next_resync = 0; return 0; } @@ -2486,6 +2521,7 @@ static sector_t raid1_sync_request(struc int still_degraded = 0; int good_sectors = RESYNC_SECTORS; int min_bad = 0; /* number of sectors that are bad in all devices */ + long idx = get_barrier_bucket_idx(sector_nr); if (!conf->r1buf_pool) if (init_resync(conf)) @@ -2535,7 +2571,7 @@ static sector_t raid1_sync_request(struc * If there is non-resync activity waiting for a turn, then let it * though before starting on this new sync request. */ - if (conf->nr_waiting) + if (conf->nr_waiting[idx]) schedule_timeout_uninterruptible(1); /* we are incrementing sector_nr below. To be safe, we check against @@ -2562,6 +2598,8 @@ static sector_t raid1_sync_request(struc r1_bio->sector = sector_nr; r1_bio->state = 0; set_bit(R1BIO_IsSync, &r1_bio->state); + /* make sure good_sectors won't go across barrier unit border */ + good_sectors = align_to_barrier_unit_end(sector_nr, good_sectors); for (i = 0; i < conf->raid_disks * 2; i++) { struct md_rdev *rdev; @@ -2786,6 +2824,22 @@ static struct r1conf *setup_conf(struct if (!conf) goto abort; + conf->nr_pending = kzalloc(PAGE_SIZE, GFP_KERNEL); + if (!conf->nr_pending) + goto abort; + + conf->nr_waiting = kzalloc(PAGE_SIZE, GFP_KERNEL); + if (!conf->nr_waiting) + goto abort; + + conf->nr_queued = kzalloc(PAGE_SIZE, GFP_KERNEL); + if (!conf->nr_queued) + goto abort; + + conf->barrier = kzalloc(PAGE_SIZE, GFP_KERNEL); + if (!conf->barrier) + goto abort; + conf->mirrors = kzalloc(sizeof(struct raid1_info) * mddev->raid_disks * 2, GFP_KERNEL); @@ -2841,9 +2895,6 @@ static struct r1conf *setup_conf(struct conf->pending_count = 0; conf->recovery_disabled = mddev->recovery_disabled - 1; - conf->start_next_window = MaxSector; - conf->current_window_requests = conf->next_window_requests = 0; - err = -EIO; for (i = 0; i < conf->raid_disks * 2; i++) { @@ -2890,6 +2941,10 @@ static struct r1conf *setup_conf(struct kfree(conf->mirrors); safe_put_page(conf->tmppage); kfree(conf->poolinfo); + kfree(conf->nr_pending); + kfree(conf->nr_waiting); + kfree(conf->nr_queued); + kfree(conf->barrier); kfree(conf); } return ERR_PTR(err); @@ -2992,6 +3047,10 @@ static void raid1_free(struct mddev *mdd kfree(conf->mirrors); safe_put_page(conf->tmppage); kfree(conf->poolinfo); + kfree(conf->nr_pending); + kfree(conf->nr_waiting); + kfree(conf->nr_queued); + kfree(conf->barrier); kfree(conf); } Index: linux-raid1/drivers/md/raid1.h =================================================================== --- linux-raid1.orig/drivers/md/raid1.h +++ linux-raid1/drivers/md/raid1.h @@ -1,6 +1,20 @@ #ifndef _RAID1_H #define _RAID1_H +/* each barrier unit size is 64MB fow now + * note: it must be larger than RESYNC_DEPTH + */ +#define BARRIER_UNIT_SECTOR_BITS 17 +#define BARRIER_UNIT_SECTOR_SIZE (1<<17) +#define BARRIER_BUCKETS_NR (PAGE_SIZE/sizeof(long)) + +/* will use bit shift later */ +static inline long get_barrier_bucket_idx(sector_t sector) +{ + return (long)(sector >> BARRIER_UNIT_SECTOR_BITS) % BARRIER_BUCKETS_NR; + +} + struct raid1_info { struct md_rdev *rdev; sector_t head_position; @@ -35,25 +49,6 @@ struct r1conf { */ int raid_disks; - /* During resync, read_balancing is only allowed on the part - * of the array that has been resynced. 'next_resync' tells us - * where that is. 
- */ - sector_t next_resync; - - /* When raid1 starts resync, we divide array into four partitions - * |---------|--------------|---------------------|-------------| - * next_resync start_next_window end_window - * start_next_window = next_resync + NEXT_NORMALIO_DISTANCE - * end_window = start_next_window + NEXT_NORMALIO_DISTANCE - * current_window_requests means the count of normalIO between - * start_next_window and end_window. - * next_window_requests means the count of normalIO after end_window. - * */ - sector_t start_next_window; - int current_window_requests; - int next_window_requests; - spinlock_t device_lock; /* list of 'struct r1bio' that need to be processed by raid1d, @@ -79,10 +74,10 @@ struct r1conf { */ wait_queue_head_t wait_barrier; spinlock_t resync_lock; - int nr_pending; - int nr_waiting; - int nr_queued; - int barrier; + int *nr_pending; + int *nr_waiting; + int *nr_queued; + int *barrier; int array_frozen; /* Set to 1 if a full sync is needed, (fresh device added). @@ -135,7 +130,6 @@ struct r1bio { * in this BehindIO request */ sector_t sector; - sector_t start_next_window; int sectors; unsigned long state; struct mddev *mddev; -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html