Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.") introduces a sliding resync window for the raid1 I/O barrier. The idea is to limit I/O barriers to happen only inside the sliding resync window; regular I/Os outside this resync window no longer need to wait for the barrier. On a large raid1 device this helps a lot to improve parallel write I/O throughput when background resync I/O is running at the same time.

The idea of the sliding resync window is great, but there are several challenges which are very difficult to solve,
- code complexity
  The sliding resync window requires several variables to work collectively, which is complex and very hard to get correct. Just grep "Fixes: 79ef3a8aa1" in the kernel git log: there are 8 more patches fixing the original resync window patch, and this is not the end; any further related modification may easily introduce more regressions.
- multiple sliding resync windows
  Currently the raid1 code has only a single sliding resync window; we cannot do parallel resync with the current I/O barrier implementation. Implementing multiple resync windows would be much more complex and very hard to get correct.

Therefore I decided to implement a much simpler raid1 I/O barrier by removing the resync window code; I believe life will be much easier.

The brief idea of the simpler barrier is,
- Do not maintain a global, unique resync window.
- Use multiple hash buckets to reduce I/O barrier conflicts: a regular I/O only has to wait for a resync I/O when both of them hit the same barrier bucket index, and vice versa.
- I/O barrier conflicts can be reduced to an acceptable number if there are enough barrier buckets.

Here is how the barrier buckets are designed,
- BARRIER_UNIT_SECTOR_SIZE
  The whole LBA address space of a raid1 device is divided into multiple barrier units, each of BARRIER_UNIT_SECTOR_SIZE sectors. A bio request won't go across the border of a barrier unit, which means the maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 bytes.
- BARRIER_BUCKETS_NR
  There are BARRIER_BUCKETS_NR buckets in total; if multiple I/O requests hit different barrier units, they only need to compete for the I/O barrier with other I/Os which hit the same barrier bucket index. The index of the barrier bucket a bio should look up is calculated by get_barrier_bucket_idx(),
	(sector >> BARRIER_UNIT_SECTOR_BITS) % BARRIER_BUCKETS_NR
  where sector is the start sector number of the bio. align_to_barrier_unit_end() makes sure the final bio sent into generic_make_request() won't exceed the border of the barrier unit it starts in. The number of barrier buckets is defined by,
	#define BARRIER_BUCKETS_NR	(PAGE_SIZE/sizeof(long))
  For a 4KB page size there are 512 buckets for each raid1 device, so the probability of a fully random I/O barrier conflict may be reduced down to 1/512.

Comparing to the single sliding resync window,
- Currently resync I/O grows linearly, so regular and resync I/O will still conflict within a single barrier unit; in that respect it is similar to the single sliding resync window.
- But a barrier bucket is shared by all barrier units with an identical barrier bucket index, so the probability of a conflict might be higher than with the single sliding resync window, in the case that write I/Os always hit barrier units which have the same barrier bucket index as the resync I/Os. This is a very rare condition in real I/O workloads; I cannot imagine how it could happen in practice.
- Therefore we can achieve a low enough conflict rate with a much simpler barrier algorithm and implementation.

If a user has a (really) large raid1 device, for example 10PB in size, we may simply increase the number of buckets, BARRIER_BUCKETS_NR. Currently this is a macro; it could become a variable defined at raid1 creation time in the future.
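
For illustration only (not part of the patch): a minimal user-space sketch of the bucket math described above, assuming a 4KB page size and an 8-byte long (512 buckets). get_barrier_bucket_idx() mirrors the helper added by this patch; sectors_to_unit_end() and main() are invented here just to demonstrate the clamping that align_to_barrier_unit_end() performs.

#include <stdio.h>
#include <stdint.h>

typedef uint64_t sector_t;

/* 64MB barrier units, counted in 512-byte sectors */
#define BARRIER_UNIT_SECTOR_BITS	17
#define BARRIER_UNIT_SECTOR_SIZE	(1UL << BARRIER_UNIT_SECTOR_BITS)
/* PAGE_SIZE/sizeof(long) in the patch; 4096/8 = 512 assumed here */
#define BARRIER_BUCKETS_NR		(4096 / sizeof(long))

/* Same mapping as the patch: barrier unit number modulo the bucket count */
static long get_barrier_bucket_idx(sector_t sector)
{
	return (long)((sector >> BARRIER_UNIT_SECTOR_BITS) % BARRIER_BUCKETS_NR);
}

/*
 * Clamp an I/O of 'sectors' sectors starting at 'start' so it does not
 * cross the end of the barrier unit containing 'start' -- the guarantee
 * align_to_barrier_unit_end() provides for bios sent down the stack.
 */
static sector_t sectors_to_unit_end(sector_t start, sector_t sectors)
{
	sector_t unit_end = (start | (BARRIER_UNIT_SECTOR_SIZE - 1)) + 1;
	sector_t len = unit_end - start;

	return len < sectors ? len : sectors;
}

int main(void)
{
	/* unit 3 maps to bucket 3; unit 515 wraps around to bucket 3 as well */
	sector_t a = 3UL << BARRIER_UNIT_SECTOR_BITS;
	sector_t b = 515UL << BARRIER_UNIT_SECTOR_BITS;

	printf("bucket(a)=%ld bucket(b)=%ld\n",
	       get_barrier_bucket_idx(a), get_barrier_bucket_idx(b));

	/* a 2048-sector write starting 100 sectors before a unit border is
	 * clamped to 100 sectors; the remainder is issued as a new bio */
	printf("clamped=%llu\n",
	       (unsigned long long)sectors_to_unit_end(a - 100, 2048));
	return 0;
}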
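
And a tiny single-threaded model of the per-bucket accounting, only to show why a regular write in one bucket is not blocked by a resync holding the barrier on another bucket. In the real code these counters are conf->barrier[] and conf->nr_pending[], updated under conf->resync_lock, and waiters sleep on conf->wait_barrier instead of getting a boolean back; the function names below are invented for the sketch.

#include <stdbool.h>
#include <stdio.h>

#define BARRIER_BUCKETS_NR 512

static int barrier[BARRIER_BUCKETS_NR];		/* resync barrier raised per bucket */
static int nr_pending[BARRIER_BUCKETS_NR];	/* in-flight regular I/O per bucket */
static bool array_frozen;

/* resync may raise the barrier on a bucket once no regular I/O is pending there */
static bool try_raise_barrier(long idx)
{
	if (array_frozen || nr_pending[idx])
		return false;		/* the real code waits here */
	barrier[idx]++;
	return true;
}

/* a regular write only has to wait when its own bucket's barrier is raised */
static bool write_may_proceed(long idx)
{
	if (array_frozen || barrier[idx])
		return false;		/* the real code waits here */
	nr_pending[idx]++;
	return true;
}

int main(void)
{
	try_raise_barrier(3);		/* resync active in bucket 3 */
	printf("write to bucket 3: %d\n", write_may_proceed(3));	/* 0, blocked */
	printf("write to bucket 7: %d\n", write_may_proceed(7));	/* 1, unaffected */
	return 0;
}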
There are several changes that should be noticed,
- In raid1d(), I change the code to decrement conf->nr_queued[idx] inside a single loop; it looks like this,
	spin_lock_irqsave(&conf->device_lock, flags);
	conf->nr_queued[idx]--;
	spin_unlock_irqrestore(&conf->device_lock, flags);
  This change generates more spin lock operations, but in the next patch of this patch set it will be replaced by a single line of code,
	atomic_dec(&conf->nr_queued[idx]);
  so we don't need to worry about the spin lock cost here.
- In raid1_make_request(), wait_barrier() is replaced by,
  a) wait_read_barrier(): wait for the barrier in the regular read I/O code path
  b) wait_barrier(): wait for the barrier in the regular write I/O code path
  The difference is that wait_read_barrier() only waits if the array is frozen. I am not able to combine them into one function, because they must receive different data types in their argument lists.
- align_to_barrier_unit_end() is called to make sure both regular and resync I/O won't go across the border of a barrier unit.

Open question:
- This needs review from the md clustering developers; I don't touch the related code for now.

Signed-off-by: Coly Li <colyli@xxxxxxx>
Cc: Shaohua Li <shli@xxxxxx>
Cc: Neil Brown <neilb@xxxxxxx>
Cc: Johannes Thumshirn <jthumshirn@xxxxxxx>
Cc: Guoqing Jiang <gqjiang@xxxxxxxx>
---
 drivers/md/raid1.c | 389 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------------
 drivers/md/raid1.h |  42 +++++-------
 2 files changed, 242 insertions(+), 189 deletions(-)
Index: linux-raid1/drivers/md/raid1.c
===================================================================
--- linux-raid1.orig/drivers/md/raid1.c
+++ linux-raid1/drivers/md/raid1.c
@@ -66,9 +66,8 @@ */ static int max_queued_requests = 1024; -static void allow_barrier(struct r1conf *conf, sector_t start_next_window, - sector_t bi_sector); -static void lower_barrier(struct r1conf *conf); +static void allow_barrier(struct r1conf *conf, sector_t sector_nr); +static void lower_barrier(struct r1conf *conf, sector_t sector_nr); static void * r1bio_pool_alloc(gfp_t gfp_flags, void *data) { @@ -92,7 +91,6 @@ static void r1bio_pool_free(void *r1_bio #define RESYNC_WINDOW_SECTORS (RESYNC_WINDOW >> 9) #define CLUSTER_RESYNC_WINDOW (16 * RESYNC_WINDOW) #define CLUSTER_RESYNC_WINDOW_SECTORS (CLUSTER_RESYNC_WINDOW >> 9) -#define NEXT_NORMALIO_DISTANCE (3 * RESYNC_WINDOW_SECTORS) static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data) { @@ -198,6 +196,7 @@ static void put_buf(struct r1bio *r1_bio { struct r1conf *conf = r1_bio->mddev->private; int i; + sector_t sector_nr = r1_bio->sector; for (i = 0; i < conf->raid_disks * 2; i++) { struct bio *bio = r1_bio->bios[i]; @@ -207,7 +206,7 @@ static void put_buf(struct r1bio *r1_bio mempool_free(r1_bio, conf->r1buf_pool); - lower_barrier(conf); + lower_barrier(conf, sector_nr); } static void reschedule_retry(struct r1bio *r1_bio) @@ -215,10 +214,15 @@ static void reschedule_retry(struct r1bi unsigned long flags; struct mddev *mddev = r1_bio->mddev; struct r1conf *conf = mddev->private; + sector_t sector_nr; + long idx; + + sector_nr = r1_bio->sector; + idx = get_barrier_bucket_idx(sector_nr); spin_lock_irqsave(&conf->device_lock, flags); list_add(&r1_bio->retry_list, 
&conf->retry_list); - conf->nr_queued ++; + conf->nr_queued[idx]++; spin_unlock_irqrestore(&conf->device_lock, flags); wake_up(&conf->wait_barrier); @@ -235,8 +239,6 @@ static void call_bio_endio(struct r1bio struct bio *bio = r1_bio->master_bio; int done; struct r1conf *conf = r1_bio->mddev->private; - sector_t start_next_window = r1_bio->start_next_window; - sector_t bi_sector = bio->bi_iter.bi_sector; if (bio->bi_phys_segments) { unsigned long flags; @@ -255,19 +257,14 @@ static void call_bio_endio(struct r1bio if (!test_bit(R1BIO_Uptodate, &r1_bio->state)) bio->bi_error = -EIO; - if (done) { + if (done) bio_endio(bio); - /* - * Wake up any possible resync thread that waits for the device - * to go idle. - */ - allow_barrier(conf, start_next_window, bi_sector); - } } static void raid_end_bio_io(struct r1bio *r1_bio) { struct bio *bio = r1_bio->master_bio; + struct r1conf *conf = r1_bio->mddev->private; /* if nobody has done the final endio yet, do it now */ if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) { @@ -278,6 +275,12 @@ static void raid_end_bio_io(struct r1bio call_bio_endio(r1_bio); } + + /* + * Wake up any possible resync thread that waits for the device + * to go idle. + */ + allow_barrier(conf, r1_bio->sector); free_r1bio(r1_bio); } @@ -311,6 +314,7 @@ static int find_bio_disk(struct r1bio *r return mirror; } +/* bi_end_io callback for regular READ bio */ static void raid1_end_read_request(struct bio *bio) { int uptodate = !bio->bi_error; @@ -490,6 +494,25 @@ static void raid1_end_write_request(stru bio_put(to_put); } +static sector_t align_to_barrier_unit_end(sector_t start_sector, + sector_t sectors) +{ + sector_t len; + + WARN_ON(sectors == 0); + /* len is the number of sectors from start_sector to end of the + * barrier unit which start_sector belongs to. + */ + len = ((start_sector + sectors + (1<<BARRIER_UNIT_SECTOR_BITS) - 1) & + (~(BARRIER_UNIT_SECTOR_SIZE - 1))) - + start_sector; + + if (len > sectors) + len = sectors; + + return len; +} + /* * This routine returns the disk from which the requested read should * be done. There is a per-array 'next expected sequential IO' sector @@ -691,6 +714,7 @@ static int read_balance(struct r1conf *c conf->mirrors[best_disk].next_seq_sect = this_sector + sectors; } rcu_read_unlock(); + sectors = align_to_barrier_unit_end(this_sector, sectors); *max_sectors = sectors; return best_disk; @@ -779,168 +803,174 @@ static void flush_pending_writes(struct * there is no normal IO happeing. It must arrange to call * lower_barrier when the particular background IO completes. */ + static void raise_barrier(struct r1conf *conf, sector_t sector_nr) { + long idx = get_barrier_bucket_idx(sector_nr); + spin_lock_irq(&conf->resync_lock); /* Wait until no block IO is waiting */ - wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting, + wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting[idx], conf->resync_lock); /* block any new IO from starting */ - conf->barrier++; - conf->next_resync = sector_nr; + conf->barrier[idx]++; /* For these conditions we must wait: * A: while the array is in frozen state - * B: while barrier >= RESYNC_DEPTH, meaning resync reach - * the max count which allowed. - * C: next_resync + RESYNC_SECTORS > start_next_window, meaning - * next resync will reach to the window which normal bios are - * handling. - * D: while there are any active requests in the current window. + * B: while conf->nr_pending[idx] is not 0, meaning regular I/O + * existing in sector number ranges corresponding to idx. 
+ * C: while conf->barrier[idx] >= RESYNC_DEPTH, meaning resync reach + * the max count which allowed in sector number ranges + * conrresponding to idx. */ wait_event_lock_irq(conf->wait_barrier, - !conf->array_frozen && - conf->barrier < RESYNC_DEPTH && - conf->current_window_requests == 0 && - (conf->start_next_window >= - conf->next_resync + RESYNC_SECTORS), + !conf->array_frozen && !conf->nr_pending[idx] && + conf->barrier[idx] < RESYNC_DEPTH, conf->resync_lock); - - conf->nr_pending++; + conf->nr_pending[idx]++; spin_unlock_irq(&conf->resync_lock); } -static void lower_barrier(struct r1conf *conf) +static void lower_barrier(struct r1conf *conf, sector_t sector_nr) { unsigned long flags; - BUG_ON(conf->barrier <= 0); + long idx = get_barrier_bucket_idx(sector_nr); + + BUG_ON(conf->barrier[idx] <= 0); spin_lock_irqsave(&conf->resync_lock, flags); - conf->barrier--; - conf->nr_pending--; + conf->barrier[idx]--; + conf->nr_pending[idx]--; spin_unlock_irqrestore(&conf->resync_lock, flags); wake_up(&conf->wait_barrier); } -static bool need_to_wait_for_sync(struct r1conf *conf, struct bio *bio) +/* A regular I/O should wait when, + * - The whole array is frozen (both READ and WRITE) + * - bio is WRITE and in same barrier bucket conf->barrier[idx] raised + */ +static void _wait_barrier(struct r1conf *conf, long idx) { - bool wait = false; - - if (conf->array_frozen || !bio) - wait = true; - else if (conf->barrier && bio_data_dir(bio) == WRITE) { - if ((conf->mddev->curr_resync_completed - >= bio_end_sector(bio)) || - (conf->next_resync + NEXT_NORMALIO_DISTANCE - <= bio->bi_iter.bi_sector)) - wait = false; - else - wait = true; + spin_lock_irq(&conf->resync_lock); + if (conf->array_frozen || conf->barrier[idx]) { + conf->nr_waiting[idx]++; + /* Wait for the barrier to drop. */ + wait_event_lock_irq( + conf->wait_barrier, + !conf->array_frozen && !conf->barrier[idx], + conf->resync_lock); + conf->nr_waiting[idx]--; } - return wait; + conf->nr_pending[idx]++; + spin_unlock_irq(&conf->resync_lock); } -static sector_t wait_barrier(struct r1conf *conf, struct bio *bio) +static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr) { - sector_t sector = 0; + long idx = get_barrier_bucket_idx(sector_nr); spin_lock_irq(&conf->resync_lock); - if (need_to_wait_for_sync(conf, bio)) { - conf->nr_waiting++; - /* Wait for the barrier to drop. - * However if there are already pending - * requests (preventing the barrier from - * rising completely), and the - * per-process bio queue isn't empty, - * then don't wait, as we need to empty - * that queue to allow conf->start_next_window - * to increase. 
- */ - wait_event_lock_irq(conf->wait_barrier, - !conf->array_frozen && - (!conf->barrier || - ((conf->start_next_window < - conf->next_resync + RESYNC_SECTORS) && - current->bio_list && - !bio_list_empty(current->bio_list))), - conf->resync_lock); - conf->nr_waiting--; - } - - if (bio && bio_data_dir(bio) == WRITE) { - if (bio->bi_iter.bi_sector >= conf->next_resync) { - if (conf->start_next_window == MaxSector) - conf->start_next_window = - conf->next_resync + - NEXT_NORMALIO_DISTANCE; - - if ((conf->start_next_window + NEXT_NORMALIO_DISTANCE) - <= bio->bi_iter.bi_sector) - conf->next_window_requests++; - else - conf->current_window_requests++; - sector = conf->start_next_window; - } + if (conf->array_frozen) { + conf->nr_waiting[idx]++; + /* Wait for array to unfreeze */ + wait_event_lock_irq( + conf->wait_barrier, + !conf->array_frozen, + conf->resync_lock); + conf->nr_waiting[idx]--; } - - conf->nr_pending++; + conf->nr_pending[idx]++; spin_unlock_irq(&conf->resync_lock); - return sector; } -static void allow_barrier(struct r1conf *conf, sector_t start_next_window, - sector_t bi_sector) +static void wait_barrier(struct r1conf *conf, sector_t sector_nr) +{ + long idx = get_barrier_bucket_idx(sector_nr); + + _wait_barrier(conf, idx); +} + +static void wait_all_barriers(struct r1conf *conf) +{ + long idx; + + for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++) + _wait_barrier(conf, idx); +} + +static void _allow_barrier(struct r1conf *conf, long idx) { unsigned long flags; spin_lock_irqsave(&conf->resync_lock, flags); - conf->nr_pending--; - if (start_next_window) { - if (start_next_window == conf->start_next_window) { - if (conf->start_next_window + NEXT_NORMALIO_DISTANCE - <= bi_sector) - conf->next_window_requests--; - else - conf->current_window_requests--; - } else - conf->current_window_requests--; - - if (!conf->current_window_requests) { - if (conf->next_window_requests) { - conf->current_window_requests = - conf->next_window_requests; - conf->next_window_requests = 0; - conf->start_next_window += - NEXT_NORMALIO_DISTANCE; - } else - conf->start_next_window = MaxSector; - } - } + conf->nr_pending[idx]--; spin_unlock_irqrestore(&conf->resync_lock, flags); wake_up(&conf->wait_barrier); } +static void allow_barrier(struct r1conf *conf, sector_t sector_nr) +{ + long idx = get_barrier_bucket_idx(sector_nr); + + _allow_barrier(conf, idx); +} + +static void allow_all_barriers(struct r1conf *conf) +{ + long idx; + + for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++) + _allow_barrier(conf, idx); +} + + +/* conf->resync_lock should be held */ +static int get_all_pendings(struct r1conf *conf) +{ + long idx; + int ret; + + for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++) + ret += conf->nr_pending[idx]; + return ret; +} + +/* conf->resync_lock should be held */ +static int get_all_queued(struct r1conf *conf) +{ + long idx; + int ret; + + for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++) + ret += conf->nr_queued[idx]; + return ret; +} + static void freeze_array(struct r1conf *conf, int extra) { /* stop syncio and normal IO and wait for everything to * go quite. - * We wait until nr_pending match nr_queued+extra + * We wait until get_all_pending() matches get_all_queued()+extra, + * which means sum of conf->nr_pending[] matches sum of + * conf->nr_queued[] plus extra (which might be 0 or 1). * This is called in the context of one normal IO request * that has failed. 
Thus any sync request that might be pending - * will be blocked by nr_pending, and we need to wait for + * will be blocked by a conf->nr_pending[idx] which the idx depends + * on the request's sector number, and we need to wait for * pending IO requests to complete or be queued for re-try. - * Thus the number queued (nr_queued) plus this request (extra) - * must match the number of pending IOs (nr_pending) before - * we continue. + * Thus the number queued (sum of conf->nr_queued[]) plus this + * request (extra) must match the number of pending IOs (sum + * of conf->nr_pending[]) before we continue. */ spin_lock_irq(&conf->resync_lock); conf->array_frozen = 1; - wait_event_lock_irq_cmd(conf->wait_barrier, - conf->nr_pending == conf->nr_queued+extra, - conf->resync_lock, - flush_pending_writes(conf)); + wait_event_lock_irq_cmd( + conf->wait_barrier, + get_all_pendings(conf) == get_all_queued(conf)+extra, + conf->resync_lock, + flush_pending_writes(conf)); spin_unlock_irq(&conf->resync_lock); } static void unfreeze_array(struct r1conf *conf) @@ -1031,6 +1061,7 @@ static void raid1_unplug(struct blk_plug kfree(plug); } + static void raid1_make_request(struct mddev *mddev, struct bio * bio) { struct r1conf *conf = mddev->private; @@ -1051,7 +1082,6 @@ static void raid1_make_request(struct md int first_clone; int sectors_handled; int max_sectors; - sector_t start_next_window; /* * Register the new request and wait if the reconstruction @@ -1087,8 +1117,6 @@ static void raid1_make_request(struct md finish_wait(&conf->wait_barrier, &w); } - start_next_window = wait_barrier(conf, bio); - bitmap = mddev->bitmap; /* @@ -1121,6 +1149,14 @@ static void raid1_make_request(struct md int rdisk; read_again: + /* Still need barrier for READ in case that whole + * array is frozen. + */ + wait_read_barrier(conf, r1_bio->sector); + /* max_sectors from read_balance is modified to no + * go across border of the barrier unit which + * r1_bio->sector is in. + */ rdisk = read_balance(conf, r1_bio, &max_sectors); if (rdisk < 0) { @@ -1140,7 +1176,6 @@ read_again: atomic_read(&bitmap->behind_writes) == 0); } r1_bio->read_disk = rdisk; - r1_bio->start_next_window = 0; read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev); bio_trim(read_bio, r1_bio->sector - bio->bi_iter.bi_sector, @@ -1211,7 +1246,7 @@ read_again: disks = conf->raid_disks * 2; retry_write: - r1_bio->start_next_window = start_next_window; + wait_barrier(conf, r1_bio->sector); blocked_rdev = NULL; rcu_read_lock(); max_sectors = r1_bio->sectors; @@ -1279,27 +1314,17 @@ read_again: if (unlikely(blocked_rdev)) { /* Wait for this device to become unblocked */ int j; - sector_t old = start_next_window; for (j = 0; j < i; j++) if (r1_bio->bios[j]) rdev_dec_pending(conf->mirrors[j].rdev, mddev); r1_bio->state = 0; - allow_barrier(conf, start_next_window, bio->bi_iter.bi_sector); + allow_barrier(conf, r1_bio->sector); md_wait_for_blocked_rdev(blocked_rdev, mddev); - start_next_window = wait_barrier(conf, bio); - /* - * We must make sure the multi r1bios of bio have - * the same value of bi_phys_segments - */ - if (bio->bi_phys_segments && old && - old != start_next_window) - /* Wait for the former r1bio(s) to complete */ - wait_event(conf->wait_barrier, - bio->bi_phys_segments == 1); goto retry_write; } + max_sectors = align_to_barrier_unit_end(r1_bio->sector, max_sectors); if (max_sectors < r1_bio->sectors) { /* We are splitting this write into multiple parts, so * we need to prepare for allocating another r1_bio. 
@@ -1495,19 +1520,11 @@ static void print_conf(struct r1conf *co static void close_sync(struct r1conf *conf) { - wait_barrier(conf, NULL); - allow_barrier(conf, 0, 0); + wait_all_barriers(conf); + allow_all_barriers(conf); mempool_destroy(conf->r1buf_pool); conf->r1buf_pool = NULL; - - spin_lock_irq(&conf->resync_lock); - conf->next_resync = MaxSector - 2 * NEXT_NORMALIO_DISTANCE; - conf->start_next_window = MaxSector; - conf->current_window_requests += - conf->next_window_requests; - conf->next_window_requests = 0; - spin_unlock_irq(&conf->resync_lock); } static int raid1_spare_active(struct mddev *mddev) @@ -1787,7 +1804,7 @@ static int fix_sync_read_error(struct r1 struct bio *bio = r1_bio->bios[r1_bio->read_disk]; sector_t sect = r1_bio->sector; int sectors = r1_bio->sectors; - int idx = 0; + long idx = 0; while(sectors) { int s = sectors; @@ -1983,6 +2000,14 @@ static void process_checks(struct r1bio } } +/* If there is no error encountered during sync writing out, there are two + * places to destroy r1_bio: + * - sync_request_write(): If all wbio completed even before returning + * back to its caller. + * - end_sync_write(): when all remaining sync writes are done + * When there are error encountered from the above functions, r1_bio will + * be handled to handle_sync_write_finish() by reschedule_retry(). + */ static void sync_request_write(struct mddev *mddev, struct r1bio *r1_bio) { struct r1conf *conf = mddev->private; @@ -2244,6 +2269,9 @@ static void handle_write_finished(struct { int m; bool fail = false; + sector_t sector_nr; + long idx; + for (m = 0; m < conf->raid_disks * 2 ; m++) if (r1_bio->bios[m] == IO_MADE_GOOD) { struct md_rdev *rdev = conf->mirrors[m].rdev; @@ -2269,7 +2297,9 @@ static void handle_write_finished(struct if (fail) { spin_lock_irq(&conf->device_lock); list_add(&r1_bio->retry_list, &conf->bio_end_io_list); - conf->nr_queued++; + sector_nr = r1_bio->sector; + idx = get_barrier_bucket_idx(sector_nr); + conf->nr_queued[idx]++; spin_unlock_irq(&conf->device_lock); md_wakeup_thread(conf->mddev->thread); } else { @@ -2380,6 +2410,8 @@ static void raid1d(struct md_thread *thr struct r1conf *conf = mddev->private; struct list_head *head = &conf->retry_list; struct blk_plug plug; + sector_t sector_nr; + long idx; md_check_recovery(mddev); @@ -2387,17 +2419,19 @@ static void raid1d(struct md_thread *thr !test_bit(MD_CHANGE_PENDING, &mddev->flags)) { LIST_HEAD(tmp); spin_lock_irqsave(&conf->device_lock, flags); - if (!test_bit(MD_CHANGE_PENDING, &mddev->flags)) { - while (!list_empty(&conf->bio_end_io_list)) { - list_move(conf->bio_end_io_list.prev, &tmp); - conf->nr_queued--; - } - } + if (!test_bit(MD_CHANGE_PENDING, &mddev->flags)) + list_splice_init(&conf->bio_end_io_list, &tmp); spin_unlock_irqrestore(&conf->device_lock, flags); + while (!list_empty(&tmp)) { r1_bio = list_first_entry(&tmp, struct r1bio, retry_list); list_del(&r1_bio->retry_list); + sector_nr = r1_bio->sector; + idx = get_barrier_bucket_idx(sector_nr); + spin_lock_irqsave(&conf->device_lock, flags); + conf->nr_queued[idx]--; + spin_unlock_irqrestore(&conf->device_lock, flags); if (mddev->degraded) set_bit(R1BIO_Degraded, &r1_bio->state); if (test_bit(R1BIO_WriteError, &r1_bio->state)) @@ -2418,7 +2452,9 @@ static void raid1d(struct md_thread *thr } r1_bio = list_entry(head->prev, struct r1bio, retry_list); list_del(head->prev); - conf->nr_queued--; + sector_nr = r1_bio->sector; + idx = get_barrier_bucket_idx(sector_nr); + conf->nr_queued[idx]--; spin_unlock_irqrestore(&conf->device_lock, flags); 
mddev = r1_bio->mddev; @@ -2457,7 +2493,6 @@ static int init_resync(struct r1conf *co conf->poolinfo); if (!conf->r1buf_pool) return -ENOMEM; - conf->next_resync = 0; return 0; } @@ -2486,6 +2521,7 @@ static sector_t raid1_sync_request(struc int still_degraded = 0; int good_sectors = RESYNC_SECTORS; int min_bad = 0; /* number of sectors that are bad in all devices */ + long idx = get_barrier_bucket_idx(sector_nr); if (!conf->r1buf_pool) if (init_resync(conf)) @@ -2535,7 +2571,7 @@ static sector_t raid1_sync_request(struc * If there is non-resync activity waiting for a turn, then let it * though before starting on this new sync request. */ - if (conf->nr_waiting) + if (conf->nr_waiting[idx]) schedule_timeout_uninterruptible(1); /* we are incrementing sector_nr below. To be safe, we check against @@ -2562,6 +2598,8 @@ static sector_t raid1_sync_request(struc r1_bio->sector = sector_nr; r1_bio->state = 0; set_bit(R1BIO_IsSync, &r1_bio->state); + /* make sure good_sectors won't go across barrier unit border */ + good_sectors = align_to_barrier_unit_end(sector_nr, good_sectors); for (i = 0; i < conf->raid_disks * 2; i++) { struct md_rdev *rdev; @@ -2786,6 +2824,22 @@ static struct r1conf *setup_conf(struct if (!conf) goto abort; + conf->nr_pending = kzalloc(PAGE_SIZE, GFP_KERNEL); + if (!conf->nr_pending) + goto abort; + + conf->nr_waiting = kzalloc(PAGE_SIZE, GFP_KERNEL); + if (!conf->nr_waiting) + goto abort; + + conf->nr_queued = kzalloc(PAGE_SIZE, GFP_KERNEL); + if (!conf->nr_queued) + goto abort; + + conf->barrier = kzalloc(PAGE_SIZE, GFP_KERNEL); + if (!conf->barrier) + goto abort; + conf->mirrors = kzalloc(sizeof(struct raid1_info) * mddev->raid_disks * 2, GFP_KERNEL); @@ -2841,9 +2895,6 @@ static struct r1conf *setup_conf(struct conf->pending_count = 0; conf->recovery_disabled = mddev->recovery_disabled - 1; - conf->start_next_window = MaxSector; - conf->current_window_requests = conf->next_window_requests = 0; - err = -EIO; for (i = 0; i < conf->raid_disks * 2; i++) { @@ -2890,6 +2941,10 @@ static struct r1conf *setup_conf(struct kfree(conf->mirrors); safe_put_page(conf->tmppage); kfree(conf->poolinfo); + kfree(conf->nr_pending); + kfree(conf->nr_waiting); + kfree(conf->nr_queued); + kfree(conf->barrier); kfree(conf); } return ERR_PTR(err); @@ -2992,6 +3047,10 @@ static void raid1_free(struct mddev *mdd kfree(conf->mirrors); safe_put_page(conf->tmppage); kfree(conf->poolinfo); + kfree(conf->nr_pending); + kfree(conf->nr_waiting); + kfree(conf->nr_queued); + kfree(conf->barrier); kfree(conf); } Index: linux-raid1/drivers/md/raid1.h =================================================================== --- linux-raid1.orig/drivers/md/raid1.h +++ linux-raid1/drivers/md/raid1.h @@ -1,6 +1,20 @@ #ifndef _RAID1_H #define _RAID1_H +/* each barrier unit size is 64MB fow now + * note: it must be larger than RESYNC_DEPTH + */ +#define BARRIER_UNIT_SECTOR_BITS 17 +#define BARRIER_UNIT_SECTOR_SIZE (1<<17) +#define BARRIER_BUCKETS_NR (PAGE_SIZE/sizeof(long)) + +/* will use bit shift later */ +static inline long get_barrier_bucket_idx(sector_t sector) +{ + return (long)(sector >> BARRIER_UNIT_SECTOR_BITS) % BARRIER_BUCKETS_NR; + +} + struct raid1_info { struct md_rdev *rdev; sector_t head_position; @@ -35,25 +49,6 @@ struct r1conf { */ int raid_disks; - /* During resync, read_balancing is only allowed on the part - * of the array that has been resynced. 'next_resync' tells us - * where that is. 
- */ - sector_t next_resync; - - /* When raid1 starts resync, we divide array into four partitions - * |---------|--------------|---------------------|-------------| - * next_resync start_next_window end_window - * start_next_window = next_resync + NEXT_NORMALIO_DISTANCE - * end_window = start_next_window + NEXT_NORMALIO_DISTANCE - * current_window_requests means the count of normalIO between - * start_next_window and end_window. - * next_window_requests means the count of normalIO after end_window. - * */ - sector_t start_next_window; - int current_window_requests; - int next_window_requests; - spinlock_t device_lock; /* list of 'struct r1bio' that need to be processed by raid1d, @@ -79,10 +74,10 @@ struct r1conf { */ wait_queue_head_t wait_barrier; spinlock_t resync_lock; - int nr_pending; - int nr_waiting; - int nr_queued; - int barrier; + int *nr_pending; + int *nr_waiting; + int *nr_queued; + int *barrier; int array_frozen; /* Set to 1 if a full sync is needed, (fresh device added). @@ -135,7 +130,6 @@ struct r1bio { * in this BehindIO request */ sector_t sector; - sector_t start_next_window; int sectors; unsigned long state; struct mddev *mddev; -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html