Re: [PATCH 3/3] raid1: Rewrite the implementation of iobarrier.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, 19 Aug 2013 19:58:07 +0800 majianpeng <majianpeng@xxxxxxxxx> wrote:

> There is a iobarrier in raid1 because the contend between nornalIO and
> resyncIO.It suspend all nornal IO when reysync.
> But if nornal IO is outrange the resync windwos,there is no contend.
> So I rewrite the iobarrier.
> 
> I patition the whole space into three parts.
> 
> |---------|-----------|------------|-----------|----------
> 	     Pa 	   next_resync	   Pb	       Pc
> 
> Firtly i introduce some concept:
> 1:RESYNC_WINDOW: For resync, there are 32 sync requests at most.A sync
> request is RESYNC_BLOCK_SIZE(64*1024).So the RESYNC_WINDOW is 32 *
> RESYNC_BLOCK_SIZE, that is 2MB.
> 2:next_resync mean the next start sector which will being do sync.
> 3:Pa means a position which before RESYNC_WINDOW distance from
> next_resync.
> 4:Pb means a position which after 3 * RESYNC_WINDOW distance from
> next_resync.
> 5:Pc means a position which after 6 * RESYNC_WINDOWS disttance from
> next_resync.
> 
> NornalIO will be partitioned into four types.
> 
> NoIO1: it means the end sector of bio is smaller or equal the Pa.
> NoIO2: it means the start sector of bio larger or equal Pc.
> NoIO3: it means the start sector of bio larger or equal Pb.For those
> 	tyeps, i don't care the location of end sector.
> NoIO4: it means the location between Pa and Pb.
> 
> For NoIO1,it don't use iobarrier.
> For NoIO4,it used the original iobarrier mechanism.The nornalIO and
> syncIO must be contend.
> For NoIO2/3, i add two filed in struct r1conf "after_resync_three_count,
> after_resync_six_count".They indicate the count of
> two types.For those, they don't wait.They only add the relevant the
> counter.
> 
> For resync action, if there are NoIO4s, it must be wait.If not,it
> forward.But if there are NoIO3,it must wait until the counter of NoIO3
> became zero.If there are NoIO2, it must wait until the counter of NoIO2
> became zero.
> 
> There is a problem which when and how to change from NoIO2 to NoIO3.Only
> so, the sync action can forward.
> 
> I add a filed in struct r1conf "after_resync_sector".It means the
> position Pb.What condition will change?
> A: if after_resync_sector == MaxSector, it means there are no NoIO2/3.
>    So after_resync_sector = Pb.
> B: if after_resync_six_count == 0 && after_resync_six_count != 0, it
>    means after_resync_sector forward 3 * RESYNC_WINDOW.
> 
> There is anthor problem which how to differentiate when NoIO2 became
> NoIO3.For example: firstly r1bioA is NoIO2.Before r1bioA completed,
> there is no NoIO3.So r1bioA will be became to NoIO3.
> At generil,we should use flags and list_head. The codes should:
> >list_splice_init(six_head, three_head);
> >after_reysnc_three_count = after_resync_six_count;
> >after_resync_six_count = 0;
> >list_for_each_entry(three_head){
> >	clear_bit(RESYNC_SIX_FLAG),
> >	set_bit(RESYNC_THREE_FLAG),
> >}
> If there are many NoIO2, it will take too long in list_for_each_entry.
> Avoid this, i add a filed in struct r1bio
> "after_threetimes_resync_sector".
> I used this to record the position Pb when call wait_barrier in func
> make_request.
> For NoIO1/4, the value is zero.
> In func allow_barrier, it check the conf->after_resync_sector.
> If after_threetimes_resync_sector == conf->after_resync_sector,it mean
> there is no transition which NoIO2 to NoIO3.In this condtion
>  if bio->bi_sector > conf->after_resync_sector + 3 * RESYNC_WINWOD
>     it means the bio is NoIO2;
>  else
>     it means the bio si NoIO3.
> 
> If after_threetimes_resync_sector != conf->after_resync_sector,it mean
> there is a trnasiton which NoIO2 to NoIO3.Because there is at most one
> transtions.So it only means the bio is NoIO2.
> 
> For one bio,there are many r1bio.So we make sure the
> r1bio->after_threetimes_resync_sector is the same value.
> If we met blocked_dev, it must call allow_barrier and wait_barrier.
> So the first and the second value of conf->after_resync_sector will
> change.If there are many r1bio's with differnet
> after_threetimes_resync_sector.For the relevant bio, it depent on the
> last value of r1bio.It will cause error.To avoid this, we must wait
> former r1bios completes.
> 
> Signed-off-by: Jianpeng Ma <majianpeng@xxxxxxxxx>

Hi,
 this is certainly looking better.  I've made a number of comments below...



> ---
>  drivers/md/raid1.c | 153 +++++++++++++++++++++++++++++++++++++++++++++--------
>  drivers/md/raid1.h |   4 ++
>  2 files changed, 134 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 2fd74ec..92909ea 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -66,7 +66,9 @@
>   */
>  static int max_queued_requests = 1024;
>  
> -static void allow_barrier(struct r1conf *conf);
> +static void allow_barrier(struct r1conf *conf,
> +				sector_t after_threetimes_resync_sector,
> +				sector_t bi_start_sector);
>  static void lower_barrier(struct r1conf *conf);
>  
>  static void * r1bio_pool_alloc(gfp_t gfp_flags, void *data)
> @@ -84,10 +86,11 @@ static void r1bio_pool_free(void *r1_bio, void *data)
>  }
>  
>  #define RESYNC_BLOCK_SIZE (64*1024)
> -//#define RESYNC_BLOCK_SIZE PAGE_SIZE
> +#define RESYNC_DEPTH 32
>  #define RESYNC_SECTORS (RESYNC_BLOCK_SIZE >> 9)
>  #define RESYNC_PAGES ((RESYNC_BLOCK_SIZE + PAGE_SIZE-1) / PAGE_SIZE)
> -#define RESYNC_WINDOW (2048*1024)
> +#define RESYNC_WINDOW (RESYNC_BLOCK_SIZE * RESYNC_DEPTH)
> +#define RESYNC_WINDOW_SECTORS (RESYNC_WINDOW >> 9)

A patch which just tidies up these #defines might be nice.  This patch is
rather large and anything that can be split into a separate patch should be I
think.


>  
>  static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
>  {
> @@ -225,13 +228,17 @@ static void call_bio_endio(struct r1bio *r1_bio)
>  	struct bio *bio = r1_bio->master_bio;
>  	int done;
>  	struct r1conf *conf = r1_bio->mddev->private;
> -
> +	unsigned long flags;

Why have you moved 'flags' from where it was to here?  There is no need so
best to leave it alone.

> +	sector_t after_threetimes_resync_sector =
> +		r1_bio->after_threetimes_resync_sector;
> +	sector_t bi_start_sector = bio->bi_sector;
> +

"after_threetimes_resync_sector" is much too long for a variable name.
And if one day I decide that 4 times resync_sectors is better, I would need
to change that name everywhere.

There are (as you describe above) two windows, each 3*resync_sectors in size.
A "current" window in which no normal writes are permitted, and a "next"
window.  "after_threetimes_resync_sector" is the send of the current window,
and the start of the next window.  So
  start_next_sync_window
would be a much better name, and is noticeably shorter.  I would probably
prefer
  start_next_window
which is quite a manageable size.
  end_current_window
would also be OK.


>  	if (bio->bi_phys_segments) {
> -		unsigned long flags;
>  		spin_lock_irqsave(&conf->device_lock, flags);
>  		bio->bi_phys_segments--;
>  		done = (bio->bi_phys_segments == 0);
>  		spin_unlock_irqrestore(&conf->device_lock, flags);
> +		wake_up(&conf->wait_barrier);

A brief comment explaining why this wake_up is needed would help.
   /* make_request might be waiting for bi_phys_sectors to decrease */


>  	} else
>  		done = 1;
>  
> @@ -243,7 +250,8 @@ static void call_bio_endio(struct r1bio *r1_bio)
>  		 * Wake up any possible resync thread that waits for the device
>  		 * to go idle.
>  		 */
> -		allow_barrier(conf);
> +		allow_barrier(conf, after_threetimes_resync_sector,
> +				bi_start_sector);
>  	}
>  }
>  
> @@ -814,8 +822,6 @@ static void flush_pending_writes(struct r1conf *conf)
>   *    there is no normal IO happeing.  It must arrange to call
>   *    lower_barrier when the particular background IO completes.
>   */
> -#define RESYNC_DEPTH 32
> -
>  static void raise_barrier(struct r1conf *conf)
>  {
>  	spin_lock_irq(&conf->resync_lock);
> @@ -826,11 +832,18 @@ static void raise_barrier(struct r1conf *conf)
>  
>  	/* block any new IO from starting */
>  	conf->barrier++;
> -
> -	/* Now wait for all pending IO to complete */
> +	/* For those conditions, we must wait
> +	 * A:array is in freeze state.
> +	 * B:When starting resync and there are normal IO
> +	 * C:There is no nornal io after three times in resync window
> +	 * D: total barrier are less than RESYNC_DEPTH*/
>  	wait_event_lock_irq(conf->wait_barrier,
> -			    !conf->freeze_array && 
> -			    !conf->nr_pending && conf->barrier < RESYNC_DEPTH,
> +			    !conf->freeze_array &&
> +			    ((conf->next_resync < RESYNC_WINDOW_SECTORS
> +				&& !conf->nr_pending)
> +			    || (conf->next_resync + RESYNC_SECTORS
> +				    <= conf->after_resync_sector))
> +			    && conf->barrier < RESYNC_DEPTH,
>  			    conf->resync_lock);

For "C:There is no nornal io after three times in resync window" I expected
to see
          conf->after_resync_three_count == 0

but I don't see that.  Why not?


>  
>  	spin_unlock_irq(&conf->resync_lock);
> @@ -846,10 +859,57 @@ static void lower_barrier(struct r1conf *conf)
>  	wake_up(&conf->wait_barrier);
>  }
>  
> -static void wait_barrier(struct r1conf *conf)
> +static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
>  {
> +	sector_t sector = 0;
>  	spin_lock_irq(&conf->resync_lock);
> -	if (conf->barrier) {
> +	if (conf->barrier || unlikely(conf->freeze_array)) {
> +		if (unlikely(!conf->freeze_array && bio) &&
> +			bio_data_dir(bio) == WRITE &&
> +			conf->next_resync > RESYNC_WINDOW_SECTORS) {
> +			if (bio_end_sector(bio) <= conf->next_resync -
> +				RESYNC_WINDOW_SECTORS)
> +				goto no_wait;
> +			if (conf->after_resync_sector == MaxSector) {
> +				if (bio->bi_sector >= conf->next_resync +
> +					6 * RESYNC_WINDOW_SECTORS) {
> +					if (conf->after_resync_six_count +
> +						conf->after_resync_three_count == 0)
> +						conf->after_resync_sector =
> +							conf->next_resync + 3 *
> +							RESYNC_WINDOW_SECTORS;
> +					conf->after_resync_six_count++;
> +					sector = conf->after_resync_sector;
> +					goto no_wait;
> +				}
> +				if (bio->bi_sector >= conf->next_resync +
> +					3 * RESYNC_WINDOW_SECTORS) {
> +					if (conf->after_resync_six_count +
> +						conf->after_resync_three_count == 0)
> +						conf->after_resync_sector =
> +							conf->next_resync + 3 *
> +							RESYNC_WINDOW_SECTORS;
> +					conf->after_resync_three_count++;
> +					sector = conf->after_resync_sector;
> +					goto no_wait;
> +				}
> +
> +			} else {
> +				if (bio->bi_sector >= conf->after_resync_sector + 3 *
> +							RESYNC_WINDOW_SECTORS) {
> +					conf->after_resync_six_count++;
> +					sector = conf->after_resync_sector;
> +					goto no_wait;
> +				}
> +				if (bio->bi_sector >= conf->after_resync_sector) {
> +					conf->after_resync_three_count++;
> +					sector = conf->after_resync_sector;
> +					goto no_wait;
> +				}
> +			}
> +		} else if (unlikely(!conf->freeze_array && bio) &&
> +				bio_data_dir(bio) != WRITE)
> +			goto no_wait;

This is really long set of nested conditions that is probably more complex
than needed, and certainly should be in a separate function.
Then you could have something like:
   if (need_to_wait_for_sync(...) {
        conf->nr_waiting++
        wait.....
        conf->nr_waiting--;
   }

and not need the "no_wait" label.

I suspect there should only be one place in the function which does
after_resync_three_count++ and one for after_resync_six_count++;

And they should be something like
   current_window_requests
and
   next_window_requests.

>  		conf->nr_waiting++;
>  		/* Wait for the barrier to drop.
>  		 * However if there are already pending
> @@ -869,15 +929,46 @@ static void wait_barrier(struct r1conf *conf)
>  				    conf->resync_lock);
>  		conf->nr_waiting--;
>  	}
> +no_wait:
>  	conf->nr_pending++;
>  	spin_unlock_irq(&conf->resync_lock);
> +	return sector;
>  }
>  
> -static void allow_barrier(struct r1conf *conf)
> +static void

Please avoid this sort of unnecessary change.  Leave "allow_barrier" on the
same line as "static void".

> +allow_barrier(struct r1conf *conf,
> +		sector_t after_threetimes_resync_sector,
> +		sector_t bi_start_sector)
>  {
>  	unsigned long flags;
>  	spin_lock_irqsave(&conf->resync_lock, flags);
>  	conf->nr_pending--;
> +
> +	if (after_threetimes_resync_sector) {
> +		if (after_threetimes_resync_sector ==
> +				conf->after_resync_sector) {
> +			if (bi_start_sector >= (after_threetimes_resync_sector
> +						+ 3 * RESYNC_WINDOW_SECTORS))
> +				conf->after_resync_six_count--;
> +			else
> +				conf->after_resync_three_count--;
> +		} else
> +			conf->after_resync_three_count--;
> +
> +		if (!conf->after_resync_three_count) {
> +			if (conf->after_resync_six_count) {
> +				conf->after_resync_three_count =
> +					conf->after_resync_six_count;
> +				conf->after_resync_six_count = 0;
> +				conf->after_resync_sector +=
> +					3 * RESYNC_WINDOW_SECTORS;
> +			} else
> +				conf->after_resync_sector = MaxSector;
> +		}
> +		BUG_ON(conf->after_resync_three_count < 0 ||
> +			conf->after_resync_six_count < 0);
> +	}
> +
>  	spin_unlock_irqrestore(&conf->resync_lock, flags);
>  	wake_up(&conf->wait_barrier);
>  }
> @@ -908,8 +999,8 @@ static void unfreeze_array(struct r1conf *conf)
>  	/* reverse the effect of the freeze */
>  	spin_lock_irq(&conf->resync_lock);
>  	conf->freeze_array = 0;
> -	wake_up(&conf->wait_barrier);
>  	spin_unlock_irq(&conf->resync_lock);
> +	wake_up(&conf->wait_barrier);

This seems unnecessary, so should not be here.  Is there a reason you made
this change?

>  }
>  
>  
> @@ -1012,6 +1103,7 @@ static void make_request(struct mddev *mddev, struct bio * bio)
>  	int first_clone;
>  	int sectors_handled;
>  	int max_sectors;
> +	sector_t after_threetimes_resync_sector = 0;
>  
>  	/*
>  	 * Register the new request and wait if the reconstruction
> @@ -1041,7 +1133,7 @@ static void make_request(struct mddev *mddev, struct bio * bio)
>  		finish_wait(&conf->wait_barrier, &w);
>  	}
>  
> -	wait_barrier(conf);
> +	after_threetimes_resync_sector = wait_barrier(conf, bio);
>  
>  	bitmap = mddev->bitmap;
>  
> @@ -1057,7 +1149,6 @@ static void make_request(struct mddev *mddev, struct bio * bio)
>  	r1_bio->state = 0;
>  	r1_bio->mddev = mddev;
>  	r1_bio->sector = bio->bi_sector;
> -

Don't delete blanks lines unnecessarily.

>  	/* We might need to issue multiple reads to different
>  	 * devices if there are bad blocks around, so we keep
>  	 * track of the number of reads in bio->bi_phys_segments.
> @@ -1162,6 +1253,8 @@ read_again:
>  
>  	disks = conf->raid_disks * 2;
>   retry_write:
> +	r1_bio->after_threetimes_resync_sector = after_threetimes_resync_sector;
> +
>  	blocked_rdev = NULL;
>  	rcu_read_lock();
>  	max_sectors = r1_bio->sectors;
> @@ -1230,14 +1323,23 @@ read_again:
>  	if (unlikely(blocked_rdev)) {
>  		/* Wait for this device to become unblocked */
>  		int j;
> -
> +		sector_t old = after_threetimes_resync_sector;

Please keep a blank line between variable declarations and code.


>  		for (j = 0; j < i; j++)
>  			if (r1_bio->bios[j])
>  				rdev_dec_pending(conf->mirrors[j].rdev, mddev);
>  		r1_bio->state = 0;
> -		allow_barrier(conf);
> +		allow_barrier(conf, old, bio->bi_sector);
>  		md_wait_for_blocked_rdev(blocked_rdev, mddev);
> -		wait_barrier(conf);
> +		after_threetimes_resync_sector = wait_barrier(conf, bio);
> +		/*
> +		 * We must make sure the multi r1bios of bio have
> +		 * the same value of after_threetimes_resync_sector
> +		 */
> +		if (bio->bi_phys_segments && old &&
> +			old != after_threetimes_resync_sector)
> +			/*wait the former r1bio(s) completed*/
> +			wait_event(conf->wait_barrier,
> +					bio->bi_phys_segments == 1);

I think it is quite possible that bi_phys_segments==0 at this point.
So you probably want "<= 1".

>  		goto retry_write;
>  	}
>  
> @@ -1437,11 +1539,13 @@ static void print_conf(struct r1conf *conf)
>  
>  static void close_sync(struct r1conf *conf)
>  {
> -	wait_barrier(conf);
> -	allow_barrier(conf);
> +	wait_barrier(conf, NULL);
> +	allow_barrier(conf, 0, 0);
>  
>  	mempool_destroy(conf->r1buf_pool);
>  	conf->r1buf_pool = NULL;
> +	conf->next_resync = 0;
> +	conf->after_resync_sector = conf->mddev->dev_sectors;
>  }
>  
>  static int raid1_spare_active(struct mddev *mddev)
> @@ -2712,6 +2816,9 @@ static struct r1conf *setup_conf(struct mddev *mddev)
>  	conf->pending_count = 0;
>  	conf->recovery_disabled = mddev->recovery_disabled - 1;
>  
> +	conf->after_resync_sector = MaxSector;
> +	conf->after_resync_six_count = conf->after_resync_three_count = 0;
> +
>  	err = -EIO;
>  	for (i = 0; i < conf->raid_disks * 2; i++) {
>  
> diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
> index 8afd300..3ab1405 100644
> --- a/drivers/md/raid1.h
> +++ b/drivers/md/raid1.h
> @@ -40,6 +40,9 @@ struct r1conf {
>  	 * where that is.
>  	 */
>  	sector_t		next_resync;
> +	sector_t		after_resync_sector;
> +	int			after_resync_three_count;
> +	int			after_resync_six_count;
>  
>  	spinlock_t		device_lock;
>  
> @@ -112,6 +115,7 @@ struct r1bio {
>  						 * in this BehindIO request
>  						 */
>  	sector_t		sector;
> +	sector_t		after_threetimes_resync_sector;
>  	int			sectors;
>  	unsigned long		state;
>  	struct mddev		*mddev;

Thanks,
NeilBrown

Attachment: signature.asc
Description: PGP signature


[Index of Archives]     [Linux RAID Wiki]     [ATA RAID]     [Linux SCSI Target Infrastructure]     [Linux Block]     [Linux IDE]     [Linux SCSI]     [Linux Hams]     [Device Mapper]     [Device Mapper Cryptographics]     [Kernel]     [Linux Admin]     [Linux Net]     [GFS]     [RPM]     [git]     [Yosemite Forum]


  Powered by Linux