On 2016/11/30 3:29 AM, Shaohua Li wrote:
> On Mon, Nov 28, 2016 at 02:42:22PM +0800, Coly Li wrote:
>> On 2016/11/23 5:35 AM, Shaohua Li wrote:
>>> On Tue, Nov 22, 2016 at 05:54:00AM +0800, Coly Li wrote:
>>>> 'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")'
>>>> introduces a sliding resync window for the raid1 I/O barrier. The idea
>>>> limits I/O barriers to happen only inside a sliding resync window; regular
>>>> I/Os outside this resync window don't need to wait for the barrier any
>>>> more. On a large raid1 device, it helps a lot to improve parallel write
>>>> I/O throughput when there are background resync I/Os running at the
>>>> same time.
>>>>
>>>> The idea of the sliding resync window is awesome, but there are several
>>>> challenges which are very difficult to solve,
>>>>  - code complexity
>>>>    The sliding resync window requires several variables to work
>>>>    collectively; this is complex and very hard to make work correctly.
>>>>    Just grep "Fixes: 79ef3a8aa1" in the kernel git log, there are 8 more
>>>>    patches fixing the original resync window patch. This is not the end,
>>>>    any further related modification may easily introduce more regressions.
>>>>  - multiple sliding resync windows
>>>>    Currently the raid1 code only has a single sliding resync window, so we
>>>>    cannot do parallel resync with the current I/O barrier implementation.
>>>>    Implementing multiple resync windows is much more complex, and very
>>>>    hard to make correct.
>>>>
>>>> Therefore I decided to implement a much simpler raid1 I/O barrier; by
>>>> removing the resync window code, I believe life will be much easier.
>>>>
>>>> The brief idea of the simpler barrier is,
>>>>  - Do not maintain a global unique resync window.
>>>>  - Use multiple hash buckets to reduce I/O barrier conflicts; a regular
>>>>    I/O only has to wait for a resync I/O when both of them have the same
>>>>    barrier bucket index, and vice versa.
>>>>  - I/O barrier conflicts can be reduced to an acceptable number if there
>>>>    are enough barrier buckets.
>>>>
>>>> Here I explain how the barrier buckets are designed,
>>>>  - BARRIER_UNIT_SECTOR_SIZE
>>>>    The whole LBA address space of a raid1 device is divided into multiple
>>>>    barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE. A bio request
>>>>    won't go across the border of a barrier unit, which means the maximum
>>>>    bio size is BARRIER_UNIT_SECTOR_SIZE<<9 bytes.
>>>>  - BARRIER_BUCKETS_NR
>>>>    There are BARRIER_BUCKETS_NR buckets in total. If multiple I/O requests
>>>>    hit different barrier units, they only need to compete for the I/O
>>>>    barrier with other I/Os which hit the same barrier bucket index. The
>>>>    index of the barrier bucket which a bio should look at is calculated by
>>>>    get_barrier_bucket_idx(),
>>>> 	(sector >> BARRIER_UNIT_SECTOR_BITS) % BARRIER_BUCKETS_NR
>>>>    where sector is the start sector number of the bio.
>>>>    align_to_barrier_unit_end() will make sure the final bio sent into
>>>>    generic_make_request() won't exceed the border of the barrier unit size.
>>>>  - BARRIER_BUCKETS_NR
>>>>    The number of barrier buckets is defined by,
>>>> 	#define BARRIER_BUCKETS_NR	(PAGE_SIZE/sizeof(long))
>>>>    For a 4KB page size, there are 512 buckets for each raid1 device. That
>>>>    means the probability of a barrier conflict for fully random I/O may be
>>>>    reduced down to 1/512.
>>>
>>> Thanks! The idea is awesome and does make the code easier to understand.
>>>
>>> For hash conflicts, I'm worried about one case: resync does sequential
>>> access. Say we have a sequential normal IO.
>>> If the first normal IO and the resync IO conflict, it's possible they will
>>> keep conflicting subsequently, since both are doing sequential access. We
>>> can change the hash to something like this:
>>>
>>> for the first 64M * 512, the hash is 0, 1, ... 511
>>> For the second 64M * 512, the hash is 1, 3, 5 ...
>>> The third 64M * 512, the hash is 2, 5, 8 ...
>>>
>>> It should be very easy to implement something like this, and this should
>>> reduce the conflicts of sequential access.
>>>
>>
>> Hi Shaohua,
>>
>> What I was concerned about was the memory footprint. For very fast
>> sequential I/O, linearly mapped buckets mean each (64 byte) cache line
>> contains 8 long integers, which is very friendly for CPU caching,
>> - sequentially writing 8 barrier units only requires 1 memory fetch for
>>   each of the barrier variables (barrier[], nr_pending[], nr_waiting[],
>>   nr_queued[]).
>> - memory prefetch may have a positive effect for sequential I/O.
>>
>> It will be very rare that resync I/O and regular I/O always conflict in the
>> same barrier bucket: resync I/O throughput is usually slower than regular
>> I/O, so it is very hard to keep them always conflicting in the same bucket.
>> If this condition really happens, the current sliding resync window
>> implementation may also have a similar conflict (regular I/O always hits
>> the resync window). So the barrier buckets don't make things worse.
>>
>> If we use a non-contiguous mapping, the memory footprint of the buckets
>> will be quite scattered, and a sequential I/O will occupy more cache lines,
>> which increases the probability of cache bouncing if multiple sequential
>> I/Os (at different offsets) happen on different CPU cores.
>>
>> This is why I use a very simple linear bucket hashing.
>
> I don't think the memory footprint matters much. And wasting a little memory
> (e.g. making the barrier and related fields cacheline aligned) doesn't
> matter here either, if necessary. If we can reduce the hash conflicts in an
> easy way, why not? I think Neil's suggestion (using hash_long) makes sense.
>
Okay, I will do it in the next version.

>>>> Compared to the single sliding resync window,
>>>>  - Currently resync I/O grows linearly, therefore regular and resync I/O
>>>>    will conflict within a single barrier unit. So it is similar to the
>>>>    single sliding resync window.
>>>>  - But a barrier bucket is shared by all barrier units with the same
>>>>    barrier bucket index, so the probability of conflict might be higher
>>>>    than with the single sliding resync window, in the condition that write
>>>>    I/Os always hit barrier units which have the same barrier bucket index
>>>>    as the resync I/Os. This is a very rare condition in real I/O
>>>>    workloads, I cannot imagine how it could happen in practice.
>>>>  - Therefore we can achieve a good enough low conflict rate with a much
>>>>    simpler barrier algorithm and implementation.
>>>>
>>>> If a user has a (really) large raid1 device, for example 10PB in size, we
>>>> may just increase the bucket number BARRIER_BUCKETS_NR. Now this is a
>>>> macro; it could become a variable defined at raid1 creation time in the
>>>> future.
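
(A note for the next version: switching get_barrier_bucket_idx() to hash_long(),
as Neil suggested, could look roughly like the untested sketch below.
BARRIER_BUCKETS_NR_BITS is a new macro I would add, here assumed to be
ilog2(BARRIER_BUCKETS_NR), i.e. 9 for 512 buckets; <linux/hash.h> would be
needed for hash_long().)

	/* untested sketch, not part of this patch */
	#define BARRIER_BUCKETS_NR_BITS	9	/* ilog2(BARRIER_BUCKETS_NR) */
	#define BARRIER_BUCKETS_NR	(1<<BARRIER_BUCKETS_NR_BITS)

	static inline long get_barrier_bucket_idx(sector_t sector)
	{
		/* Hash the barrier unit number instead of mapping it
		 * linearly, so sequential resync I/O and sequential regular
		 * I/O are less likely to keep landing in the same bucket.
		 */
		return (long)hash_long((unsigned long)(sector >> BARRIER_UNIT_SECTOR_BITS),
				       BARRIER_BUCKETS_NR_BITS);
	}

Only the mapping from barrier unit to bucket changes; how the buckets are used
stays exactly as in this patch.
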
>>>>
>>>> There are two changes which should be noticed,
>>>>  - In raid1d(), I change the code so that conf->nr_queued[idx] is
>>>>    decreased inside a single loop; it looks like this,
>>>> 	spin_lock_irqsave(&conf->device_lock, flags);
>>>> 	conf->nr_queued[idx]--;
>>>> 	spin_unlock_irqrestore(&conf->device_lock, flags);
>>>>    This change generates more spin lock operations, but in the next patch
>>>>    of this patch set it will be replaced by a single line of code,
>>>> 	atomic_dec(&conf->nr_queued[idx]);
>>>>    so we don't need to worry about the spin lock cost here.
>>>>  - In raid1_make_request(), wait_barrier() is replaced by,
>>>>    a) wait_read_barrier(): wait for the barrier in the regular read I/O
>>>>       code path
>>>>    b) wait_barrier(): wait for the barrier in the regular write I/O code
>>>>       path
>>>>    The difference is that wait_read_barrier() only waits if the array is
>>>>    frozen. I am not able to combine them into one function, because they
>>>>    must receive different data types in their argument lists.
>>>>  - align_to_barrier_unit_end() is called to make sure both regular and
>>>>    resync I/O won't go across the border of a barrier unit.
>>>>
>>>> Open question:
>>>>  - Needs review from an md clustering developer, I don't touch the
>>>>    related code for now.
>>>
>>> Don't think it matters, but please keep your eyes open, Guoqing!
>>>
>>>> Signed-off-by: Coly Li <colyli@xxxxxxx>
>>>> Cc: Shaohua Li <shli@xxxxxx>
>>>> Cc: Neil Brown <neilb@xxxxxxx>
>>>> Cc: Johannes Thumshirn <jthumshirn@xxxxxxx>
>>>> Cc: Guoqing Jiang <gqjiang@xxxxxxxx>
>>>> ---
>>>>  drivers/md/raid1.c | 389 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------------
>>>>  drivers/md/raid1.h |  42 +++++-------
>>>>  2 files changed, 242 insertions(+), 189 deletions(-)
>>>>
>>>> Index: linux-raid1/drivers/md/raid1.c
>>>> ===================================================================
>>>> --- linux-raid1.orig/drivers/md/raid1.c
>>>> +++ linux-raid1/drivers/md/raid1.c
>>>> @@ -66,9 +66,8 @@
>>>>   */
>>>>  static int max_queued_requests = 1024;
>>>>
>>>> -static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
>>>> -			  sector_t bi_sector);
>>>> -static void lower_barrier(struct r1conf *conf);
>>>> +static void allow_barrier(struct r1conf *conf, sector_t sector_nr);
>>>> +static void lower_barrier(struct r1conf *conf, sector_t sector_nr);
>>>>
>>>>  static void * r1bio_pool_alloc(gfp_t gfp_flags, void *data)
>>>>  {
>>>> @@ -92,7 +91,6 @@ static void r1bio_pool_free(void *r1_bio
>>>>  #define RESYNC_WINDOW_SECTORS (RESYNC_WINDOW >> 9)
>>>>  #define CLUSTER_RESYNC_WINDOW (16 * RESYNC_WINDOW)
>>>>  #define CLUSTER_RESYNC_WINDOW_SECTORS (CLUSTER_RESYNC_WINDOW >> 9)
>>>> -#define NEXT_NORMALIO_DISTANCE (3 * RESYNC_WINDOW_SECTORS)
>>>>
>>>>  static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
>>>>  {
>>>> @@ -198,6 +196,7 @@ static void put_buf(struct r1bio *r1_bio
>>>>  {
>>>>  	struct r1conf *conf = r1_bio->mddev->private;
>>>>  	int i;
>>>> +	sector_t sector_nr = r1_bio->sector;
>>>>
>>>>  	for (i = 0; i < conf->raid_disks * 2; i++) {
>>>>  		struct bio *bio = r1_bio->bios[i];
>>>> @@ -207,7 +206,7 @@ static void put_buf(struct r1bio *r1_bio
>>>>
>>>>  	mempool_free(r1_bio, conf->r1buf_pool);
>>>>
>>>> -	lower_barrier(conf);
>>>> +	lower_barrier(conf, sector_nr);
>>>>  }
>>>>
>>>>  static void reschedule_retry(struct r1bio *r1_bio)
>>>> @@ -215,10 +214,15 @@ static void reschedule_retry(struct r1bi
>>>>  	unsigned long flags;
>>>>  	struct mddev *mddev = r1_bio->mddev;
>>>>  	struct r1conf *conf = mddev->private;
>>>> +	sector_t sector_nr;
>>>> +	long idx;
>>>> +
>>>> +	sector_nr = r1_bio->sector;
>>>> +	idx = get_barrier_bucket_idx(sector_nr);
>>>>
>>>>  	spin_lock_irqsave(&conf->device_lock, flags);
>>>>  	list_add(&r1_bio->retry_list, &conf->retry_list);
>>>> -	conf->nr_queued ++;
>>>> +	conf->nr_queued[idx]++;
>>>>  	spin_unlock_irqrestore(&conf->device_lock, flags);
>>>>
>>>>  	wake_up(&conf->wait_barrier);
>>>> @@ -235,8 +239,6 @@ static void call_bio_endio(struct r1bio
>>>>  	struct bio *bio = r1_bio->master_bio;
>>>>  	int done;
>>>>  	struct r1conf *conf = r1_bio->mddev->private;
>>>> -	sector_t start_next_window = r1_bio->start_next_window;
>>>> -	sector_t bi_sector = bio->bi_iter.bi_sector;
>>>>
>>>>  	if (bio->bi_phys_segments) {
>>>>  		unsigned long flags;
>>>> @@ -255,19 +257,14 @@ static void call_bio_endio(struct r1bio
>>>>  	if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
>>>>  		bio->bi_error = -EIO;
>>>>
>>>> -	if (done) {
>>>> +	if (done)
>>>>  		bio_endio(bio);
>>>> -		/*
>>>> -		 * Wake up any possible resync thread that waits for the device
>>>> -		 * to go idle.
>>>> -		 */
>>>> -		allow_barrier(conf, start_next_window, bi_sector);
>>>> -	}
>>>>  }
>>>>
>>>>  static void raid_end_bio_io(struct r1bio *r1_bio)
>>>>  {
>>>>  	struct bio *bio = r1_bio->master_bio;
>>>> +	struct r1conf *conf = r1_bio->mddev->private;
>>>>
>>>>  	/* if nobody has done the final endio yet, do it now */
>>>>  	if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
>>>> @@ -278,6 +275,12 @@ static void raid_end_bio_io(struct r1bio
>>>>
>>>>  		call_bio_endio(r1_bio);
>>>>  	}
>>>> +
>>>> +	/*
>>>> +	 * Wake up any possible resync thread that waits for the device
>>>> +	 * to go idle.
>>>> +	 */
>>>> +	allow_barrier(conf, r1_bio->sector);
>>>
>>> Why this change?
>>>
>>
>> I moved allow_barrier() here on purpose. The current raid1 code raises
>> nr_pending for each master bio; the barrier buckets patch raises
>> nr_pending[idx] for each r1_bio.
>>
>> The current code increases nr_pending for each master bio (the bio
>> structure received by raid1_make_request()). A master bio may be split
>> into multiple r1_bio structures; when all the r1_bios complete, nr_pending
>> is decreased for the original master bio.
>>
>> Since the master bio may go across multiple barrier units, in this patch
>> the master bio will be split into multiple r1_bio structures, one for each
>> barrier unit. In this case we need to call wait_barrier() for each r1_bio,
>> and call allow_barrier() on each r1_bio completion. This is why
>> allow_barrier() is moved from master bio completion time to r1_bio
>> completion time.
>>
>> A master bio may also be split into multiple r1_bios if bad blocks are
>> encountered, and these r1_bios may stay in the same barrier unit. In this
>> case the different r1_bios do increase nr_pending[] for the same bucket
>> index, so we should also call wait_barrier() for each r1_bio and call
>> allow_barrier() at its completion time.
>
> I think if you do bio_split before raid1_make_request, these issues go away.
>
Let me try it :-)
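
Roughly what I have in mind, modelled on how raid10_make_request() uses
bio_split() — an untested sketch only, where __raid1_make_request() would be
the current raid1_make_request() renamed, and the flush/plug handling is left
out for brevity:

	static void raid1_make_request(struct mddev *mddev, struct bio *bio)
	{
		struct bio *split;
		sector_t sectors;

		do {
			/* If the bio crosses a barrier unit boundary, split it
			 * so that everything below __raid1_make_request() only
			 * sees bios which stay inside one barrier bucket.
			 */
			sectors = align_to_barrier_unit_end(
					bio->bi_iter.bi_sector, bio_sectors(bio));
			if (sectors < bio_sectors(bio)) {
				split = bio_split(bio, sectors, GFP_NOIO,
						  fs_bio_set);
				bio_chain(split, bio);
			} else {
				split = bio;
			}
			__raid1_make_request(mddev, split);
		} while (split != bio);
	}

The helper is reused here only to compute the split length; whether it can
then be dropped everywhere else is something I still need to check.
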
>>>>  	free_r1bio(r1_bio);
>>>>  }
>>>>
>>>> @@ -311,6 +314,7 @@ static int find_bio_disk(struct r1bio *r
>>>>  	return mirror;
>>>>  }
>>>>
>>>> +/* bi_end_io callback for regular READ bio */
>>>>  static void raid1_end_read_request(struct bio *bio)
>>>>  {
>>>>  	int uptodate = !bio->bi_error;
>>>> @@ -490,6 +494,25 @@ static void raid1_end_write_request(stru
>>>>  	bio_put(to_put);
>>>>  }
>>>>
>>>> +static sector_t align_to_barrier_unit_end(sector_t start_sector,
>>>> +					  sector_t sectors)
>>>> +{
>>>> +	sector_t len;
>>>> +
>>>> +	WARN_ON(sectors == 0);
>>>> +	/* len is the number of sectors from start_sector to end of the
>>>> +	 * barrier unit which start_sector belongs to.
>>>> +	 */
>>>> +	len = ((start_sector + sectors + (1<<BARRIER_UNIT_SECTOR_BITS) - 1) &
>>>> +	       (~(BARRIER_UNIT_SECTOR_SIZE - 1))) -
>>>> +	      start_sector;
>>>> +
>>>> +	if (len > sectors)
>>>> +		len = sectors;
>>>> +
>>>> +	return len;
>>>> +}
>>>
>>> I'd prefer calling bio_split early in raid1_make_request and splitting the
>>> bio if it crosses the bucket boundary. Please see raid10_make_request for
>>> an example. Resync will not cross the boundary, so the code will not need
>>> to worry about the boundary. I believe this will make the code simpler
>>> (all the align_to_barrier_unit_end() calls can be removed) and easier to
>>> understand.
>>>
>>
>> Indeed, my first implementation used bio_split(). The reasons why I didn't
>> use it in the end are,
>> - indeed I need to write more code.
>> - I can't simply use the existing r1_bio completion code to call
>>   bio_endio() on the master bio when bio->bi_phys_segments reaches zero
>>   (see call_bio_endio()), because r1_bio->master_bio is the split bio, not
>>   the original master bio received by raid1_make_request().
>>
>> Therefore I finally decided to use align_to_barrier_unit_end() to generate
>> more r1_bio structures when the original master bio goes across the barrier
>> unit size; it needs fewer modifications in this patch.
>
> That bi_phys_segments issue doesn't make sense to me. Please see how raid10
> uses bio_split:
> - rename raid1_make_request to __raid1_make_request
> - add a raid1_make_request, split the bio there, and then call
>   __raid1_make_request
>
> The __raid1_make_request does everything it currently does, so the
> bi_phys_segments handling should still work. This might add more code, but
> now we don't need to worry about the boundary problem anywhere else except
> the new raid1_make_request. For example, the allow_barrier issue goes away
> with this approach. This will make the code more understandable and less
> error prone.
>
Thanks for the hint. I will use this method in the next version.

>>>> +
>>>>  /*
>>>>   * This routine returns the disk from which the requested read should
>>>>   * be done. There is a per-array 'next expected sequential IO' sector
>>>> @@ -691,6 +714,7 @@ static int read_balance(struct r1conf *c
>>>>  		conf->mirrors[best_disk].next_seq_sect = this_sector + sectors;
>>>>  	}
>>>>  	rcu_read_unlock();
>>>> +	sectors = align_to_barrier_unit_end(this_sector, sectors);
>>>>  	*max_sectors = sectors;
>>>>
>>>>  	return best_disk;
>>>> @@ -779,168 +803,174 @@ static void flush_pending_writes(struct
>>>>   * there is no normal IO happening. It must arrange to call
>>>>   * lower_barrier when the particular background IO completes.
>>>>   */
>>>>
>>>> +
>>>>  static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
>>>>  {
>>>> +	long idx = get_barrier_bucket_idx(sector_nr);
>>>> +
>>>>  	spin_lock_irq(&conf->resync_lock);
>>>>
>>>>  	/* Wait until no block IO is waiting */
>>>> -	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting,
>>>> +	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting[idx],
>>>>  			    conf->resync_lock);
>>>>
>>>>  	/* block any new IO from starting */
>>>> -	conf->barrier++;
>>>> -	conf->next_resync = sector_nr;
>>>> +	conf->barrier[idx]++;
>>>>
>>>>  	/* For these conditions we must wait:
>>>>  	 * A: while the array is in frozen state
>>>> -	 * B: while barrier >= RESYNC_DEPTH, meaning resync reach
>>>> -	 *    the max count which allowed.
>>>> -	 * C: next_resync + RESYNC_SECTORS > start_next_window, meaning
>>>> -	 *    next resync will reach to the window which normal bios are
>>>> -	 *    handling.
>>>> -	 * D: while there are any active requests in the current window.
>>>> +	 * B: while conf->nr_pending[idx] is not 0, meaning regular I/O
>>>> +	 *    existing in sector number ranges corresponding to idx.
>>>> +	 * C: while conf->barrier[idx] >= RESYNC_DEPTH, meaning resync reach
>>>> +	 *    the max count which allowed in sector number ranges
>>>> +	 *    conrresponding to idx.
>>>>  	 */
>>>>  	wait_event_lock_irq(conf->wait_barrier,
>>>> -			    !conf->array_frozen &&
>>>> -			    conf->barrier < RESYNC_DEPTH &&
>>>> -			    conf->current_window_requests == 0 &&
>>>> -			    (conf->start_next_window >=
>>>> -			     conf->next_resync + RESYNC_SECTORS),
>>>> +			    !conf->array_frozen && !conf->nr_pending[idx] &&
>>>> +			    conf->barrier[idx] < RESYNC_DEPTH,
>>>>  			    conf->resync_lock);
>>>
>>> So there is a slight behavior change. Originally we guarantee no more than
>>> RESYNC_DEPTH syncs. Now this only checks 'conf->barrier[idx] <
>>> RESYNC_DEPTH'. What about barrier[idx-1], barrier[idx-2]...? It's possible
>>> that conf->barrier[idx] < RESYNC_DEPTH, but barrier[idx] + barrier[idx-1] >
>>> RESYNC_DEPTH. Not sure how important this is, but at least we can check
>>> barrier[idx] + barrier[idx-1].
>>
>> The original code wants to raise multiple barriers, but fewer than
>> RESYNC_DEPTH. It helps resync I/O throughput with less conflict between
>> resync I/O and regular I/O. Considering that conf->barrier is a global
>> barrier, a large RESYNC_DEPTH would starve regular I/O, so it is only set
>> to 32 for the whole raid1 device.
>>
>> In the barrier buckets patch, conf->barrier[idx] only takes effect in a
>> single bucket. More barriers raised in one barrier bucket won't interfere
>> with regular I/O in other barrier buckets, therefore we can raise many more
>> resync I/O barriers, which is good for resync I/O throughput without
>> starving more regular I/Os.
>>
>> If we have 512 barrier buckets, that means 512*RESYNC_DEPTH barriers can be
>> raised on the whole raid1 device, which allows more parallel resync I/O.
>>
>> Before implementing parallel resync, I keep RESYNC_DEPTH as 32 in this
>> patch, because we only have one resync I/O, which hits one barrier bucket,
>> so the same RESYNC_DEPTH won't change the barrier behavior now. Later, when
>> we have parallel resync I/O, let's see whether we need to modify this
>> value, or still keep it as 32.
>
> The problem is that resync writes could harm the latency of regular IO. Even
> though we have a resync limitation mechanism, sending a lot of resync writes
> out could harm regular IO latency in the worst case. If the application has
> P99 metrics, more resync IO will definitely harm it. This will not be easy
> to observe though. Could we have a global pending counter to record barrier
> IO and use it in the same way as the old code? That said, I'm not sure how
> important this is, but this could make things worse, so I'd like to avoid
> the regression.
>
Okay, I will maintain a global barrier depth counter, and keep the global
barrier depth at 32, as the current upstream code does.
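
One possible shape for that, as an untested sketch: add a conf->barrier_total
counter next to the per-bucket conf->barrier[] array (the field name is just a
placeholder), bump it in raise_barrier() and drop it in lower_barrier(), and
let raise_barrier() also wait on the total, so the number of in-flight resync
barriers over the whole device still stays below RESYNC_DEPTH like the old
code:

	static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
	{
		long idx = get_barrier_bucket_idx(sector_nr);

		spin_lock_irq(&conf->resync_lock);

		/* Wait until no block IO is waiting in this bucket */
		wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting[idx],
				    conf->resync_lock);

		/* block any new IO from starting in this bucket */
		conf->barrier[idx]++;
		conf->barrier_total++;	/* placeholder: total over all buckets */

		/* Wait while the array is frozen, regular I/O is pending in
		 * this bucket, or the device-wide resync depth has reached
		 * RESYNC_DEPTH (preserving the old global limit).
		 */
		wait_event_lock_irq(conf->wait_barrier,
				    !conf->array_frozen &&
				    !conf->nr_pending[idx] &&
				    conf->barrier_total < RESYNC_DEPTH,
				    conf->resync_lock);

		conf->nr_pending[idx]++;
		spin_unlock_irq(&conf->resync_lock);
	}

lower_barrier() would then decrement barrier_total together with barrier[idx]
under the same lock.
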
>>>> -
>>>> -	conf->nr_pending++;
>>>> +	conf->nr_pending[idx]++;
>>>>  	spin_unlock_irq(&conf->resync_lock);
>>>>  }
>>>>
>>>> -static void lower_barrier(struct r1conf *conf)
>>>> +static void lower_barrier(struct r1conf *conf, sector_t sector_nr)
>>>>  {
>>>>  	unsigned long flags;
>>>> -	BUG_ON(conf->barrier <= 0);
>>>> +	long idx = get_barrier_bucket_idx(sector_nr);
>>>> +
>>>> +	BUG_ON(conf->barrier[idx] <= 0);
>>>>  	spin_lock_irqsave(&conf->resync_lock, flags);
>>>> -	conf->barrier--;
>>>> -	conf->nr_pending--;
>>>> +	conf->barrier[idx]--;
>>>> +	conf->nr_pending[idx]--;
>>>>  	spin_unlock_irqrestore(&conf->resync_lock, flags);
>>>>  	wake_up(&conf->wait_barrier);
>>>>  }
>>>>
>>>> -static bool need_to_wait_for_sync(struct r1conf *conf, struct bio *bio)
>>>> +/* A regular I/O should wait when,
>>>> + * - The whole array is frozen (both READ and WRITE)
>>>> + * - bio is WRITE and in same barrier bucket conf->barrier[idx] raised
>>>> + */
>>>> +static void _wait_barrier(struct r1conf *conf, long idx)
>>>>  {
>>>> -	bool wait = false;
>>>> -
>>>> -	if (conf->array_frozen || !bio)
>>>> -		wait = true;
>>>> -	else if (conf->barrier && bio_data_dir(bio) == WRITE) {
>>>> -		if ((conf->mddev->curr_resync_completed
>>>> -		     >= bio_end_sector(bio)) ||
>>>> -		    (conf->next_resync + NEXT_NORMALIO_DISTANCE
>>>> -		     <= bio->bi_iter.bi_sector))
>>>> -			wait = false;
>>>> -		else
>>>> -			wait = true;
>>>> +	spin_lock_irq(&conf->resync_lock);
>>>> +	if (conf->array_frozen || conf->barrier[idx]) {
>>>> +		conf->nr_waiting[idx]++;
>>>> +		/* Wait for the barrier to drop. */
>>>> +		wait_event_lock_irq(
>>>> +			conf->wait_barrier,
>>>> +			!conf->array_frozen && !conf->barrier[idx],
>>>> +			conf->resync_lock);
>>>> +		conf->nr_waiting[idx]--;
>>>>  	}
>>>>
>>>> -	return wait;
>>>> +	conf->nr_pending[idx]++;
>>>> +	spin_unlock_irq(&conf->resync_lock);
>>>>  }
>>>>
>>>> -static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
>>>> +static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
>>>>  {
>>>> -	sector_t sector = 0;
>>>> +	long idx = get_barrier_bucket_idx(sector_nr);
>>>>
>>>>  	spin_lock_irq(&conf->resync_lock);
>>>> -	if (need_to_wait_for_sync(conf, bio)) {
>>>> -		conf->nr_waiting++;
>>>> -		/* Wait for the barrier to drop.
>>>> -		 * However if there are already pending
>>>> -		 * requests (preventing the barrier from
>>>> -		 * rising completely), and the
>>>> -		 * per-process bio queue isn't empty,
>>>> -		 * then don't wait, as we need to empty
>>>> -		 * that queue to allow conf->start_next_window
>>>> -		 * to increase.
>>>> -		 */
>>>> -		wait_event_lock_irq(conf->wait_barrier,
>>>> -				    !conf->array_frozen &&
>>>> -				    (!conf->barrier ||
>>>> -				     ((conf->start_next_window <
>>>> -				       conf->next_resync + RESYNC_SECTORS) &&
>>>> -				      current->bio_list &&
>>>> -				      !bio_list_empty(current->bio_list))),
>>>> -				    conf->resync_lock);
>>>> -		conf->nr_waiting--;
>>>> -	}
>>>> -
>>>> -	if (bio && bio_data_dir(bio) == WRITE) {
>>>> -		if (bio->bi_iter.bi_sector >= conf->next_resync) {
>>>> -			if (conf->start_next_window == MaxSector)
>>>> -				conf->start_next_window =
>>>> -					conf->next_resync +
>>>> -					NEXT_NORMALIO_DISTANCE;
>>>> -
>>>> -			if ((conf->start_next_window + NEXT_NORMALIO_DISTANCE)
>>>> -			    <= bio->bi_iter.bi_sector)
>>>> -				conf->next_window_requests++;
>>>> -			else
>>>> -				conf->current_window_requests++;
>>>> -			sector = conf->start_next_window;
>>>> -		}
>>>> +	if (conf->array_frozen) {
>>>> +		conf->nr_waiting[idx]++;
>>>> +		/* Wait for array to unfreeze */
>>>> +		wait_event_lock_irq(
>>>> +			conf->wait_barrier,
>>>> +			!conf->array_frozen,
>>>> +			conf->resync_lock);
>>>> +		conf->nr_waiting[idx]--;
>>>>  	}
>>>> -
>>>> -	conf->nr_pending++;
>>>> +	conf->nr_pending[idx]++;
>>>>  	spin_unlock_irq(&conf->resync_lock);
>>>> -	return sector;
>>>>  }
>>>>
>>>> -static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
>>>> -			  sector_t bi_sector)
>>>> +static void wait_barrier(struct r1conf *conf, sector_t sector_nr)
>>>> +{
>>>> +	long idx = get_barrier_bucket_idx(sector_nr);
>>>> +
>>>> +	_wait_barrier(conf, idx);
>>>> +}
>>>> +
>>>> +static void wait_all_barriers(struct r1conf *conf)
>>>> +{
>>>> +	long idx;
>>>> +
>>>> +	for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
>>>> +		_wait_barrier(conf, idx);
>>>
>>> This is racy. Since resync is sequential, ideally we only need to check the
>>> buckets sequentially. The compiler could do _wait_barrier(conf, 1) and then
>>> _wait_barrier(conf, 0). It's possible that bucket 1 gets a barrier right
>>> after the check of bucket 0. Even if the compiler doesn't reorder, there is
>>> a case where bucket 511 should be checked before bucket 0, if the resync is
>>> just crossing the 512 buckets. It would be better if we checked the resync
>>> position and adjusted the bucket wait order according to the position.
>>>
>>
>> wait_all_barriers() and allow_all_barriers() are used in close_sync() only,
>>  static void close_sync(struct r1conf *conf)
>>  {
>> -	wait_barrier(conf, NULL);
>> -	allow_barrier(conf, 0, 0);
>> +	wait_all_barriers(conf);
>> +	allow_all_barriers(conf);
>> [snip]
>>  }
>>
>> The wait_barrier() and allow_barrier() calls here are for synchronization,
>> to make sure that before close_sync() returns there is no resync I/O in
>> flight.
>>
>> close_sync() is called in raid1_sync_request() when resync I/O exceeds the
>> size of the raid1 device and resync should be closed. In this condition, no
>> new resync I/O will be generated. Once a conf->barrier[idx] reaches 0, it
>> won't increase before a new resync restarts. Therefore it is safe to call
>> _wait_barrier() one by one, until every conf->barrier[idx] reaches 0.
>>
>> This is a usage of wait/allow_barrier() which is not mentioned in the code.
>> They are not for regular I/O waiting on resync I/O; they are used as a
>> synchronization to make sure all in-flight resync I/O completes.
>
> thanks, this makes sense.
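
Maybe it is worth capturing that in a comment in the next version, so the
intent is also mentioned in the code, something like:

	/* wait_all_barriers() and allow_all_barriers() are only used by
	 * close_sync(). They are not about regular I/O waiting for resync
	 * I/O; they drain all in-flight resync I/O by walking the buckets
	 * until every conf->barrier[idx] has dropped to 0. close_sync() is
	 * only called after resync has passed the end of the device, so no
	 * new resync I/O can be raised while we walk the buckets.
	 */
	static void wait_all_barriers(struct r1conf *conf)
	{
		long idx;

		for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
			_wait_barrier(conf, idx);
	}
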
>>>>  	int good_sectors = RESYNC_SECTORS;
>>>>  	int min_bad = 0; /* number of sectors that are bad in all devices */
>>>> +	long idx = get_barrier_bucket_idx(sector_nr);
>>>>
>>>>  	if (!conf->r1buf_pool)
>>>>  		if (init_resync(conf))
>>>> @@ -2535,7 +2571,7 @@ static sector_t raid1_sync_request(struc
>>>>  	 * If there is non-resync activity waiting for a turn, then let it
>>>>  	 * though before starting on this new sync request.
>>>>  	 */
>>>> -	if (conf->nr_waiting)
>>>> +	if (conf->nr_waiting[idx])
>>>>  		schedule_timeout_uninterruptible(1);
>>>>
>>>>  	/* we are incrementing sector_nr below. To be safe, we check against
>>>> @@ -2562,6 +2598,8 @@ static sector_t raid1_sync_request(struc
>>>>  	r1_bio->sector = sector_nr;
>>>>  	r1_bio->state = 0;
>>>>  	set_bit(R1BIO_IsSync, &r1_bio->state);
>>>> +	/* make sure good_sectors won't go across barrier unit border */
>>>> +	good_sectors = align_to_barrier_unit_end(sector_nr, good_sectors);
>>>>
>>>>  	for (i = 0; i < conf->raid_disks * 2; i++) {
>>>>  		struct md_rdev *rdev;
>>>> @@ -2786,6 +2824,22 @@ static struct r1conf *setup_conf(struct
>>>>  	if (!conf)
>>>>  		goto abort;
>>>>
>>>> +	conf->nr_pending = kzalloc(PAGE_SIZE, GFP_KERNEL);
>>>> +	if (!conf->nr_pending)
>>>> +		goto abort;
>>>
>>> This allocation is a little weird. I'd define a count and use
>>> sizeof(nr_pending) * count to do the allocation. There is nothing related
>>> to PAGE_SIZE. Alternatively, just make nr_pending an array in r1conf.
>>
>> This is just for better memory usage: r1conf is allocated by the slab
>> allocator, and a large, unaligned size may cause internal memory
>> fragmentation inside the slab's pages. Calling kzalloc() with PAGE_SIZE
>> avoids such an issue.
>
> Not sure if slab has a close-enough slab size for this, but that's not
> important. I think sizeof(nr_pending) * count makes more sense here, even if
> sizeof(nr_pending) * count == PAGE_SIZE.
>
Okay, I will move these counters into r1conf in the next version.

Thanks for your review and suggestions!

Coly
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html