Re: [v2 PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window

On 2017/1/6 7:08 AM, NeilBrown wrote:
> On Wed, Dec 28 2016, Coly Li wrote:
> 
>> 'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of
>> iobarrier.")' introduces a sliding resync window for the raid1 I/O
>> barrier. This idea limits I/O barriers to happen only inside a
>> sliding resync window; regular I/Os outside this resync window
>> don't need to wait for the barrier any more. On a large raid1
>> device, it helps a lot to improve parallel write I/O throughput
>> when there are background resync I/Os running at the same time.
>> 
>> The idea of the sliding resync window is awesome, but there are
>> several challenges which are very difficult to solve,
>> - code complexity
>>   The sliding resync window requires several variables to work
>>   collectively, which is complex and very hard to make work
>>   correctly. Just grep "Fixes: 79ef3a8aa1" in the kernel git log,
>>   there are 8 more patches fixing the original resync window patch.
>>   This is not the end, any further related modification may easily
>>   introduce more regressions.
>> - multiple sliding resync windows
>>   Currently the raid1 code only has a single sliding resync window,
>>   we cannot do parallel resync with the current I/O barrier
>>   implementation. Implementing multiple resync windows is much more
>>   complex, and very hard to make correct.
> 
> I think I've asked this before, but why do you think that parallel 
> resync might ever be a useful idea?  I don't think it makes any
> sense, so it is wrong for you to use it as part of the justification
> for this patch. Just don't mention it at all unless you have a
> genuine expectation that it would really be a good thing, in which
> case: explain the value.
> 

I will remove this from the patch log. Thanks for your suggestion.


>> 
>> Therefore I decided to implement a much simpler raid1 I/O barrier
>> by removing the resync window code; I believe life will be much
>> easier.
>> 
>> The brief idea of the simpler barrier is,
>> - Do not maintain a global unique resync window.
>> - Use multiple hash buckets to reduce I/O barrier conflicts; a
>>   regular I/O only has to wait for a resync I/O when both of them
>>   have the same barrier bucket index, and vice versa.
>> - I/O barrier conflicts can be reduced to an acceptable number if
>>   there are enough barrier buckets.
>> 
>> Here I explain how the barrier buckets are designed,
>> - BARRIER_UNIT_SECTOR_SIZE
>>   The whole LBA address space of a raid1 device is divided into
>>   multiple barrier units, each of size BARRIER_UNIT_SECTOR_SIZE.
>>   A bio request won't cross a barrier unit boundary, which means
>>   the maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 bytes.
> 
> It would be good to say here what number you chose, and why you
> chose it. You have picked 64MB.  This divides a 1TB device into
> 4096 regions. Any write request must fit into one of these regions,
> so we mustn't make the region too small, else we would lose the
> benefit of sending large requests down.
> 
> We want the resync to move from region to region fairly quickly so
> that the slowness caused by having to synchronize with the resync
> is averaged out over a fairly small time frame.  At full speed, 64MB
> should take less than 1 second.  When resync is competing with
> other IO, it could easily take up to a minute(?).  I think that is
> a fairly good range.
> 
> So I think 64MB is probably a very good choice.  I just would like
> to see the justification clearly stated.

I see, I will add text to explain why I chose the 64MB bucket unit size.

The reason for 64MB is just as you mentioned: it is a trade-off
between memory consumption and hash conflict rate. I did some
calculations for md raid1 target sizes from 1TB to 10PB, and for the
maximum I/O throughput of NVMe SSDs; in the end I decided that a
bucket unit size of 64~128MB and a bucket count of 512~1024 are
proper numbers.

I will explain the calculation in detail in the next version of the
patch.
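
As a rough illustration (back-of-the-envelope numbers for the
changelog only, not part of the patch), the counter memory and the
chance that a random small write shares a bucket with the resync I/O
look like this:

	/* standalone sketch, not kernel code */
	#include <stdio.h>

	int main(void)
	{
		/* nr_pending, nr_waiting, nr_queued, barrier */
		const unsigned long counter_arrays = 4;
		unsigned long buckets;

		for (buckets = 512; buckets <= 1024; buckets <<= 1) {
			unsigned long bytes = counter_arrays * buckets * sizeof(int);
			/* a write smaller than the 64MB unit hits the same
			 * bucket as resync with probability 1/buckets
			 */
			printf("%4lu buckets: %5lu bytes of counters, ~%.2f%% conflict\n",
			       buckets, bytes, 100.0 / buckets);
		}
		return 0;
	}

With 512 buckets that is 8KB of counters and a ~0.2% conflict chance,
with 1024 buckets 16KB and ~0.1%, which is where the 512~1024 range
comes from.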

> 
>> - BARRIER_BUCKETS_NR
>>   There are BARRIER_BUCKETS_NR buckets in total, which is defined by,
>>     #define BARRIER_BUCKETS_NR_BITS   9
>>     #define BARRIER_BUCKETS_NR        (1<<BARRIER_BUCKETS_NR_BITS)
> 
> Why 512 buckets?  What are the tradeoffs? More buckets means more
> memory consumed for counters. Fewer buckets means more false
> sharing. With 512 buckets, a request which is smaller than the
> region size has a 0.2% chance of having to wait for resync to
> pause.  I think that is quite a small enough fraction. I think you
> originally chose the number of buckets so that a set of 4-byte
> counters fits exactly into a page.  I think that is still a good 
> guideline, so I would have #define BARRIER_BUCKETS_NR_BITS
> (PAGE_SHIFT - 2) (which makes it 10 ...).
> 

Good suggestion, 1024 buckets result in fewer hash conflicts in each
bucket. I will change it in the next version of the patch.
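
Just to confirm I read the suggestion correctly, the defines in
raid1.h would then look roughly like this (unit size still 64MB):

	/* each barrier unit size is 64MB for now,
	 * it must be larger than RESYNC_DEPTH
	 */
	#define BARRIER_UNIT_SECTOR_BITS	17
	#define BARRIER_UNIT_SECTOR_SIZE	(1<<17)
	/* one page of 4-byte counters per array,
	 * i.e. 1024 buckets with 4KB pages
	 */
	#define BARRIER_BUCKETS_NR_BITS		(PAGE_SHIFT - 2)
	#define BARRIER_BUCKETS_NR		(1<<BARRIER_BUCKETS_NR_BITS)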


>> If multiple I/O requests hit different barrier units, they only
>> need to compete for the I/O barrier with other I/Os which hit the
>> same barrier bucket index. The index of the barrier bucket which a
>> bio should look for is calculated by,
>>     int idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS)
> 
> This isn't right.  You have to divide by BARRIER_UNIT_SECTOR_SIZE
> first:
>     int idx = hash_long(sector_nr >> BARRIER_UNIT_SECTOR_BITS,
>                         BARRIER_BUCKETS_NR_BITS);
> 

Oops, thanks for catching this. I will fix it in the next version of
the patch.
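
For the record, the corrected calculation I plan to use, wrapped in a
small helper (the sector_to_idx() name is just what I intend to call
it, nothing final):

	/* map a device sector to its barrier bucket index: first convert
	 * the sector to a barrier unit number, then hash the unit number
	 * into one of the BARRIER_BUCKETS_NR buckets
	 */
	static inline int sector_to_idx(sector_t sector)
	{
		return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
				 BARRIER_BUCKETS_NR_BITS);
	}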


>> where sector_nr is the start sector number of the bio. We use the
>> function align_to_barrier_unit_end() to calculate the number of
>> sectors from sector_nr to the next barrier unit boundary; if the
>> requesting bio crosses that boundary, we split the bio in
>> raid1_make_request(), to make sure the final bio sent into
>> generic_make_request() won't exceed the barrier unit boundary.
>> 
>> Compared to the single sliding resync window,
>> - Currently resync I/O grows linearly, therefore regular and resync
>>   I/O will conflict within a single barrier unit. So it is similar
>>   to the single sliding resync window.
>> - But a barrier bucket is shared by all barrier units with the same
>>   barrier bucket index, so the probability of conflict might be
>>   higher than with a single sliding resync window, in the case that
>>   write I/Os always hit barrier units which have the same barrier
>>   bucket index as the resync I/Os. This is a very rare condition in
>>   real I/O workloads, I cannot imagine how it could happen in
>>   practice.
>> - Therefore we can achieve a low enough conflict rate with a much
>>   simpler barrier algorithm and implementation.
>> 
>> If a user has a (really) large raid1 device, for example 10PB in
>> size, we may just increase the bucket number BARRIER_BUCKETS_NR.
>> Now this is a macro, it could become a variable defined at raid1
>> creation time in the future.
> 
> Why?  Why would a large array require more buckets?  Are you just 
> guessing, or do you see some concrete reason for there to be a 
> relationship between the size of the array and the number of
> buckets? If you can see a connection, please state it.  If not,
> don't mention it.
> 

This is an assumption. OK, I will remove this text from the patch log.


>> 
>> There are two changes that should be noticed,
>> - In raid1d(), I change the code to decrease conf->nr_queued[idx]
>>   inside a single loop, it looks like this,
>>     spin_lock_irqsave(&conf->device_lock, flags);
>>     conf->nr_queued[idx]--;
>>     spin_unlock_irqrestore(&conf->device_lock, flags);
>>   This change generates more spin lock operations, but in the next
>>   patch of this patch set it will be replaced by a single line of
>>   code,
>>     atomic_dec(&conf->nr_queued[idx]);
>>   so we don't need to worry about spin lock cost here.
>> - The original function raid1_make_request() is split into two
>>   functions,
>>   - raid1_make_read_request(): handles a regular read request and
>>     calls wait_read_barrier() for the I/O barrier.
>>   - raid1_make_write_request(): handles a regular write request and
>>     calls wait_barrier() for the I/O barrier.
>>   The difference is that wait_read_barrier() only waits if the
>>   array is frozen; using different barrier functions in different
>>   code paths makes the code cleaner and easier to read.
>> - align_to_barrier_unit_end() is called to make sure both regular
>>   and resync I/O won't go across a barrier unit boundary.
>> 
>> Changelog
>> V1:
>> - Original RFC patch for comments.
>> V2:
>> - Use bio_split() to split the original bio if it goes across a
>>   barrier unit boundary, to make the code simpler, by suggestion
>>   from Shaohua and Neil.
>> - Use hash_long() to replace the original linear hash, to avoid a
>>   possible conflict between resync I/O and sequential write I/O,
>>   by suggestion from Shaohua.
>> - Add conf->total_barriers to record barrier depth, which is used
>>   to control the number of parallel sync I/O barriers, by
>>   suggestion from Shaohua.
> 
> I really don't think this is needed. As long as RESYNC_DEPTH *
> RESYNC_SECTORS is less than BARRIER_UNIT_SECTOR_SIZE, just testing
> against ->barrier[idx] will ensure the number of barrier requests
> never exceeds RESYNC_DEPTH*2.  That is sufficient.
> 
> Also, I think the reason for imposing the RESYNC_DEPTH limit is to
> make sure regular IO never has to wait too long for pending resync
> requests to flush.  With the simple test, regular IO will never
> need to wait for more than RESYNC_DEPTH requests to complete.
> 
> So I think having this field brings no value, and is potentially
> confusing.
> 

OK, I will remove conf->total_barriers and go back to the original
implementation. IMHO, this is a threshold for hard disks; for SSDs
this limit could be much larger, which is why the original version
had no conf->total_barriers.

I will fix this in the next version of the patch.
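
In other words, the wait condition in raise_barrier() will go back to
a per-bucket depth check, roughly:

	/* For these conditions we must wait:
	 * A: while the array is in frozen state
	 * B: while regular I/O is pending in this barrier bucket
	 * C: while this bucket already holds RESYNC_DEPTH barriers
	 */
	wait_event_lock_irq(conf->wait_barrier,
			    !conf->array_frozen &&
			    !conf->nr_pending[idx] &&
			    conf->barrier[idx] < RESYNC_DEPTH,
			    conf->resync_lock);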


>> - In the V1 patch, the barrier bucket related members below in
>>   r1conf were allocated in a memory page. To make the code simpler,
>>   the V2 patch moves the memory into struct r1conf, like this,
>>   -       int                     nr_pending;
>>   -       int                     nr_waiting;
>>   -       int                     nr_queued;
>>   -       int                     barrier;
>>   +       int                     nr_pending[BARRIER_BUCKETS_NR];
>>   +       int                     nr_waiting[BARRIER_BUCKETS_NR];
>>   +       int                     nr_queued[BARRIER_BUCKETS_NR];
>>   +       int                     barrier[BARRIER_BUCKETS_NR];
> 
> I don't like this.  It makes the r1conf 4 pages in size, most of
> which is wasted.  A 4-page allocation is more likely to fail than a
> few 1-page allocations. I think these should be:
>> +       int                     *nr_pending;
>> +       int                     *nr_waiting;
>> +       int                     *nr_queued;
>> +       int                     *barrier;
> 
> Then use kcalloc(BARRIER_BUCKETS_NR, sizeof(int), GFP_KERNEL) to
> allocate each array.  I think this approach addresses Shaohua's
> concerns without requiring a multi-page allocation.
> 

Very constructive suggestion. Yes, I will make this change in the next
version of the patch.
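
So in setup_conf() the allocation would look something like this
sketch (with the four members changed to int pointers as you show,
and matching kfree() calls added to the failure path and free_conf()):

	conf->nr_pending = kcalloc(BARRIER_BUCKETS_NR, sizeof(int), GFP_KERNEL);
	if (!conf->nr_pending)
		goto abort;
	conf->nr_waiting = kcalloc(BARRIER_BUCKETS_NR, sizeof(int), GFP_KERNEL);
	if (!conf->nr_waiting)
		goto abort;
	conf->nr_queued = kcalloc(BARRIER_BUCKETS_NR, sizeof(int), GFP_KERNEL);
	if (!conf->nr_queued)
		goto abort;
	conf->barrier = kcalloc(BARRIER_BUCKETS_NR, sizeof(int), GFP_KERNEL);
	if (!conf->barrier)
		goto abort;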

>>   This change is by the suggestion from Shaohua.
>> - Remove some irrelevant code comments, by suggestion from Guoqing.
>> - Add a missing wait_barrier() before jumping to retry_write, in
>>   raid1_make_write_request().
>> 
>> Signed-off-by: Coly Li <colyli@xxxxxxx>
>> Cc: Shaohua Li <shli@xxxxxx>
>> Cc: Neil Brown <neilb@xxxxxxx>
>> Cc: Johannes Thumshirn <jthumshirn@xxxxxxx>
>> Cc: Guoqing Jiang <gqjiang@xxxxxxxx>
>> ---
>>  drivers/md/raid1.c | 485 ++++++++++++++++++++++++++++++-----------------------
>>  drivers/md/raid1.h |  37 ++--
>>  2 files changed, 291 insertions(+), 231 deletions(-)
>> 
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index
>> a1f3fbe..5813656 100644 --- a/drivers/md/raid1.c +++
>> b/drivers/md/raid1.c @@ -67,9 +67,8 @@ */ static int
>> max_queued_requests = 1024;
>> 
>> -static void allow_barrier(struct r1conf *conf, sector_t
>> start_next_window, -			  sector_t bi_sector); -static void
>> lower_barrier(struct r1conf *conf); +static void
>> allow_barrier(struct r1conf *conf, sector_t sector_nr); +static
>> void lower_barrier(struct r1conf *conf, sector_t sector_nr);
>> 
>> #define raid1_log(md, fmt, args...)				\ do { if ((md)->queue)
>> blk_add_trace_msg((md)->queue, "raid1 " fmt, ##args); } while
>> (0) @@ -96,7 +95,6 @@ static void r1bio_pool_free(void *r1_bio,
>> void *data) #define RESYNC_WINDOW_SECTORS (RESYNC_WINDOW >> 9) 
>> #define CLUSTER_RESYNC_WINDOW (16 * RESYNC_WINDOW) #define
>> CLUSTER_RESYNC_WINDOW_SECTORS (CLUSTER_RESYNC_WINDOW >> 9) 
>> -#define NEXT_NORMALIO_DISTANCE (3 * RESYNC_WINDOW_SECTORS)
>> 
>> static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data) { @@
>> -211,7 +209,7 @@ static void put_buf(struct r1bio *r1_bio)
>> 
>> mempool_free(r1_bio, conf->r1buf_pool);
>> 
>> -	lower_barrier(conf); +	lower_barrier(conf, r1_bio->sector); }
>> 
>> static void reschedule_retry(struct r1bio *r1_bio) @@ -219,10
>> +217,12 @@ static void reschedule_retry(struct r1bio *r1_bio) 
>> unsigned long flags; struct mddev *mddev = r1_bio->mddev; struct
>> r1conf *conf = mddev->private; +	int idx;
>> 
>> +	idx = hash_long(r1_bio->sector, BARRIER_BUCKETS_NR_BITS); 
>> spin_lock_irqsave(&conf->device_lock, flags); 
>> list_add(&r1_bio->retry_list, &conf->retry_list); -
>> conf->nr_queued ++; +	conf->nr_queued[idx]++; 
>> spin_unlock_irqrestore(&conf->device_lock, flags);
>> 
>> wake_up(&conf->wait_barrier); @@ -239,8 +239,6 @@ static void
>> call_bio_endio(struct r1bio *r1_bio) struct bio *bio =
>> r1_bio->master_bio; int done; struct r1conf *conf =
>> r1_bio->mddev->private; -	sector_t start_next_window =
>> r1_bio->start_next_window; -	sector_t bi_sector =
>> bio->bi_iter.bi_sector;
>> 
>> if (bio->bi_phys_segments) { unsigned long flags; @@ -265,7
>> +263,7 @@ static void call_bio_endio(struct r1bio *r1_bio) * Wake
>> up any possible resync thread that waits for the device * to go
>> idle. */ -		allow_barrier(conf, start_next_window, bi_sector); +
>> allow_barrier(conf, bio->bi_iter.bi_sector);
> 
> Why did you change this to use "bio->bi_iter.bi_sector" instead of 
> "bi_sector"?
> 
> I assume you thought it was an optimization that you would just
> slip in.  Can't hurt, right?
> 
> Just before this line is: bio_endio(bio); and that might cause the
> bio to be freed.  So your code could access freed memory.
> 
> Please be *very* cautious when making changes that are not
> directly related to the purpose of the patch.

Copied, I will fix it in the next version of the patch. This is a
regression I introduced when I switched to bio_split() and moved the
location of allow_barrier(). Thanks for catching this.
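
Concretely, I will go back to saving the sector in a local variable
before bio_endio() can free the bio, so the tail of call_bio_endio()
would look roughly like:

	/* saved before bio_endio(), which may free the bio */
	sector_t bi_sector = bio->bi_iter.bi_sector;

	/* ... bi_phys_segments / uptodate handling unchanged ... */

	if (done) {
		bio_endio(bio);
		/*
		 * Wake up any possible resync thread that waits for the
		 * device to go idle; use the saved sector, the bio must
		 * not be touched here any more.
		 */
		allow_barrier(conf, bi_sector);
	}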

>> } }
>> 
>> @@ -513,6 +511,25 @@ static void raid1_end_write_request(struct
>> bio *bio) bio_put(to_put); }
>> 
>> +static sector_t align_to_barrier_unit_end(sector_t
>> start_sector, +					  sector_t sectors) +{ +	sector_t len; + +
>> WARN_ON(sectors == 0); +	/* len is the number of sectors from
>> start_sector to end of the +	 * barrier unit which start_sector
>> belongs to. +	 */ +	len = ((start_sector + sectors +
>> (1<<BARRIER_UNIT_SECTOR_BITS) - 1) & +
>> (~(BARRIER_UNIT_SECTOR_SIZE - 1))) - +	      start_sector;
> 
> This would be better as
> 
> len = round_up(start_sector+1, BARRIER_UNIT_SECTOR_SIZE) -
> start_sector;
> 

Aha! Yes, I will modify the code this way. Thanks for the suggestion.
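
For completeness, the helper then becomes roughly:

	static sector_t align_to_barrier_unit_end(sector_t start_sector,
						  sector_t sectors)
	{
		sector_t len;

		WARN_ON(sectors == 0);
		/* len is the number of sectors from start_sector to the end
		 * of the barrier unit which start_sector belongs to
		 */
		len = round_up(start_sector + 1, BARRIER_UNIT_SECTOR_SIZE) -
		      start_sector;

		if (len > sectors)
			len = sectors;

		return len;
	}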

> 
>> + +	if (len > sectors) +		len = sectors; + +	return len; +} + /* 
>> * This routine returns the disk from which the requested read
>> should * be done. There is a per-array 'next expected sequential
>> IO' sector @@ -809,168 +826,179 @@ static void
>> flush_pending_writes(struct r1conf *conf) */ static void
>> raise_barrier(struct r1conf *conf, sector_t sector_nr) { +	int
>> idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS); + 
>> spin_lock_irq(&conf->resync_lock);
>> 
>> /* Wait until no block IO is waiting */ -
>> wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting, +
>> wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting[idx], 
>> conf->resync_lock);
>> 
>> /* block any new IO from starting */ -	conf->barrier++; -
>> conf->next_resync = sector_nr; +	conf->barrier[idx]++; +
>> conf->total_barriers++;
>> 
>> /* For these conditions we must wait: * A: while the array is in
>> frozen state -	 * B: while barrier >= RESYNC_DEPTH, meaning
>> resync reach -	 *    the max count which allowed. -	 * C:
>> next_resync + RESYNC_SECTORS > start_next_window, meaning -	 *
>> next resync will reach to the window which normal bios are -	 *
>> handling. -	 * D: while there are any active requests in the
>> current window. +	 * B: while conf->nr_pending[idx] is not 0,
>> meaning regular I/O +	 *    existing in sector number ranges
>> corresponding to idx. +	 * C: while conf->total_barriers >=
>> RESYNC_DEPTH, meaning resync reach +	 *    the max count which
>> allowed on the whole raid1 device. */ 
>> wait_event_lock_irq(conf->wait_barrier, !conf->array_frozen && -
>> conf->barrier < RESYNC_DEPTH && -
>> conf->current_window_requests == 0 && -
>> (conf->start_next_window >= -			     conf->next_resync +
>> RESYNC_SECTORS), +			     !conf->nr_pending[idx] && +
>> conf->total_barriers < RESYNC_DEPTH, conf->resync_lock);
>> 
>> -	conf->nr_pending++; +	conf->nr_pending[idx]++; 
>> spin_unlock_irq(&conf->resync_lock); }
>> 
>> -static void lower_barrier(struct r1conf *conf) +static void
>> lower_barrier(struct r1conf *conf, sector_t sector_nr) { unsigned
>> long flags; -	BUG_ON(conf->barrier <= 0); +	int idx =
>> hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS); + +
>> BUG_ON((conf->barrier[idx] <= 0) || conf->total_barriers <= 0); 
>> + spin_lock_irqsave(&conf->resync_lock, flags); -
>> conf->barrier--; -	conf->nr_pending--; +	conf->barrier[idx]--; +
>> conf->total_barriers--; +	conf->nr_pending[idx]--; 
>> spin_unlock_irqrestore(&conf->resync_lock, flags); 
>> wake_up(&conf->wait_barrier); }
>> 
>> -static bool need_to_wait_for_sync(struct r1conf *conf, struct
>> bio *bio) +static void _wait_barrier(struct r1conf *conf, int
>> idx) { -	bool wait = false; - -	if (conf->array_frozen || !bio) -
>> wait = true; -	else if (conf->barrier && bio_data_dir(bio) ==
>> WRITE) { -		if ((conf->mddev->curr_resync_completed -		     >=
>> bio_end_sector(bio)) || -		    (conf->start_next_window +
>> NEXT_NORMALIO_DISTANCE -		     <= bio->bi_iter.bi_sector)) -
>> wait = false; -		else -			wait = true; +
>> spin_lock_irq(&conf->resync_lock); +	if (conf->array_frozen ||
>> conf->barrier[idx]) { +		conf->nr_waiting[idx]++; +		/* Wait for
>> the barrier to drop. */ +		wait_event_lock_irq( +
>> conf->wait_barrier, +			!conf->array_frozen &&
>> !conf->barrier[idx], +			conf->resync_lock); +
>> conf->nr_waiting[idx]--; }
>> 
>> -	return wait; +	conf->nr_pending[idx]++; +
>> spin_unlock_irq(&conf->resync_lock); }
>> 
>> -static sector_t wait_barrier(struct r1conf *conf, struct bio
>> *bio) +static void wait_read_barrier(struct r1conf *conf,
>> sector_t sector_nr) { -	sector_t sector = 0; +	long idx =
>> hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
>> 
>> spin_lock_irq(&conf->resync_lock); -	if
>> (need_to_wait_for_sync(conf, bio)) { -		conf->nr_waiting++; -		/*
>> Wait for the barrier to drop. -		 * However if there are already
>> pending -		 * requests (preventing the barrier from -		 * rising
>> completely), and the -		 * per-process bio queue isn't empty, -
>> * then don't wait, as we need to empty -		 * that queue to allow
>> conf->start_next_window -		 * to increase. -		 */ -
>> raid1_log(conf->mddev, "wait barrier"); -
>> wait_event_lock_irq(conf->wait_barrier, -
>> !conf->array_frozen && -				    (!conf->barrier || -
>> ((conf->start_next_window < -				       conf->next_resync +
>> RESYNC_SECTORS) && -				      current->bio_list && -
>> !bio_list_empty(current->bio_list))), -
>> conf->resync_lock); -		conf->nr_waiting--; -	} - -	if (bio &&
>> bio_data_dir(bio) == WRITE) { -		if (bio->bi_iter.bi_sector >=
>> conf->next_resync) { -			if (conf->start_next_window ==
>> MaxSector) -				conf->start_next_window = -					conf->next_resync
>> + -					NEXT_NORMALIO_DISTANCE; - -			if
>> ((conf->start_next_window + NEXT_NORMALIO_DISTANCE) -			    <=
>> bio->bi_iter.bi_sector) -				conf->next_window_requests++; -
>> else -				conf->current_window_requests++; -			sector =
>> conf->start_next_window; -		} +	if (conf->array_frozen) { +
>> conf->nr_waiting[idx]++; +		/* Wait for array to unfreeze */ +
>> wait_event_lock_irq( +			conf->wait_barrier, +
>> !conf->array_frozen, +			conf->resync_lock); +
>> conf->nr_waiting[idx]--; }
>> 
>> -	conf->nr_pending++; +	conf->nr_pending[idx]++; 
>> spin_unlock_irq(&conf->resync_lock); -	return sector; }
>> 
>> -static void allow_barrier(struct r1conf *conf, sector_t
>> start_next_window, -			  sector_t bi_sector) +static void
>> wait_barrier(struct r1conf *conf, sector_t sector_nr) +{ +	int
>> idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS); + +
>> _wait_barrier(conf, idx); +} + +static void
>> wait_all_barriers(struct r1conf *conf) +{ +	int idx; + +	for (idx
>> = 0; idx < BARRIER_BUCKETS_NR; idx++) +		_wait_barrier(conf,
>> idx); +} + +static void _allow_barrier(struct r1conf *conf, int
>> idx) { unsigned long flags;
>> 
>> spin_lock_irqsave(&conf->resync_lock, flags); -
>> conf->nr_pending--; -	if (start_next_window) { -		if
>> (start_next_window == conf->start_next_window) { -			if
>> (conf->start_next_window + NEXT_NORMALIO_DISTANCE -			    <=
>> bi_sector) -				conf->next_window_requests--; -			else -
>> conf->current_window_requests--; -		} else -
>> conf->current_window_requests--; - -		if
>> (!conf->current_window_requests) { -			if
>> (conf->next_window_requests) { -				conf->current_window_requests
>> = -					conf->next_window_requests; -
>> conf->next_window_requests = 0; -				conf->start_next_window += -
>> NEXT_NORMALIO_DISTANCE; -			} else -				conf->start_next_window =
>> MaxSector; -		} -	} +	conf->nr_pending[idx]--; 
>> spin_unlock_irqrestore(&conf->resync_lock, flags); 
>> wake_up(&conf->wait_barrier); }
>> 
>> +static void allow_barrier(struct r1conf *conf, sector_t
>> sector_nr) +{ +	int idx = hash_long(sector_nr,
>> BARRIER_BUCKETS_NR_BITS); + +	_allow_barrier(conf, idx); +} + 
>> +static void allow_all_barriers(struct r1conf *conf) +{ +	int
>> idx; + +	for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++) +
>> _allow_barrier(conf, idx); +} + +/* conf->resync_lock should be
>> held */ +static int get_all_pendings(struct r1conf *conf) +{ +
>> int idx, ret; + +	for (ret = 0, idx = 0; idx <
>> BARRIER_BUCKETS_NR; idx++) +		ret += conf->nr_pending[idx]; +
>> return ret; +} + +/* conf->resync_lock should be held */ +static
>> int get_all_queued(struct r1conf *conf) +{ +	int idx, ret; + +
>> for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++) +		ret +=
>> conf->nr_queued[idx]; +	return ret; +} + static void
>> freeze_array(struct r1conf *conf, int extra) { -	/* stop syncio
>> and normal IO and wait for everything to +	/* Stop sync I/O and
>> normal I/O and wait for everything to * go quite. -	 * We wait
>> until nr_pending match nr_queued+extra -	 * This is called in the
>> context of one normal IO request -	 * that has failed. Thus any
>> sync request that might be pending -	 * will be blocked by
>> nr_pending, and we need to wait for -	 * pending IO requests to
>> complete or be queued for re-try. -	 * Thus the number queued
>> (nr_queued) plus this request (extra) -	 * must match the number
>> of pending IOs (nr_pending) before -	 * we continue. +	 * This is
>> called in two situations: +	 * 1) management command handlers
>> (reshape, remove disk, quiesce). +	 * 2) one normal I/O request
>> failed. + +	 * After array_frozen is set to 1, new sync IO will
>> be blocked at +	 * raise_barrier(), and new normal I/O will
>> blocked at _wait_barrier(). +	 * The flying I/Os will either
>> complete or be queued. When everything +	 * goes quite, there are
>> only queued I/Os left. + +	 * Every flying I/O contributes to a
>> conf->nr_pending[idx], idx is the +	 * barrier bucket index which
>> this I/O request hits. When all sync and +	 * normal I/O are
>> queued, sum of all conf->nr_pending[] will match sum +	 * of all
>> conf->nr_queued[]. But normal I/O failure is an exception, +	 *
>> in handle_read_error(), we may call freeze_array() before trying
>> to +	 * fix the read error. In this case, the error read I/O is
>> not queued, +	 * so get_all_pending() == get_all_queued() + 1. +
>> * +	 * Therefore before this function returns, we need to wait
>> until +	 * get_all_pendings(conf) gets equal to
>> get_all_queued(conf)+extra. For +	 * normal I/O context, extra is
>> 1, in rested situations extra is 0. */ 
>> spin_lock_irq(&conf->resync_lock); conf->array_frozen = 1; 
>> raid1_log(conf->mddev, "wait freeze"); -
>> wait_event_lock_irq_cmd(conf->wait_barrier, -				conf->nr_pending
>> == conf->nr_queued+extra, -				conf->resync_lock, -
>> flush_pending_writes(conf)); +	wait_event_lock_irq_cmd( +
>> conf->wait_barrier, +		get_all_pendings(conf) ==
>> get_all_queued(conf)+extra, +		conf->resync_lock, +
>> flush_pending_writes(conf)); 
>> spin_unlock_irq(&conf->resync_lock); } static void
>> unfreeze_array(struct r1conf *conf) @@ -1066,64 +1094,23 @@
>> static void raid1_unplug(struct blk_plug_cb *cb, bool
>> from_schedule) kfree(plug); }
>> 
>> -static void raid1_make_request(struct mddev *mddev, struct bio *
>> bio) +static void raid1_make_read_request(struct mddev *mddev,
>> struct bio *bio) { struct r1conf *conf = mddev->private; struct
>> raid1_info *mirror; struct r1bio *r1_bio; struct bio *read_bio; -
>> int i, disks; struct bitmap *bitmap; -	unsigned long flags; const
>> int op = bio_op(bio); -	const int rw = bio_data_dir(bio); const
>> unsigned long do_sync = (bio->bi_opf & REQ_SYNC); -	const
>> unsigned long do_flush_fua = (bio->bi_opf & -						(REQ_PREFLUSH
>> | REQ_FUA)); -	struct md_rdev *blocked_rdev; -	struct blk_plug_cb
>> *cb; -	struct raid1_plug_cb *plug = NULL; -	int first_clone; int
>> sectors_handled; int max_sectors; -	sector_t start_next_window; +
>> int rdisk;
>> 
>> -	/* -	 * Register the new request and wait if the
>> reconstruction -	 * thread has put up a bar for new requests. -
>> * Continue immediately if no resync is active currently. +	/*
>> Still need barrier for READ in case that whole +	 * array is
>> frozen. */ - -	md_write_start(mddev, bio); /* wait on superblock
>> update early */ - -	if (bio_data_dir(bio) == WRITE && -
>> ((bio_end_sector(bio) > mddev->suspend_lo && -
>> bio->bi_iter.bi_sector < mddev->suspend_hi) || -
>> (mddev_is_clustered(mddev) && -
>> md_cluster_ops->area_resyncing(mddev, WRITE, -
>> bio->bi_iter.bi_sector, bio_end_sector(bio))))) { -		/* As the
>> suspend_* range is controlled by -		 * userspace, we want an
>> interruptible -		 * wait. -		 */ -		DEFINE_WAIT(w); -		for (;;)
>> { -			flush_signals(current); -
>> prepare_to_wait(&conf->wait_barrier, -					&w,
>> TASK_INTERRUPTIBLE); -			if (bio_end_sector(bio) <=
>> mddev->suspend_lo || -			    bio->bi_iter.bi_sector >=
>> mddev->suspend_hi || -			    (mddev_is_clustered(mddev) && -
>> !md_cluster_ops->area_resyncing(mddev, WRITE, -
>> bio->bi_iter.bi_sector, bio_end_sector(bio)))) -				break; -
>> schedule(); -		} -		finish_wait(&conf->wait_barrier, &w); -	} - -
>> start_next_window = wait_barrier(conf, bio); - +
>> wait_read_barrier(conf, bio->bi_iter.bi_sector); bitmap =
>> mddev->bitmap;
>> 
>> /* @@ -1149,12 +1136,9 @@ static void raid1_make_request(struct
>> mddev *mddev, struct bio * bio) bio->bi_phys_segments = 0; 
>> bio_clear_flag(bio, BIO_SEG_VALID);
>> 
>> -	if (rw == READ) { /* * read balancing logic: */ -		int rdisk; 
>> - read_again: rdisk = read_balance(conf, r1_bio, &max_sectors);
>> 
>> @@ -1176,7 +1160,6 @@ static void raid1_make_request(struct mddev
>> *mddev, struct bio * bio) atomic_read(&bitmap->behind_writes) ==
>> 0); } r1_bio->read_disk = rdisk; -		r1_bio->start_next_window =
>> 0;
>> 
>> read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev); 
>> bio_trim(read_bio, r1_bio->sector - bio->bi_iter.bi_sector, @@
>> -1232,11 +1215,89 @@ static void raid1_make_request(struct mddev
>> *mddev, struct bio * bio) } else generic_make_request(read_bio); 
>> return; +} + +static void raid1_make_write_request(struct mddev
>> *mddev, struct bio *bio) +{ +	struct r1conf *conf =
>> mddev->private; +	struct r1bio *r1_bio; +	int i, disks; +	struct
>> bitmap *bitmap; +	unsigned long flags; +	const int op =
>> bio_op(bio); +	const unsigned long do_sync = (bio->bi_opf &
>> REQ_SYNC); +	const unsigned long do_flush_fua = (bio->bi_opf & +
>> (REQ_PREFLUSH | REQ_FUA)); +	struct md_rdev *blocked_rdev; +
>> struct blk_plug_cb *cb; +	struct raid1_plug_cb *plug = NULL; +
>> int first_clone; +	int sectors_handled; +	int max_sectors; + +
>> /* +	 * Register the new request and wait if the reconstruction +
>> * thread has put up a bar for new requests. +	 * Continue
>> immediately if no resync is active currently. +	 */ + +
>> md_write_start(mddev, bio); /* wait on superblock update early
>> */ + +	if (((bio_end_sector(bio) > mddev->suspend_lo && +
>> bio->bi_iter.bi_sector < mddev->suspend_hi) || +
>> (mddev_is_clustered(mddev) && +
>> md_cluster_ops->area_resyncing(mddev, WRITE, +
>> bio->bi_iter.bi_sector, bio_end_sector(bio))))) { +		/* As the
>> suspend_* range is controlled by +		 * userspace, we want an
>> interruptible +		 * wait. +		 */ +		DEFINE_WAIT(w); + +		for (;;)
>> { +			flush_signals(current); +
>> prepare_to_wait(&conf->wait_barrier, +					&w,
>> TASK_INTERRUPTIBLE); +			if (bio_end_sector(bio) <=
>> mddev->suspend_lo || +			    bio->bi_iter.bi_sector >=
>> mddev->suspend_hi || +			    (mddev_is_clustered(mddev) && +
>> !md_cluster_ops->area_resyncing( +						mddev, +						WRITE, +
>> bio->bi_iter.bi_sector, +						bio_end_sector(bio)))) +
>> break; +			schedule(); +		} +		finish_wait(&conf->wait_barrier,
>> &w); }
>> 
>> +	wait_barrier(conf, bio->bi_iter.bi_sector); +	bitmap =
>> mddev->bitmap; + /* -	 * WRITE: +	 * make_request() can abort the
>> operation when read-ahead is being +	 * used and no empty request
>> is available. +	 * +	 */ +	r1_bio =
>> mempool_alloc(conf->r1bio_pool, GFP_NOIO); + +	r1_bio->master_bio
>> = bio; +	r1_bio->sectors = bio_sectors(bio); +	r1_bio->state =
>> 0; +	r1_bio->mddev = mddev; +	r1_bio->sector =
>> bio->bi_iter.bi_sector; + +	/* We might need to issue multiple
>> reads to different +	 * devices if there are bad blocks around,
>> so we keep +	 * track of the number of reads in
>> bio->bi_phys_segments. +	 * If this is 0, there is only one
>> r1_bio and no locking +	 * will be needed when requests complete.
>> If it is +	 * non-zero, then it is the number of not-completed
>> requests.
> 
> This comment mentions "reads".  It should probably be changed to
> discuss what happens to "writes" since this is
> raid1_make_write_request().
> 

Yes, I will fix this.
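
The corrected comment would read roughly like this (wording only a
suggestion):

	/* We might need to issue multiple writes to different devices if
	 * there are bad blocks around, so we keep track of the number of
	 * writes in bio->bi_phys_segments.  If this is 0, there is only
	 * one r1_bio and no locking will be needed when requests complete.
	 * If it is non-zero, then it is the number of not-completed
	 * requests.
	 */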


>> */ +	bio->bi_phys_segments = 0; +	bio_clear_flag(bio,
>> BIO_SEG_VALID); + if (conf->pending_count >= max_queued_requests)
>> { md_wakeup_thread(mddev->thread); raid1_log(mddev, "wait
>> queued"); @@ -1256,7 +1317,6 @@ static void
>> raid1_make_request(struct mddev *mddev, struct bio * bio)
>> 
>> disks = conf->raid_disks * 2; retry_write: -
>> r1_bio->start_next_window = start_next_window; blocked_rdev =
>> NULL; rcu_read_lock(); max_sectors = r1_bio->sectors; @@ -1324,25
>> +1384,15 @@ static void raid1_make_request(struct mddev *mddev,
>> struct bio * bio) if (unlikely(blocked_rdev)) { /* Wait for this
>> device to become unblocked */ int j; -		sector_t old =
>> start_next_window;
>> 
>> for (j = 0; j < i; j++) if (r1_bio->bios[j]) 
>> rdev_dec_pending(conf->mirrors[j].rdev, mddev); r1_bio->state =
>> 0; -		allow_barrier(conf, start_next_window,
>> bio->bi_iter.bi_sector); +		allow_barrier(conf,
>> bio->bi_iter.bi_sector); raid1_log(mddev, "wait rdev %d blocked",
>> blocked_rdev->raid_disk); md_wait_for_blocked_rdev(blocked_rdev,
>> mddev); -		start_next_window = wait_barrier(conf, bio); -		/* -
>> * We must make sure the multi r1bios of bio have -		 * the same
>> value of bi_phys_segments -		 */ -		if (bio->bi_phys_segments &&
>> old && -		    old != start_next_window) -			/* Wait for the
>> former r1bio(s) to complete */ -
>> wait_event(conf->wait_barrier, -				   bio->bi_phys_segments ==
>> 1); +		wait_barrier(conf, bio->bi_iter.bi_sector); goto
>> retry_write; }
>> 
>> @@ -1464,6 +1514,31 @@ static void raid1_make_request(struct
>> mddev *mddev, struct bio * bio) wake_up(&conf->wait_barrier); }
>> 
>> +static void raid1_make_request(struct mddev *mddev, struct bio
>> *bio) +{ +	void (*make_request_fn)(struct mddev *mddev, struct
>> bio *bio); +	struct bio *split; +	sector_t sectors; + +
>> make_request_fn = (bio_data_dir(bio) == READ) ? +
>> raid1_make_read_request : +			  raid1_make_write_request; + +	/*
>> if bio exceeds barrier unit boundary, split it */ +	do { +
>> sectors = align_to_barrier_unit_end(bio->bi_iter.bi_sector, +
>> bio_sectors(bio)); +		if (sectors < bio_sectors(bio)) { +			split
>> = bio_split(bio, sectors, GFP_NOIO, fs_bio_set); +
>> bio_chain(split, bio); +		} else { +			split = bio; +		} + +
>> make_request_fn(mddev, split); +	} while (split != bio); +} + 
>> static void raid1_status(struct seq_file *seq, struct mddev
>> *mddev) { struct r1conf *conf = mddev->private; @@ -1552,19
>> +1627,11 @@ static void print_conf(struct r1conf *conf)
>> 
>> static void close_sync(struct r1conf *conf) { -
>> wait_barrier(conf, NULL); -	allow_barrier(conf, 0, 0); +
>> wait_all_barriers(conf); +	allow_all_barriers(conf);
>> 
>> mempool_destroy(conf->r1buf_pool); conf->r1buf_pool = NULL; - -
>> spin_lock_irq(&conf->resync_lock); -	conf->next_resync =
>> MaxSector - 2 * NEXT_NORMALIO_DISTANCE; -	conf->start_next_window
>> = MaxSector; -	conf->current_window_requests += -
>> conf->next_window_requests; -	conf->next_window_requests = 0; -
>> spin_unlock_irq(&conf->resync_lock); }
>> 
>> static int raid1_spare_active(struct mddev *mddev) @@ -2311,8
>> +2378,9 @@ static void handle_sync_write_finished(struct r1conf
>> *conf, struct r1bio *r1_bio
>> 
>> static void handle_write_finished(struct r1conf *conf, struct
>> r1bio *r1_bio) { -	int m; +	int m, idx; bool fail = false; + for
>> (m = 0; m < conf->raid_disks * 2 ; m++) if (r1_bio->bios[m] ==
>> IO_MADE_GOOD) { struct md_rdev *rdev = conf->mirrors[m].rdev; @@
>> -2338,7 +2406,8 @@ static void handle_write_finished(struct
>> r1conf *conf, struct r1bio *r1_bio) if (fail) { 
>> spin_lock_irq(&conf->device_lock); list_add(&r1_bio->retry_list,
>> &conf->bio_end_io_list); -		conf->nr_queued++; +		idx =
>> hash_long(r1_bio->sector, BARRIER_BUCKETS_NR_BITS); +
>> conf->nr_queued[idx]++; spin_unlock_irq(&conf->device_lock); 
>> md_wakeup_thread(conf->mddev->thread); } else { @@ -2460,6
>> +2529,7 @@ static void raid1d(struct md_thread *thread) struct
>> r1conf *conf = mddev->private; struct list_head *head =
>> &conf->retry_list; struct blk_plug plug; +	int idx;
>> 
>> md_check_recovery(mddev);
>> 
>> @@ -2467,17 +2537,18 @@ static void raid1d(struct md_thread
>> *thread) !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) { 
>> LIST_HEAD(tmp); spin_lock_irqsave(&conf->device_lock, flags); -
>> if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) { -
>> while (!list_empty(&conf->bio_end_io_list)) { -
>> list_move(conf->bio_end_io_list.prev, &tmp); -
>> conf->nr_queued--; -			} -		} +		if
>> (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) +
>> list_splice_init(&conf->bio_end_io_list, &tmp); 
>> spin_unlock_irqrestore(&conf->device_lock, flags); while
>> (!list_empty(&tmp)) { r1_bio = list_first_entry(&tmp, struct
>> r1bio, retry_list); list_del(&r1_bio->retry_list); +			idx =
>> hash_long(r1_bio->sector, +					BARRIER_BUCKETS_NR_BITS); +
>> spin_lock_irqsave(&conf->device_lock, flags); +
>> conf->nr_queued[idx]--; +
>> spin_unlock_irqrestore(&conf->device_lock, flags); if
>> (mddev->degraded) set_bit(R1BIO_Degraded, &r1_bio->state); if
>> (test_bit(R1BIO_WriteError, &r1_bio->state)) @@ -2498,7 +2569,8
>> @@ static void raid1d(struct md_thread *thread) } r1_bio =
>> list_entry(head->prev, struct r1bio, retry_list); 
>> list_del(head->prev); -		conf->nr_queued--; +		idx =
>> hash_long(r1_bio->sector, BARRIER_BUCKETS_NR_BITS); +
>> conf->nr_queued[idx]--; 
>> spin_unlock_irqrestore(&conf->device_lock, flags);
>> 
>> mddev = r1_bio->mddev; @@ -2537,7 +2609,6 @@ static int
>> init_resync(struct r1conf *conf) conf->poolinfo); if
>> (!conf->r1buf_pool) return -ENOMEM; -	conf->next_resync = 0; 
>> return 0; }
>> 
>> @@ -2566,6 +2637,7 @@ static sector_t raid1_sync_request(struct
>> mddev *mddev, sector_t sector_nr, int still_degraded = 0; int
>> good_sectors = RESYNC_SECTORS; int min_bad = 0; /* number of
>> sectors that are bad in all devices */ +	int idx =
>> hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
>> 
>> if (!conf->r1buf_pool) if (init_resync(conf)) @@ -2615,7 +2687,7
>> @@ static sector_t raid1_sync_request(struct mddev *mddev,
>> sector_t sector_nr, * If there is non-resync activity waiting for
>> a turn, then let it * though before starting on this new sync
>> request. */ -	if (conf->nr_waiting) +	if (conf->nr_waiting[idx]) 
>> schedule_timeout_uninterruptible(1);
>> 
>> /* we are incrementing sector_nr below. To be safe, we check
>> against @@ -2642,6 +2714,8 @@ static sector_t
>> raid1_sync_request(struct mddev *mddev, sector_t sector_nr, 
>> r1_bio->sector = sector_nr; r1_bio->state = 0; 
>> set_bit(R1BIO_IsSync, &r1_bio->state); +	/* make sure
>> good_sectors won't go across barrier unit boundary */ +
>> good_sectors = align_to_barrier_unit_end(sector_nr,
>> good_sectors);
>> 
>> for (i = 0; i < conf->raid_disks * 2; i++) { struct md_rdev
>> *rdev; @@ -2927,9 +3001,6 @@ static struct r1conf
>> *setup_conf(struct mddev *mddev) conf->pending_count = 0; 
>> conf->recovery_disabled = mddev->recovery_disabled - 1;
>> 
>> -	conf->start_next_window = MaxSector; -
>> conf->current_window_requests = conf->next_window_requests = 0; 
>> - err = -EIO; for (i = 0; i < conf->raid_disks * 2; i++) {
>> 
>> diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h index
>> c52ef42..817115d 100644 --- a/drivers/md/raid1.h +++
>> b/drivers/md/raid1.h @@ -1,6 +1,14 @@ #ifndef _RAID1_H #define
>> _RAID1_H
>> 
>> +/* each barrier unit size is 64MB fow now + * note: it must be
>> larger than RESYNC_DEPTH + */ +#define BARRIER_UNIT_SECTOR_BITS
>> 17 +#define BARRIER_UNIT_SECTOR_SIZE	(1<<17) +#define
>> BARRIER_BUCKETS_NR_BITS		9 +#define BARRIER_BUCKETS_NR
>> (1<<BARRIER_BUCKETS_NR_BITS) + struct raid1_info { struct md_rdev
>> *rdev; sector_t	head_position; @@ -35,25 +43,6 @@ struct r1conf
>> { */ int			raid_disks;
>> 
>> -	/* During resync, read_balancing is only allowed on the part -
>> * of the array that has been resynced.  'next_resync' tells us -
>> * where that is. -	 */ -	sector_t		next_resync; - -	/* When raid1
>> starts resync, we divide array into four partitions -	 *
>> |---------|--------------|---------------------|-------------| -
>> *        next_resync   start_next_window       end_window -	 *
>> start_next_window = next_resync + NEXT_NORMALIO_DISTANCE -	 *
>> end_window = start_next_window + NEXT_NORMALIO_DISTANCE -	 *
>> current_window_requests means the count of normalIO between -	 *
>> start_next_window and end_window. -	 * next_window_requests means
>> the count of normalIO after end_window. -	 * */ -	sector_t
>> start_next_window; -	int			current_window_requests; -	int
>> next_window_requests; - spinlock_t		device_lock;
>> 
>> /* list of 'struct r1bio' that need to be processed by raid1d, @@
>> -79,10 +68,11 @@ struct r1conf { */ wait_queue_head_t
>> wait_barrier; spinlock_t		resync_lock; -	int			nr_pending; -	int
>> nr_waiting; -	int			nr_queued; -	int			barrier; +	int
>> nr_pending[BARRIER_BUCKETS_NR]; +	int
>> nr_waiting[BARRIER_BUCKETS_NR]; +	int
>> nr_queued[BARRIER_BUCKETS_NR]; +	int
>> barrier[BARRIER_BUCKETS_NR]; +	int			total_barriers; int
>> array_frozen;
>> 
>> /* Set to 1 if a full sync is needed, (fresh device added). @@
>> -135,7 +125,6 @@ struct r1bio { * in this BehindIO request */ 
>> sector_t		sector; -	sector_t		start_next_window; int			sectors; 
>> unsigned long		state; struct mddev		*mddev; --

Thanks for your review, I will include all the fixes in the next
version of the patch.

Coly
