On 2016/11/30 3:29 AM, Shaohua Li wrote:
> On Mon, Nov 28, 2016 at 02:42:22PM +0800, Coly Li wrote:
>> On 2016/11/23 5:35 AM, Shaohua Li wrote:
>>> On Tue, Nov 22, 2016 at 05:54:00AM +0800, Coly Li wrote:
>>>> 'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")'
>>>> introduces a sliding resync window for the raid1 I/O barrier. The idea
>>>> limits I/O barriers to happen only inside a sliding resync window; regular
>>>> I/Os outside this resync window don't need to wait for the barrier any
>>>> more. On a large raid1 device, it helps a lot to improve parallel write
>>>> I/O throughput when there are background resync I/Os running at the
>>>> same time.
>>>>
>>>> The idea of the sliding resync window is awesome, but there are several
>>>> challenges which are very difficult to solve,
>>>>  - code complexity
>>>>    The sliding resync window requires several variables to work
>>>>    collectively; this is complex and very hard to make work correctly.
>>>>    Just grep "Fixes: 79ef3a8aa1" in the kernel git log, there are 8 more
>>>>    patches fixing the original resync window patch. This is not the end,
>>>>    any further related modification may easily introduce more regressions.
>>>>  - multiple sliding resync windows
>>>>    Currently the raid1 code only has a single sliding resync window, so we
>>>>    cannot do parallel resync with the current I/O barrier implementation.
>>>>    Implementing multiple resync windows is much more complex, and very
>>>>    hard to make correct.
>>>>
>>>> Therefore I decided to implement a much simpler raid1 I/O barrier; by
>>>> removing the resync window code, I believe life will be much easier.
>>>>
>>>> The brief idea of the simpler barrier is,
>>>>  - Do not maintain a global unique resync window.
>>>>  - Use multiple hash buckets to reduce I/O barrier conflicts; a regular
>>>>    I/O only has to wait for a resync I/O when both of them have the same
>>>>    barrier bucket index, and vice versa.
>>>>  - I/O barrier conflicts can be reduced to an acceptable number if there
>>>>    are enough barrier buckets.
>>>>
>>>> Here I explain how the barrier buckets are designed,
>>>>  - BARRIER_UNIT_SECTOR_SIZE
>>>>    The whole LBA address space of a raid1 device is divided into multiple
>>>>    barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE. A bio request
>>>>    won't go across the border of a barrier unit, which means the maximum
>>>>    bio size is BARRIER_UNIT_SECTOR_SIZE<<9 bytes.
>>>>  - BARRIER_BUCKETS_NR
>>>>    There are BARRIER_BUCKETS_NR buckets in total. If multiple I/O requests
>>>>    hit different barrier units, they only need to compete for the I/O
>>>>    barrier with other I/Os which hit the same barrier bucket index. The
>>>>    index of the barrier bucket which a bio should look at is calculated by
>>>>    get_barrier_bucket_idx(),
>>>> 	(sector >> BARRIER_UNIT_SECTOR_BITS) % BARRIER_BUCKETS_NR
>>>>    where sector is the start sector number of the bio.
>>>>    align_to_barrier_unit_end() will make sure the final bio sent into
>>>>    generic_make_request() won't exceed the border of the barrier unit size.
>>>>  - BARRIER_BUCKETS_NR
>>>>    The number of barrier buckets is defined by,
>>>> 	#define BARRIER_BUCKETS_NR	(PAGE_SIZE/sizeof(long))
>>>>    For a 4KB page size, there are 512 buckets for each raid1 device. That
>>>>    means the probability of a barrier conflict for fully random I/O may be
>>>>    reduced down to 1/512.
>>>
>>> Thanks! The idea is awesome and does make the code easier to understand.
>>>
>>> For hash conflicts, I'm worried about one case: resync does sequential
>>> access. Say we have a sequential normal IO.
>>> If the first normal IO and the resync IO conflict, it's possible they will
>>> keep conflicting subsequently, since both are doing sequential access. We
>>> can change the hash to something like this:
>>>
>>> for the first 64M * 512, the hash is 0, 1, ... 511
>>> For the second 64M * 512, the hash is 1, 3, 5 ...
>>> The third 64M * 512, the hash is 2, 5, 8 ...
>>>
>>> It should be very easy to implement something like this, and this should
>>> reduce the conflicts of sequential access.
>>>
>>
>> Hi Shaohua,
>>
>> What I was concerned about was the memory footprint. For very fast
>> sequential I/O, linearly mapped buckets mean each (64 byte) cache line
>> contains 8 long integers, which is very friendly for CPU caching,
>> - sequentially writing 8 barrier units only requires 1 memory fetch for
>>   each of the barrier variables (barrier[], nr_pending[], nr_waiting[],
>>   nr_queued[]).
>> - memory prefetch may have a positive effect for sequential I/O.
>>
>> It will be very rare that resync I/O and regular I/O always conflict in the
>> same barrier bucket: resync I/O throughput is usually slower than regular
>> I/O, so it is very hard to keep them always conflicting in the same bucket.
>> If this condition really happens, the current sliding resync window
>> implementation may also have a similar conflict (regular I/O always hits
>> the resync window). So the barrier buckets don't make things worse.
>>
>> If we use a non-contiguous mapping, the memory footprint of the buckets
>> will be quite scattered, and a sequential I/O will occupy more cache lines,
>> which increases the probability of cache bouncing if multiple sequential
>> I/Os (at different offsets) happen on different CPU cores.
>>
>> This is why I use a very simple linear bucket hashing.
>
> I don't think the memory footprint matters much. And wasting a little memory
> (e.g. making the barrier and related fields cacheline aligned) doesn't
> matter here either, if necessary. If we can reduce the hash conflicts in an
> easy way, why not? I think Neil's suggestion (using hash_long) makes sense.
>
Okay, I will do it in the next version.

>>>> Compared to the single sliding resync window,
>>>>  - Currently resync I/O grows linearly, therefore regular and resync I/O
>>>>    will conflict within a single barrier unit. So it is similar to the
>>>>    single sliding resync window.
>>>>  - But a barrier bucket is shared by all barrier units with the same
>>>>    barrier bucket index, so the probability of conflict might be higher
>>>>    than with the single sliding resync window, in the condition that write
>>>>    I/Os always hit barrier units which have the same barrier bucket index
>>>>    as the resync I/Os. This is a very rare condition in real I/O
>>>>    workloads, I cannot imagine how it could happen in practice.
>>>>  - Therefore we can achieve a good enough low conflict rate with a much
>>>>    simpler barrier algorithm and implementation.
>>>>
>>>> If a user has a (really) large raid1 device, for example 10PB in size, we
>>>> may just increase the bucket number BARRIER_BUCKETS_NR. Now this is a
>>>> macro; it could become a variable defined at raid1 creation time in the
>>>> future.
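
(A note for the next version: switching get_barrier_bucket_idx() to hash_long(),
as Neil suggested, could look roughly like the untested sketch below.
BARRIER_BUCKETS_NR_BITS is a new macro I would add, here assumed to be
ilog2(BARRIER_BUCKETS_NR), i.e. 9 for 512 buckets; <linux/hash.h> would be
needed for hash_long().)

	/* untested sketch, not part of this patch */
	#define BARRIER_BUCKETS_NR_BITS	9	/* ilog2(BARRIER_BUCKETS_NR) */
	#define BARRIER_BUCKETS_NR	(1<<BARRIER_BUCKETS_NR_BITS)

	static inline long get_barrier_bucket_idx(sector_t sector)
	{
		/* Hash the barrier unit number instead of mapping it
		 * linearly, so sequential resync I/O and sequential regular
		 * I/O are less likely to keep landing in the same bucket.
		 */
		return (long)hash_long((unsigned long)(sector >> BARRIER_UNIT_SECTOR_BITS),
				       BARRIER_BUCKETS_NR_BITS);
	}

Only the mapping from barrier unit to bucket changes; how the buckets are used
stays exactly as in this patch.
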
>>>>
>>>> There are two changes which should be noticed,
>>>>  - In raid1d(), I change the code so that conf->nr_queued[idx] is
>>>>    decreased inside a single loop; it looks like this,
>>>> 	spin_lock_irqsave(&conf->device_lock, flags);
>>>> 	conf->nr_queued[idx]--;
>>>> 	spin_unlock_irqrestore(&conf->device_lock, flags);
>>>>    This change generates more spin lock operations, but in the next patch
>>>>    of this patch set it will be replaced by a single line of code,
>>>> 	atomic_dec(&conf->nr_queued[idx]);
>>>>    so we don't need to worry about the spin lock cost here.
>>>>  - In raid1_make_request(), wait_barrier() is replaced by,
>>>>    a) wait_read_barrier(): wait for the barrier in the regular read I/O
>>>>       code path
>>>>    b) wait_barrier(): wait for the barrier in the regular write I/O code
>>>>       path
>>>>    The difference is that wait_read_barrier() only waits if the array is
>>>>    frozen. I am not able to combine them into one function, because they
>>>>    must receive different data types in their argument lists.
>>>>  - align_to_barrier_unit_end() is called to make sure both regular and
>>>>    resync I/O won't go across the border of a barrier unit.
>>>>
>>>> Open question:
>>>>  - Needs review from an md clustering developer, I don't touch the
>>>>    related code for now.
>>>
>>> Don't think it matters, but please keep your eyes open, Guoqing!
>>>
>>>> Signed-off-by: Coly Li <colyli@xxxxxxx>
>>>> Cc: Shaohua Li <shli@xxxxxx>
>>>> Cc: Neil Brown <neilb@xxxxxxx>
>>>> Cc: Johannes Thumshirn <jthumshirn@xxxxxxx>
>>>> Cc: Guoqing Jiang <gqjiang@xxxxxxxx>
>>>> ---
>>>>  drivers/md/raid1.c | 389 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------------
>>>>  drivers/md/raid1.h |  42 +++++-------
>>>>  2 files changed, 242 insertions(+), 189 deletions(-)
>>>>
>>>> Index: linux-raid1/drivers/md/raid1.c
>>>> ===================================================================
>>>> --- linux-raid1.orig/drivers/md/raid1.c
>>>> +++ linux-raid1/drivers/md/raid1.c
>>>> @@ -66,9 +66,8 @@
>>>>   */
>>>>  static int max_queued_requests = 1024;
>>>>
>>>> -static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
>>>> -			  sector_t bi_sector);
>>>> -static void lower_barrier(struct r1conf *conf);
>>>> +static void allow_barrier(struct r1conf *conf, sector_t sector_nr);
>>>> +static void lower_barrier(struct r1conf *conf, sector_t sector_nr);
>>>>
>>>>  static void * r1bio_pool_alloc(gfp_t gfp_flags, void *data)
>>>>  {
>>>> @@ -92,7 +91,6 @@ static void r1bio_pool_free(void *r1_bio
>>>>  #define RESYNC_WINDOW_SECTORS (RESYNC_WINDOW >> 9)
>>>>  #define CLUSTER_RESYNC_WINDOW (16 * RESYNC_WINDOW)
>>>>  #define CLUSTER_RESYNC_WINDOW_SECTORS (CLUSTER_RESYNC_WINDOW >> 9)
>>>> -#define NEXT_NORMALIO_DISTANCE (3 * RESYNC_WINDOW_SECTORS)
>>>>
>>>>  static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
>>>>  {
>>>> @@ -198,6 +196,7 @@ static void put_buf(struct r1bio *r1_bio
>>>>  {
>>>>  	struct r1conf *conf = r1_bio->mddev->private;
>>>>  	int i;
>>>> +	sector_t sector_nr = r1_bio->sector;
>>>>
>>>>  	for (i = 0; i < conf->raid_disks * 2; i++) {
>>>>  		struct bio *bio = r1_bio->bios[i];
>>>> @@ -207,7 +206,7 @@ static void put_buf(struct r1bio *r1_bio
>>>>
>>>>  	mempool_free(r1_bio, conf->r1buf_pool);
>>>>
>>>> -	lower_barrier(conf);
>>>> +	lower_barrier(conf, sector_nr);
>>>>  }
>>>>
>>>>  static void reschedule_retry(struct r1bio *r1_bio)
>>>> @@ -215,10 +214,15 @@ static void reschedule_retry(struct r1bi
>>>>  	unsigned long flags;
>>>>  	struct mddev *mddev = r1_bio->mddev;
>>>>  	struct r1conf *conf = mddev->private;
>>>> +	sector_t sector_nr;
>>>> +	long idx;
>>>> +
>>>> +	sector_nr = r1_bio->sector;
>>>> +	idx = get_barrier_bucket_idx(sector_nr);
>>>>
>>>>  	spin_lock_irqsave(&conf->device_lock, flags);
>>>>  	list_add(&r1_bio->retry_list, &conf->retry_list);
>>>> -	conf->nr_queued ++;
>>>> +	conf->nr_queued[idx]++;
>>>>  	spin_unlock_irqrestore(&conf->device_lock, flags);
>>>>
>>>>  	wake_up(&conf->wait_barrier);
>>>> @@ -235,8 +239,6 @@ static void call_bio_endio(struct r1bio
>>>>  	struct bio *bio = r1_bio->master_bio;
>>>>  	int done;
>>>>  	struct r1conf *conf = r1_bio->mddev->private;
>>>> -	sector_t start_next_window = r1_bio->start_next_window;
>>>> -	sector_t bi_sector = bio->bi_iter.bi_sector;
>>>>
>>>>  	if (bio->bi_phys_segments) {
>>>>  		unsigned long flags;
>>>> @@ -255,19 +257,14 @@ static void call_bio_endio(struct r1bio
>>>>  	if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
>>>>  		bio->bi_error = -EIO;
>>>>
>>>> -	if (done) {
>>>> +	if (done)
>>>>  		bio_endio(bio);
>>>> -		/*
>>>> -		 * Wake up any possible resync thread that waits for the device
>>>> -		 * to go idle.
>>>> -		 */
>>>> -		allow_barrier(conf, start_next_window, bi_sector);
>>>> -	}
>>>>  }
>>>>
>>>>  static void raid_end_bio_io(struct r1bio *r1_bio)
>>>>  {
>>>>  	struct bio *bio = r1_bio->master_bio;
>>>> +	struct r1conf *conf = r1_bio->mddev->private;
>>>>
>>>>  	/* if nobody has done the final endio yet, do it now */
>>>>  	if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
>>>> @@ -278,6 +275,12 @@ static void raid_end_bio_io(struct r1bio
>>>>
>>>>  		call_bio_endio(r1_bio);
>>>>  	}
>>>> +
>>>> +	/*
>>>> +	 * Wake up any possible resync thread that waits for the device
>>>> +	 * to go idle.
>>>> +	 */
>>>> +	allow_barrier(conf, r1_bio->sector);
>>>
>>> Why this change?
>>>
>>
>> I moved allow_barrier() here on purpose. The current raid1 code raises
>> nr_pending for each master bio; the barrier buckets patch raises
>> nr_pending[idx] for each r1_bio.
>>
>> The current code increases nr_pending for each master bio (the bio
>> structure received by raid1_make_request()). A master bio may be split
>> into multiple r1_bio structures; when all the r1_bios complete, nr_pending
>> is decreased for the original master bio.
>>
>> Since the master bio may go across multiple barrier units, in this patch
>> the master bio will be split into multiple r1_bio structures, one for each
>> barrier unit. In this case we need to call wait_barrier() for each r1_bio,
>> and call allow_barrier() on each r1_bio completion. This is why
>> allow_barrier() is moved from master bio completion time to r1_bio
>> completion time.
>>
>> A master bio may also be split into multiple r1_bios if bad blocks are
>> encountered, and these r1_bios may stay in the same barrier unit. In this
>> case the different r1_bios do increase nr_pending[] for the same bucket
>> index, so we should also call wait_barrier() for each r1_bio and call
>> allow_barrier() at its completion time.
>
> I think if you do bio_split before raid1_make_request, these issues go away.
>
Let me try it :-)
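
Roughly what I have in mind, modelled on how raid10_make_request() uses
bio_split() — an untested sketch only, where __raid1_make_request() would be
the current raid1_make_request() renamed, and the flush/plug handling is left
out for brevity:

	static void raid1_make_request(struct mddev *mddev, struct bio *bio)
	{
		struct bio *split;
		sector_t sectors;

		do {
			/* If the bio crosses a barrier unit boundary, split it
			 * so that everything below __raid1_make_request() only
			 * sees bios which stay inside one barrier bucket.
			 */
			sectors = align_to_barrier_unit_end(
					bio->bi_iter.bi_sector, bio_sectors(bio));
			if (sectors < bio_sectors(bio)) {
				split = bio_split(bio, sectors, GFP_NOIO,
						  fs_bio_set);
				bio_chain(split, bio);
			} else {
				split = bio;
			}
			__raid1_make_request(mddev, split);
		} while (split != bio);
	}

The helper is reused here only to compute the split length; whether it can
then be dropped everywhere else is something I still need to check.
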
>>>>  	free_r1bio(r1_bio);
>>>>  }
>>>>
>>>> @@ -311,6 +314,7 @@ static int find_bio_disk(struct r1bio *r
>>>>  	return mirror;
>>>>  }
>>>>
>>>> +/* bi_end_io callback for regular READ bio */
>>>>  static void raid1_end_read_request(struct bio *bio)
>>>>  {
>>>>  	int uptodate = !bio->bi_error;
>>>> @@ -490,6 +494,25 @@ static void raid1_end_write_request(stru
>>>>  	bio_put(to_put);
>>>>  }
>>>>
>>>> +static sector_t align_to_barrier_unit_end(sector_t start_sector,
>>>> +					  sector_t sectors)
>>>> +{
>>>> +	sector_t len;
>>>> +
>>>> +	WARN_ON(sectors == 0);
>>>> +	/* len is the number of sectors from start_sector to end of the
>>>> +	 * barrier unit which start_sector belongs to.
>>>> +	 */
>>>> +	len = ((start_sector + sectors + (1<<BARRIER_UNIT_SECTOR_BITS) - 1) &
>>>> +	       (~(BARRIER_UNIT_SECTOR_SIZE - 1))) -
>>>> +	      start_sector;
>>>> +
>>>> +	if (len > sectors)
>>>> +		len = sectors;
>>>> +
>>>> +	return len;
>>>> +}
>>>
>>> I'd prefer calling bio_split early in raid1_make_request and splitting the
>>> bio if it crosses the bucket boundary. Please see raid10_make_request for
>>> an example. Resync will not cross the boundary, so the code will not need
>>> to worry about the boundary. I believe this will make the code simpler
>>> (all the align_to_barrier_unit_end() calls can be removed) and easier to
>>> understand.
>>>
>>
>> Indeed, my first implementation used bio_split(). The reasons why I didn't
>> use it in the end are,
>> - indeed I need to write more code.
>> - I can't simply use the existing r1_bio completion code to call
>>   bio_endio() on the master bio when bio->bi_phys_segments reaches zero
>>   (see call_bio_endio()), because r1_bio->master_bio is the split bio, not
>>   the original master bio received by raid1_make_request().
>>
>> Therefore I finally decided to use align_to_barrier_unit_end() to generate
>> more r1_bio structures when the original master bio goes across the barrier
>> unit size; it needs fewer modifications in this patch.
>
> That bi_phys_segments issue doesn't make sense to me. Please see how raid10
> uses bio_split:
> - rename raid1_make_request to __raid1_make_request
> - add a raid1_make_request, split the bio there, and then call
>   __raid1_make_request
>
> The __raid1_make_request does everything it currently does, so the
> bi_phys_segments handling should still work. This might add more code, but
> now we don't need to worry about the boundary problem anywhere else except
> the new raid1_make_request. For example, the allow_barrier issue goes away
> with this approach. This will make the code more understandable and less
> error prone.
>
Thanks for the hint. I will use this method in the next version.

>>>> +
>>>>  /*
>>>>   * This routine returns the disk from which the requested read should
>>>>   * be done. There is a per-array 'next expected sequential IO' sector
>>>> @@ -691,6 +714,7 @@ static int read_balance(struct r1conf *c
>>>>  		conf->mirrors[best_disk].next_seq_sect = this_sector + sectors;
>>>>  	}
>>>>  	rcu_read_unlock();
>>>> +	sectors = align_to_barrier_unit_end(this_sector, sectors);
>>>>  	*max_sectors = sectors;
>>>>
>>>>  	return best_disk;
>>>> @@ -779,168 +803,174 @@ static void flush_pending_writes(struct
>>>>   * there is no normal IO happening. It must arrange to call
>>>>   * lower_barrier when the particular background IO completes.
>>>>   */
>>>>
>>>> +
>>>>  static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
>>>>  {
>>>> +	long idx = get_barrier_bucket_idx(sector_nr);
>>>> +
>>>>  	spin_lock_irq(&conf->resync_lock);
>>>>
>>>>  	/* Wait until no block IO is waiting */
>>>> -	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting,
>>>> +	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting[idx],
>>>>  			    conf->resync_lock);
>>>>
>>>>  	/* block any new IO from starting */
>>>> -	conf->barrier++;
>>>> -	conf->next_resync = sector_nr;
>>>> +	conf->barrier[idx]++;
>>>>
>>>>  	/* For these conditions we must wait:
>>>>  	 * A: while the array is in frozen state
>>>> -	 * B: while barrier >= RESYNC_DEPTH, meaning resync reach
>>>> -	 *    the max count which allowed.
>>>> -	 * C: next_resync + RESYNC_SECTORS > start_next_window, meaning
>>>> -	 *    next resync will reach to the window which normal bios are
>>>> -	 *    handling.
>>>> -	 * D: while there are any active requests in the current window.
>>>> +	 * B: while conf->nr_pending[idx] is not 0, meaning regular I/O
>>>> +	 *    existing in sector number ranges corresponding to idx.
>>>> +	 * C: while conf->barrier[idx] >= RESYNC_DEPTH, meaning resync reach
>>>> +	 *    the max count which allowed in sector number ranges
>>>> +	 *    conrresponding to idx.
>>>>  	 */
>>>>  	wait_event_lock_irq(conf->wait_barrier,
>>>> -			    !conf->array_frozen &&
>>>> -			    conf->barrier < RESYNC_DEPTH &&
>>>> -			    conf->current_window_requests == 0 &&
>>>> -			    (conf->start_next_window >=
>>>> -			     conf->next_resync + RESYNC_SECTORS),
>>>> +			    !conf->array_frozen && !conf->nr_pending[idx] &&
>>>> +			    conf->barrier[idx] < RESYNC_DEPTH,
>>>>  			    conf->resync_lock);
>>>
>>> So there is a slight behavior change. Originally we guarantee no more than
>>> RESYNC_DEPTH syncs. Now this only checks 'conf->barrier[idx] <
>>> RESYNC_DEPTH'. What about barrier[idx-1], barrier[idx-2]...? It's possible
>>> that conf->barrier[idx] < RESYNC_DEPTH, but barrier[idx] + barrier[idx-1] >
>>> RESYNC_DEPTH. Not sure how important this is, but at least we can check
>>> barrier[idx] + barrier[idx-1].
>>
>> The original code wants to raise multiple barriers, but fewer than
>> RESYNC_DEPTH. It helps resync I/O throughput with less conflict between
>> resync I/O and regular I/O. Considering that conf->barrier is a global
>> barrier, a large RESYNC_DEPTH would starve regular I/O, so it is only set
>> to 32 for the whole raid1 device.
>>
>> In the barrier buckets patch, conf->barrier[idx] only takes effect in a
>> single bucket. More barriers raised in one barrier bucket won't interfere
>> with regular I/O in other barrier buckets, therefore we can raise many more
>> resync I/O barriers, which is good for resync I/O throughput without
>> starving more regular I/Os.
>>
>> If we have 512 barrier buckets, that means 512*RESYNC_DEPTH barriers can be
>> raised on the whole raid1 device, which allows more parallel resync I/O.
>>
>> Before implementing parallel resync, I keep RESYNC_DEPTH as 32 in this
>> patch, because we only have one resync I/O, which hits one barrier bucket,
>> so the same RESYNC_DEPTH won't change the barrier behavior now. Later, when
>> we have parallel resync I/O, let's see whether we need to modify this
>> value, or still keep it as 32.
>
> The problem is that resync writes could harm the latency of regular IO. Even
> though we have a resync limitation mechanism, sending a lot of resync writes
> out could harm regular IO latency in the worst case. If the application has
> P99 metrics, more resync IO will definitely harm it. This will not be easy
> to observe though. Could we have a global pending counter to record barrier
> IO and use it in the same way as the old code? That said, I'm not sure how
> important this is, but this could make things worse, so I'd like to avoid
> the regression.
>
Okay, I will maintain a global barrier depth counter, and keep the global
barrier depth at 32, as the current upstream code does.
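
One possible shape for that, as an untested sketch: add a conf->barrier_total
counter next to the per-bucket conf->barrier[] array (the field name is just a
placeholder), bump it in raise_barrier() and drop it in lower_barrier(), and
let raise_barrier() also wait on the total, so the number of in-flight resync
barriers over the whole device still stays below RESYNC_DEPTH like the old
code:

	static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
	{
		long idx = get_barrier_bucket_idx(sector_nr);

		spin_lock_irq(&conf->resync_lock);

		/* Wait until no block IO is waiting in this bucket */
		wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting[idx],
				    conf->resync_lock);

		/* block any new IO from starting in this bucket */
		conf->barrier[idx]++;
		conf->barrier_total++;	/* placeholder: total over all buckets */

		/* Wait while the array is frozen, regular I/O is pending in
		 * this bucket, or the device-wide resync depth has reached
		 * RESYNC_DEPTH (preserving the old global limit).
		 */
		wait_event_lock_irq(conf->wait_barrier,
				    !conf->array_frozen &&
				    !conf->nr_pending[idx] &&
				    conf->barrier_total < RESYNC_DEPTH,
				    conf->resync_lock);

		conf->nr_pending[idx]++;
		spin_unlock_irq(&conf->resync_lock);
	}

lower_barrier() would then decrement barrier_total together with barrier[idx]
under the same lock.
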
>>>> -
>>>> -	conf->nr_pending++;
>>>> +	conf->nr_pending[idx]++;
>>>>  	spin_unlock_irq(&conf->resync_lock);
>>>>  }
>>>>
>>>> -static void lower_barrier(struct r1conf *conf)
>>>> +static void lower_barrier(struct r1conf *conf, sector_t sector_nr)
>>>>  {
>>>>  	unsigned long flags;
>>>> -	BUG_ON(conf->barrier <= 0);
>>>> +	long idx = get_barrier_bucket_idx(sector_nr);
>>>> +
>>>> +	BUG_ON(conf->barrier[idx] <= 0);
>>>>  	spin_lock_irqsave(&conf->resync_lock, flags);
>>>> -	conf->barrier--;
>>>> -	conf->nr_pending--;
>>>> +	conf->barrier[idx]--;
>>>> +	conf->nr_pending[idx]--;
>>>>  	spin_unlock_irqrestore(&conf->resync_lock, flags);
>>>>  	wake_up(&conf->wait_barrier);
>>>>  }
>>>>
>>>> -static bool need_to_wait_for_sync(struct r1conf *conf, struct bio *bio)
>>>> +/* A regular I/O should wait when,
>>>> + * - The whole array is frozen (both READ and WRITE)
>>>> + * - bio is WRITE and in same barrier bucket conf->barrier[idx] raised
>>>> + */
>>>> +static void _wait_barrier(struct r1conf *conf, long idx)
>>>>  {
>>>> -	bool wait = false;
>>>> -
>>>> -	if (conf->array_frozen || !bio)
>>>> -		wait = true;
>>>> -	else if (conf->barrier && bio_data_dir(bio) == WRITE) {
>>>> -		if ((conf->mddev->curr_resync_completed
>>>> -		     >= bio_end_sector(bio)) ||
>>>> -		    (conf->next_resync + NEXT_NORMALIO_DISTANCE
>>>> -		     <= bio->bi_iter.bi_sector))
>>>> -			wait = false;
>>>> -		else
>>>> -			wait = true;
>>>> +	spin_lock_irq(&conf->resync_lock);
>>>> +	if (conf->array_frozen || conf->barrier[idx]) {
>>>> +		conf->nr_waiting[idx]++;
>>>> +		/* Wait for the barrier to drop. */
>>>> +		wait_event_lock_irq(
>>>> +			conf->wait_barrier,
>>>> +			!conf->array_frozen && !conf->barrier[idx],
>>>> +			conf->resync_lock);
>>>> +		conf->nr_waiting[idx]--;
>>>>  	}
>>>>
>>>> -	return wait;
>>>> +	conf->nr_pending[idx]++;
>>>> +	spin_unlock_irq(&conf->resync_lock);
>>>>  }
>>>>
>>>> -static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
>>>> +static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
>>>>  {
>>>> -	sector_t sector = 0;
>>>> +	long idx = get_barrier_bucket_idx(sector_nr);
>>>>
>>>>  	spin_lock_irq(&conf->resync_lock);
>>>> -	if (need_to_wait_for_sync(conf, bio)) {
>>>> -		conf->nr_waiting++;
>>>> -		/* Wait for the barrier to drop.
>>>> -		 * However if there are already pending
>>>> -		 * requests (preventing the barrier from
>>>> -		 * rising completely), and the
>>>> -		 * per-process bio queue isn't empty,
>>>> -		 * then don't wait, as we need to empty
>>>> -		 * that queue to allow conf->start_next_window
>>>> -		 * to increase.
>>>> -		 */
>>>> -		wait_event_lock_irq(conf->wait_barrier,
>>>> -				    !conf->array_frozen &&
>>>> -				    (!conf->barrier ||
>>>> -				     ((conf->start_next_window <
>>>> -				       conf->next_resync + RESYNC_SECTORS) &&
>>>> -				      current->bio_list &&
>>>> -				      !bio_list_empty(current->bio_list))),
>>>> -				    conf->resync_lock);
>>>> -		conf->nr_waiting--;
>>>> -	}
>>>> -
>>>> -	if (bio && bio_data_dir(bio) == WRITE) {
>>>> -		if (bio->bi_iter.bi_sector >= conf->next_resync) {
>>>> -			if (conf->start_next_window == MaxSector)
>>>> -				conf->start_next_window =
>>>> -					conf->next_resync +
>>>> -					NEXT_NORMALIO_DISTANCE;
>>>> -
>>>> -			if ((conf->start_next_window + NEXT_NORMALIO_DISTANCE)
>>>> -			    <= bio->bi_iter.bi_sector)
>>>> -				conf->next_window_requests++;
>>>> -			else
>>>> -				conf->current_window_requests++;
>>>> -			sector = conf->start_next_window;
>>>> -		}
>>>> +	if (conf->array_frozen) {
>>>> +		conf->nr_waiting[idx]++;
>>>> +		/* Wait for array to unfreeze */
>>>> +		wait_event_lock_irq(
>>>> +			conf->wait_barrier,
>>>> +			!conf->array_frozen,
>>>> +			conf->resync_lock);
>>>> +		conf->nr_waiting[idx]--;
>>>>  	}
>>>> -
>>>> -	conf->nr_pending++;
>>>> +	conf->nr_pending[idx]++;
>>>>  	spin_unlock_irq(&conf->resync_lock);
>>>> -	return sector;
>>>>  }
>>>>
>>>> -static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
>>>> -			  sector_t bi_sector)
>>>> +static void wait_barrier(struct r1conf *conf, sector_t sector_nr)
>>>> +{
>>>> +	long idx = get_barrier_bucket_idx(sector_nr);
>>>> +
>>>> +	_wait_barrier(conf, idx);
>>>> +}
>>>> +
>>>> +static void wait_all_barriers(struct r1conf *conf)
>>>> +{
>>>> +	long idx;
>>>> +
>>>> +	for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
>>>> +		_wait_barrier(conf, idx);
>>>
>>> This is racy. Since resync is sequential, ideally we only need to check the
>>> buckets sequentially. The compiler could do _wait_barrier(conf, 1) and then
>>> _wait_barrier(conf, 0). It's possible that bucket 1 gets a barrier right
>>> after the check of bucket 0. Even if the compiler doesn't reorder, there is
>>> a case where bucket 511 should be checked before bucket 0, if the resync is
>>> just crossing the 512 buckets. It would be better if we checked the resync
>>> position and adjusted the bucket wait order according to the position.
>>>
>>
>> wait_all_barriers() and allow_all_barriers() are used in close_sync() only,
>>  static void close_sync(struct r1conf *conf)
>>  {
>> -	wait_barrier(conf, NULL);
>> -	allow_barrier(conf, 0, 0);
>> +	wait_all_barriers(conf);
>> +	allow_all_barriers(conf);
>> [snip]
>>  }
>>
>> The wait_barrier() and allow_barrier() calls here are for synchronization,
>> to make sure that before close_sync() returns there is no resync I/O in
>> flight.
>>
>> close_sync() is called in raid1_sync_request() when resync I/O exceeds the
>> size of the raid1 device and resync should be closed. In this condition, no
>> new resync I/O will be generated. Once a conf->barrier[idx] reaches 0, it
>> won't increase before a new resync restarts. Therefore it is safe to call
>> _wait_barrier() one by one, until every conf->barrier[idx] reaches 0.
>>
>> This is a usage of wait/allow_barrier() which is not mentioned in the code.
>> They are not for regular I/O waiting on resync I/O; they are used as a
>> synchronization to make sure all in-flight resync I/O completes.
>
> thanks, this makes sense.
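
Maybe it is worth capturing that in a comment in the next version, so the
intent is also mentioned in the code, something like:

	/* wait_all_barriers() and allow_all_barriers() are only used by
	 * close_sync(). They are not about regular I/O waiting for resync
	 * I/O; they drain all in-flight resync I/O by walking the buckets
	 * until every conf->barrier[idx] has dropped to 0. close_sync() is
	 * only called after resync has passed the end of the device, so no
	 * new resync I/O can be raised while we walk the buckets.
	 */
	static void wait_all_barriers(struct r1conf *conf)
	{
		long idx;

		for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
			_wait_barrier(conf, idx);
	}
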
>>>>  	int good_sectors = RESYNC_SECTORS;
>>>>  	int min_bad = 0; /* number of sectors that are bad in all devices */
>>>> +	long idx = get_barrier_bucket_idx(sector_nr);
>>>>
>>>>  	if (!conf->r1buf_pool)
>>>>  		if (init_resync(conf))
>>>> @@ -2535,7 +2571,7 @@ static sector_t raid1_sync_request(struc
>>>>  	 * If there is non-resync activity waiting for a turn, then let it
>>>>  	 * though before starting on this new sync request.
>>>>  	 */
>>>> -	if (conf->nr_waiting)
>>>> +	if (conf->nr_waiting[idx])
>>>>  		schedule_timeout_uninterruptible(1);
>>>>
>>>>  	/* we are incrementing sector_nr below. To be safe, we check against
>>>> @@ -2562,6 +2598,8 @@ static sector_t raid1_sync_request(struc
>>>>  	r1_bio->sector = sector_nr;
>>>>  	r1_bio->state = 0;
>>>>  	set_bit(R1BIO_IsSync, &r1_bio->state);
>>>> +	/* make sure good_sectors won't go across barrier unit border */
>>>> +	good_sectors = align_to_barrier_unit_end(sector_nr, good_sectors);
>>>>
>>>>  	for (i = 0; i < conf->raid_disks * 2; i++) {
>>>>  		struct md_rdev *rdev;
>>>> @@ -2786,6 +2824,22 @@ static struct r1conf *setup_conf(struct
>>>>  	if (!conf)
>>>>  		goto abort;
>>>>
>>>> +	conf->nr_pending = kzalloc(PAGE_SIZE, GFP_KERNEL);
>>>> +	if (!conf->nr_pending)
>>>> +		goto abort;
>>>
>>> This allocation is a little weird. I'd define a count and use
>>> sizeof(nr_pending) * count to do the allocation. There is nothing related
>>> to PAGE_SIZE. Alternatively, just make nr_pending an array in r1conf.
>>
>> This is just for better memory usage: r1conf is allocated by the slab
>> allocator, and a large, unaligned size may cause internal memory
>> fragmentation inside the slab's pages. Calling kzalloc() with PAGE_SIZE
>> avoids such an issue.
>
> Not sure if slab has a close-enough slab size for this, but that's not
> important. I think sizeof(nr_pending) * count makes more sense here, even if
> sizeof(nr_pending) * count == PAGE_SIZE.
>
Okay, I will move these counters into r1conf in the next version.

Thanks for your review and suggestions!

Coly
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html