On Wed, Sep 04, 2013 at 04:41:32PM +1000, NeilBrown wrote:
> On Tue, 3 Sep 2013 15:02:28 +0800 Shaohua Li <shli@xxxxxxxxxx> wrote:
>
> > On Tue, Sep 03, 2013 at 04:08:58PM +1000, NeilBrown wrote:
> > > On Wed, 28 Aug 2013 14:39:53 +0800 Shaohua Li <shli@xxxxxxxxxx> wrote:
> > >
> > > > On Wed, Aug 28, 2013 at 02:32:52PM +1000, NeilBrown wrote:
> > > > > On Tue, 27 Aug 2013 16:53:30 +0800 Shaohua Li <shli@xxxxxxxxxx> wrote:
> > > > >
> > > > > > On Tue, Aug 27, 2013 at 01:17:52PM +1000, NeilBrown wrote:
> > > > > > >
> > > > > > > Then get_active_stripe wouldn't need to worry about device_lock at all and
> > > > > > > would only need to get the hash lock for the particular sector. That should
> > > > > > > make it a lot simpler.
> > > > > >
> > > > > > did you mean get_active_stripe() doesn't need device_lock for any code path?
> > > > > > How could it be safe? device_lock still protects something like handle_list,
> > > > > > delayed_list, which release_stripe() will use while a get_active_stripe can run
> > > > > > concurrently.
> > > > >
> > > > > Yes you will still need device_lock to protect list_del_init(&sh->lru),
> > > > > as well as the hash lock.
> > > > > Do you need device_lock anywhere else in there?
> > > >
> > > > That's what I mean. So I need get both device_lock and hash_lock. To not
> > > > deadlock, I need release hash_lock and relock device_lock/hash_lock. Since I
> > > > release lock, I need recheck if I can find the stripe in hash again. So the
> > > > seqcount locking doesn't simplify things here. I thought the seqlock only fixes
> > > > one race. Did I miss anything?
> > >
> > > Can you order the locks so that you take the hash_lock first, then the
> > > device_lock? That would be a lot simpler.
> >
> > Looks impossible. For example, in handle_active_stripes() we release several
> > stripes, we can't take hash_lock first.
>
> "impossible" just takes a little longer :-)
>
> do_release_stripe gets called with only device_lock held. It gets passed an
> (initially) empty list_head too.
> If it wants to add the stripe to an inactive list it puts it on the given
> list_head instead.
>
> release_stripe(), after calling do_release_stripe() calls some function to
> grab the appropriate hash_lock for each stripe in the list_head and add it
> to that inactive list.
>
> release_stripe_list() might collect some stripes from __release_stripe
> that need to go on an inactive list. It arranges for them to be put on the
> right list, with the right lock, next time device_lock is dropped. That
> might be in handle_active_stripes()
>
> activate_bit_delay might similarly collect stripes, which are handled the
> same way as those collected by release_stripe_list.
> etc.
>
> i.e. the hash_locks protect the various inactive lists. device_lock protects
> all the others. If we need to add something to an inactive list while
> holding device_lock we delay until device_lock can be dropped.

Alright, this option works, but we need to allocate some space, which isn't
very good for the unplug cb. Below is the patch I tested.

Thanks,
Shaohua

Subject: raid5: relieve lock contention in get_active_stripe()

get_active_stripe() is the last place we have lock contention. It has two
paths. One is that the stripe isn't found and a new stripe is allocated; the
other is that the stripe is found.

The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
for stripe_hashtbl and inactive_list, these fields are changed very rarely.

With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by the stripe's lock_hash. Note that even a stripe without a sector assigned
has a lock_hash assigned. A stripe's inactive list is protected by a hash
lock, which is determined by its lock_hash too. The lock_hash is derived from
the current stripe_hashtbl hash, which guarantees that any stripe_hashtbl list
will be assigned to a specific lock_hash, so we can use the new hash lock to
protect the stripe_hashtbl list too. The goal of the new hash locks is that
only the new locks are needed in the first path of get_active_stripe(). Since
we have several hash locks, lock contention is relieved significantly.

The first path of get_active_stripe() accesses other fields; since they are
changed rarely, changing them now needs to take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.

If we need to lock both device_lock and a hash lock, we always lock the hash
lock first. The tricky part is release_stripe and friends. We need to take
device_lock first. Neil's suggestion is that we put inactive stripes on a
temporary list and re-add them to inactive_list after device_lock is released.
In this way, we add stripes to the temporary list with device_lock held and
remove stripes from the list with the hash lock held. So we don't allow
concurrent access to the temporary list, which means we need to allocate a
temporary list for every participant of release_stripe.

One downside is that free stripes are maintained in their own inactive list;
they can't move between the lists. By default, we have 256 stripes in total
and 8 lists, so each list will have 32 stripes. It's possible that one list
has a free stripe but another list hasn't. This should be rare because stripe
allocation is evenly distributed. And we can always allocate more stripes for
the cache; several megabytes of memory isn't a big deal.

This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need to take two locks, but since the hash lock isn't
contended, the overhead should be quite small (several atomic instructions).
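
As an aside, the arithmetic behind "256 stripes and 8 lists, so each list
will have 32 stripes" can be checked with a small userspace sketch. It is not
part of the patch. The hash expression and the round-robin assignment are
copied from stripe_hash_locks_hash() and grow_stripes() below;
STRIPE_SHIFT = 3 (4 KiB pages) and the helper distribute_stripes() are
assumptions made only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Constants as defined in the patch's raid5.h hunk. */
#define NR_STRIPE_HASH_LOCKS 8
#define STRIPE_HASH_LOCKS_MASK (NR_STRIPE_HASH_LOCKS - 1)
/* Assumption for this sketch: 4 KiB pages, i.e. PAGE_SHIFT - 9 == 3. */
#define STRIPE_SHIFT 3

/* Same expression as stripe_hash_locks_hash() in the patch. */
static int stripe_hash_locks_hash(uint64_t sect)
{
	return (int)((sect >> STRIPE_SHIFT) & STRIPE_HASH_LOCKS_MASK);
}

/*
 * Hypothetical helper: hand out `num` stripes round-robin, the way the
 * patched grow_stripes() advances `hash`, and count stripes per list.
 */
static void distribute_stripes(int num, int per_list[NR_STRIPE_HASH_LOCKS])
{
	int hash = 0;

	assert(num >= 0);
	for (int i = 0; i < NR_STRIPE_HASH_LOCKS; i++)
		per_list[i] = 0;
	while (num--) {
		per_list[hash]++;
		hash = (hash + 1) % NR_STRIPE_HASH_LOCKS;
	}
}
```

Calling distribute_stripes(256, per_list) leaves 32 in every slot, and
consecutive stripe-sized sector ranges (sectors 0-7, 8-15, ...) map to
consecutive hash locks under the assumed STRIPE_SHIFT.
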
The second path of get_active_stripe() (basically sequential write or big
request size randwrite) still has lock contentions.

Signed-off-by: Shaohua Li <shli@xxxxxxxxxxxx>
---
 drivers/md/raid5.c |  346 ++++++++++++++++++++++++++++++++++++++++-------------
 drivers/md/raid5.h |   11 +
 2 files changed, 276 insertions(+), 81 deletions(-)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2013-09-05 08:23:42.187851834 +0800
+++ linux/drivers/md/raid5.c	2013-09-05 12:52:47.581235145 +0800
@@ -86,6 +86,67 @@ static inline struct hlist_head *stripe_
 	return &conf->stripe_hashtbl[hash];
 }
 
+static inline int stripe_hash_locks_hash(sector_t sect)
+{
+	return (sect >> STRIPE_SHIFT) & STRIPE_HASH_LOCKS_MASK;
+}
+
+static inline void lock_device_hash_lock(struct r5conf *conf, int hash)
+{
+	spin_lock_irq(conf->hash_locks + hash);
+	spin_lock(&conf->device_lock);
+}
+
+static inline void unlock_device_hash_lock(struct r5conf *conf, int hash)
+{
+	spin_unlock(&conf->device_lock);
+	spin_unlock_irq(conf->hash_locks + hash);
+}
+
+static void __lock_all_hash_locks(struct r5conf *conf)
+{
+	int i;
+	for (i = 0; i < NR_STRIPE_HASH_LOCKS; i++)
+		spin_lock(conf->hash_locks + i);
+}
+
+static void __unlock_all_hash_locks(struct r5conf *conf)
+{
+	int i;
+	for (i = NR_STRIPE_HASH_LOCKS; i; i--)
+		spin_unlock(conf->hash_locks + i - 1);
+}
+
+static inline void lock_all_device_hash_locks_irq(struct r5conf *conf)
+{
+	local_irq_disable();
+	__lock_all_hash_locks(conf);
+	spin_lock(&conf->device_lock);
+}
+
+static inline void unlock_all_device_hash_locks_irq(struct r5conf *conf)
+{
+	spin_unlock(&conf->device_lock);
+	__unlock_all_hash_locks(conf);
+	local_irq_enable();
+}
+
+static inline void lock_all_device_hash_locks_irqsave(struct r5conf *conf,
+	unsigned long *flags)
+{
+	local_irq_save(*flags);
+	__lock_all_hash_locks(conf);
+	spin_lock(&conf->device_lock);
+}
+
+static inline void unlock_all_device_hash_locks_irqrestore(struct r5conf *conf,
+	unsigned long *flags)
+{
+	spin_unlock(&conf->device_lock);
+	__unlock_all_hash_locks(conf);
+	local_irq_restore(*flags);
+}
+
 /* bio's attached to a stripe+device for I/O are linked together in bi_sector
  * order without overlap.  There may be several bio's per stripe+device, and
  * a bio could span several devices.
@@ -250,7 +311,8 @@ static void raid5_wakeup_stripe_thread(s
 	}
 }
 
-static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh)
+static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,
+	struct list_head *temp_inactive_list)
 {
 	BUG_ON(!list_empty(&sh->lru));
 	BUG_ON(atomic_read(&conf->active_stripes)==0);
@@ -279,19 +341,59 @@ static void do_release_stripe(struct r5c
 			    < IO_THRESHOLD)
 				md_wakeup_thread(conf->mddev->thread);
 		atomic_dec(&conf->active_stripes);
-		if (!test_bit(STRIPE_EXPANDING, &sh->state)) {
-			list_add_tail(&sh->lru, &conf->inactive_list);
-			wake_up(&conf->wait_for_stripe);
-			if (conf->retry_read_aligned)
-				md_wakeup_thread(conf->mddev->thread);
-		}
+		if (!test_bit(STRIPE_EXPANDING, &sh->state))
+			list_add_tail(&sh->lru, temp_inactive_list);
 	}
 }
 
-static void __release_stripe(struct r5conf *conf, struct stripe_head *sh)
+static void __release_stripe(struct r5conf *conf, struct stripe_head *sh,
+	struct list_head *temp_inactive_list)
 {
 	if (atomic_dec_and_test(&sh->count))
-		do_release_stripe(conf, sh);
+		do_release_stripe(conf, sh, temp_inactive_list);
+}
+
+/*
+ * @hash could be NR_STRIPE_HASH_LOCKS, then we have a list of inactive_list
+ *
+ * Be careful: Only one task can add/delete stripes from temp_inactive_list at
+ * given time. Adding stripes only takes device lock, while deleting stripes
+ * only takes hash lock.
+ */
+static void release_inactive_stripe_list(struct r5conf *conf,
+	struct list_head *temp_inactive_list, int hash)
+{
+	int size;
+	bool do_wakeup = false;
+	unsigned long flags;
+
+	if (hash == NR_STRIPE_HASH_LOCKS) {
+		size = NR_STRIPE_HASH_LOCKS;
+		hash = NR_STRIPE_HASH_LOCKS - 1;
+	} else
+		size = 1;
+	while (size) {
+		struct list_head *list = &temp_inactive_list[size - 1];
+
+		/*
+		 * We don't hold any lock here yet, get_active_stripe() might
+		 * remove stripes from the list
+		 */
+		if (!list_empty_careful(list)) {
+			spin_lock_irqsave(conf->hash_locks + hash, flags);
+			list_splice_tail_init(list, conf->inactive_list + hash);
+			do_wakeup = true;
+			spin_unlock_irqrestore(conf->hash_locks + hash, flags);
+		}
+		size--;
+		hash--;
+	}
+
+	if (do_wakeup) {
+		wake_up(&conf->wait_for_stripe);
+		if (conf->retry_read_aligned)
+			md_wakeup_thread(conf->mddev->thread);
+	}
 }
 
 static struct llist_node *llist_reverse_order(struct llist_node *head)
@@ -309,7 +411,8 @@ static struct llist_node *llist_reverse_
 }
 
 /* should hold conf->device_lock already */
-static int release_stripe_list(struct r5conf *conf)
+static int release_stripe_list(struct r5conf *conf,
+	struct list_head *temp_inactive_list)
 {
 	struct stripe_head *sh;
 	int count = 0;
@@ -318,6 +421,8 @@ static int release_stripe_list(struct r5
 	head = llist_del_all(&conf->released_stripes);
 	head = llist_reverse_order(head);
 	while (head) {
+		int hash;
+
 		sh = llist_entry(head, struct stripe_head, release_list);
 		head = llist_next(head);
 		/* sh could be readded after STRIPE_ON_RELEASE_LIST is cleard */
@@ -328,7 +433,8 @@ static int release_stripe_list(struct r5
 		 * again, the count is always > 1. This is true for
 		 * STRIPE_ON_UNPLUG_LIST bit too.
 		 */
-		__release_stripe(conf, sh);
+		hash = sh->hash_lock_index;
+		__release_stripe(conf, sh, &temp_inactive_list[hash]);
 		count++;
 	}
 
@@ -339,6 +445,8 @@ static void release_stripe(struct stripe
 {
 	struct r5conf *conf = sh->raid_conf;
 	unsigned long flags;
+	struct list_head list;
+	int hash;
 	bool wakeup;
 
 	if (test_and_set_bit(STRIPE_ON_RELEASE_LIST, &sh->state))
@@ -351,8 +459,11 @@ slow_path:
 	local_irq_save(flags);
 	/* we are ok here if STRIPE_ON_RELEASE_LIST is set or not */
 	if (atomic_dec_and_lock(&sh->count, &conf->device_lock)) {
-		do_release_stripe(conf, sh);
+		INIT_LIST_HEAD(&list);
+		hash = sh->hash_lock_index;
+		do_release_stripe(conf, sh, &list);
 		spin_unlock(&conf->device_lock);
+		release_inactive_stripe_list(conf, &list, hash);
 	}
 	local_irq_restore(flags);
 }
@@ -377,18 +488,19 @@ static inline void insert_hash(struct r5
 
 
 /* find an idle stripe, make sure it is unhashed, and return it. */
-static struct stripe_head *get_free_stripe(struct r5conf *conf)
+static struct stripe_head *get_free_stripe(struct r5conf *conf, int hash)
 {
 	struct stripe_head *sh = NULL;
 	struct list_head *first;
 
-	if (list_empty(&conf->inactive_list))
+	if (list_empty(conf->inactive_list + hash))
 		goto out;
-	first = conf->inactive_list.next;
+	first = (conf->inactive_list + hash)->next;
 	sh = list_entry(first, struct stripe_head, lru);
 	list_del_init(first);
 	remove_hash(sh);
 	atomic_inc(&conf->active_stripes);
+	BUG_ON(hash != sh->hash_lock_index);
 out:
 	return sh;
 }
@@ -567,33 +679,35 @@ get_active_stripe(struct r5conf *conf, s
 		  int previous, int noblock, int noquiesce)
 {
 	struct stripe_head *sh;
+	int hash = stripe_hash_locks_hash(sector);
 
 	pr_debug("get_stripe, sector %llu\n", (unsigned long long)sector);
 
-	spin_lock_irq(&conf->device_lock);
+	spin_lock_irq(conf->hash_locks + hash);
 
 	do {
 		wait_event_lock_irq(conf->wait_for_stripe,
 				    conf->quiesce == 0 || noquiesce,
-				    conf->device_lock);
+				    *(conf->hash_locks + hash));
 		sh = __find_stripe(conf, sector, conf->generation - previous);
 		if (!sh) {
-			if (!conf->inactive_blocked)
-				sh = get_free_stripe(conf);
+			sh = get_free_stripe(conf, hash);
 			if (noblock && sh == NULL)
 				break;
 			if (!sh) {
 				conf->inactive_blocked = 1;
 				wait_event_lock_irq(conf->wait_for_stripe,
-					    !list_empty(&conf->inactive_list) &&
-					    (atomic_read(&conf->active_stripes)
-					     < (conf->max_nr_stripes *3/4)
-					     || !conf->inactive_blocked),
-					    conf->device_lock);
+					    !list_empty(conf->inactive_list + hash) &&
+					    (atomic_read(&conf->active_stripes)
+					     < (conf->max_nr_stripes * 3 / 4)
+					     || !conf->inactive_blocked),
+					    *(conf->hash_locks + hash));
 				conf->inactive_blocked = 0;
 			} else
 				init_stripe(sh, sector, previous);
 		} else {
+			spin_lock(&conf->device_lock);
+
 			if (atomic_read(&sh->count)) {
 				BUG_ON(!list_empty(&sh->lru)
 				    && !test_bit(STRIPE_EXPANDING, &sh->state)
@@ -611,13 +725,14 @@ get_active_stripe(struct r5conf *conf, s
 					sh->group = NULL;
 				}
 			}
+			spin_unlock(&conf->device_lock);
 		}
 	} while (sh == NULL);
 
 	if (sh)
 		atomic_inc(&sh->count);
 
-	spin_unlock_irq(&conf->device_lock);
+	spin_unlock_irq(conf->hash_locks + hash);
 	return sh;
 }
 
@@ -1585,7 +1700,7 @@ static void raid_run_ops(struct stripe_h
 	put_cpu();
 }
 
-static int grow_one_stripe(struct r5conf *conf)
+static int grow_one_stripe(struct r5conf *conf, int hash)
 {
 	struct stripe_head *sh;
 	sh = kmem_cache_zalloc(conf->slab_cache, GFP_KERNEL);
@@ -1601,11 +1716,13 @@ static int grow_one_stripe(struct r5conf
 		kmem_cache_free(conf->slab_cache, sh);
 		return 0;
 	}
+	sh->hash_lock_index = hash;
 	/* we just created an active stripe so... */
 	atomic_set(&sh->count, 1);
 	atomic_inc(&conf->active_stripes);
 	INIT_LIST_HEAD(&sh->lru);
 	release_stripe(sh);
+	conf->max_hash_nr_stripes[hash]++;
 	return 1;
 }
 
@@ -1613,6 +1730,7 @@ static int grow_stripes(struct r5conf *c
 {
 	struct kmem_cache *sc;
 	int devs = max(conf->raid_disks, conf->previous_raid_disks);
+	int hash;
 
 	if (conf->mddev->gendisk)
 		sprintf(conf->cache_name[0],
@@ -1630,9 +1748,12 @@ static int grow_stripes(struct r5conf *c
 		return 1;
 	conf->slab_cache = sc;
 	conf->pool_size = devs;
-	while (num--)
-		if (!grow_one_stripe(conf))
+	hash = 0;
+	while (num--) {
+		if (!grow_one_stripe(conf, hash))
 			return 1;
+		hash = (hash + 1) % NR_STRIPE_HASH_LOCKS;
+	}
 	return 0;
 }
 
@@ -1690,6 +1811,7 @@ static int resize_stripes(struct r5conf
 	int err;
 	struct kmem_cache *sc;
 	int i;
+	int hash, cnt;
 
 	if (newsize <= conf->pool_size)
 		return 0; /* never bother to shrink */
@@ -1729,19 +1851,28 @@ static int resize_stripes(struct r5conf
 	 * OK, we have enough stripes, start collecting inactive
 	 * stripes and copying them over
 	 */
+	hash = 0;
+	cnt = 0;
 	list_for_each_entry(nsh, &newstripes, lru) {
-		spin_lock_irq(&conf->device_lock);
-		wait_event_lock_irq(conf->wait_for_stripe,
-				    !list_empty(&conf->inactive_list),
-				    conf->device_lock);
-		osh = get_free_stripe(conf);
-		spin_unlock_irq(&conf->device_lock);
+		lock_device_hash_lock(conf, hash);
+		wait_event_cmd(conf->wait_for_stripe,
+				    !list_empty(conf->inactive_list + hash),
+				    unlock_device_hash_lock(conf, hash),
+				    lock_device_hash_lock(conf, hash));
+		osh = get_free_stripe(conf, hash);
+		unlock_device_hash_lock(conf, hash);
 		atomic_set(&nsh->count, 1);
 		for(i=0; i<conf->pool_size; i++)
 			nsh->dev[i].page = osh->dev[i].page;
 		for( ; i<newsize; i++)
 			nsh->dev[i].page = NULL;
+		nsh->hash_lock_index = hash;
 		kmem_cache_free(conf->slab_cache, osh);
+		cnt++;
+		if (cnt >= conf->max_hash_nr_stripes[hash]) {
+			hash++;
+			cnt = 0;
+		}
 	}
 	kmem_cache_destroy(conf->slab_cache);
 
@@ -1800,13 +1931,14 @@ static int resize_stripes(struct r5conf
 	return err;
 }
 
-static int drop_one_stripe(struct r5conf *conf)
+static int drop_one_stripe(struct r5conf *conf, int hash)
 {
 	struct stripe_head *sh;
 
-	spin_lock_irq(&conf->device_lock);
-	sh = get_free_stripe(conf);
-	spin_unlock_irq(&conf->device_lock);
+	spin_lock_irq(conf->hash_locks + hash);
+	sh = get_free_stripe(conf, hash);
+	conf->max_hash_nr_stripes[hash]--;
+	spin_unlock_irq(conf->hash_locks + hash);
 	if (!sh)
 		return 0;
 	BUG_ON(atomic_read(&sh->count));
@@ -1818,8 +1950,10 @@ static int drop_one_stripe(struct r5conf
 
 static void shrink_stripes(struct r5conf *conf)
 {
-	while (drop_one_stripe(conf))
-		;
+	int hash;
+	for (hash = 0; hash < NR_STRIPE_HASH_LOCKS; hash++)
+		while (drop_one_stripe(conf, hash))
+			;
 
 	if (conf->slab_cache)
 		kmem_cache_destroy(conf->slab_cache);
@@ -2048,10 +2182,10 @@ static void error(struct mddev *mddev, s
 	unsigned long flags;
 	pr_debug("raid456: error called\n");
 
-	spin_lock_irqsave(&conf->device_lock, flags);
+	lock_all_device_hash_locks_irqsave(conf, &flags);
 	clear_bit(In_sync, &rdev->flags);
 	mddev->degraded = calc_degraded(conf);
-	spin_unlock_irqrestore(&conf->device_lock, flags);
+	unlock_all_device_hash_locks_irqrestore(conf, &flags);
 	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
 
 	set_bit(Blocked, &rdev->flags);
@@ -3895,7 +4029,8 @@ static void raid5_activate_delayed(struc
 	}
 }
 
-static void activate_bit_delay(struct r5conf *conf)
+static void activate_bit_delay(struct r5conf *conf,
+	struct list_head *temp_inactive_list)
 {
 	/* device_lock is held */
 	struct list_head head;
@@ -3903,9 +4038,11 @@ static void activate_bit_delay(struct r5
 	list_del_init(&conf->bitmap_list);
 	while (!list_empty(&head)) {
 		struct stripe_head *sh = list_entry(head.next, struct stripe_head, lru);
+		int hash;
 		list_del_init(&sh->lru);
 		atomic_inc(&sh->count);
-		__release_stripe(conf, sh);
+		hash = sh->hash_lock_index;
+		__release_stripe(conf, sh, &temp_inactive_list[hash]);
 	}
 }
 
@@ -3921,7 +4058,7 @@ int md_raid5_congested(struct mddev *mdd
 		return 1;
 	if (conf->quiesce)
 		return 1;
-	if (list_empty_careful(&conf->inactive_list))
+	if (atomic_read(&conf->active_stripes) == conf->max_nr_stripes)
 		return 1;
 
 	return 0;
@@ -4251,6 +4388,7 @@ static struct stripe_head *__get_priorit
 struct raid5_plug_cb {
 	struct blk_plug_cb	cb;
 	struct list_head	list;
+	struct list_head	temp_inactive_list[NR_STRIPE_HASH_LOCKS];
 };
 
 static void raid5_unplug(struct blk_plug_cb *blk_cb, bool from_schedule)
@@ -4261,6 +4399,7 @@ static void raid5_unplug(struct blk_plug
 	struct mddev *mddev = cb->cb.data;
 	struct r5conf *conf = mddev->private;
 	int cnt = 0;
+	int hash;
 
 	if (cb->list.next && !list_empty(&cb->list)) {
 		spin_lock_irq(&conf->device_lock);
@@ -4278,11 +4417,14 @@ static void raid5_unplug(struct blk_plug
 			 * STRIPE_ON_RELEASE_LIST could be set here. In that
 			 * case, the count is always > 1 here
 			 */
-			__release_stripe(conf, sh);
+			hash = sh->hash_lock_index;
+			__release_stripe(conf, sh, &cb->temp_inactive_list[hash]);
 			cnt++;
 		}
 		spin_unlock_irq(&conf->device_lock);
 	}
+	release_inactive_stripe_list(conf, cb->temp_inactive_list,
+				     NR_STRIPE_HASH_LOCKS);
 	if (mddev->queue)
 		trace_block_unplug(mddev->queue, cnt, !from_schedule);
 	kfree(cb);
@@ -4303,8 +4445,12 @@ static void release_stripe_plug(struct m
 
 	cb = container_of(blk_cb, struct raid5_plug_cb, cb);
 
-	if (cb->list.next == NULL)
+	if (cb->list.next == NULL) {
+		int i;
 		INIT_LIST_HEAD(&cb->list);
+		for (i = 0; i < NR_STRIPE_HASH_LOCKS; i++)
+			INIT_LIST_HEAD(cb->temp_inactive_list + i);
+	}
 
 	if (!test_and_set_bit(STRIPE_ON_UNPLUG_LIST, &sh->state))
 		list_add_tail(&sh->lru, &cb->list);
@@ -4949,27 +5095,45 @@ static int retry_aligned_read(struct r5
 }
 
 static int handle_active_stripes(struct r5conf *conf, int group,
-				 struct r5worker *worker)
+				 struct r5worker *worker,
+				 struct list_head *temp_inactive_list)
 {
 	struct stripe_head *batch[MAX_STRIPE_BATCH], *sh;
-	int i, batch_size = 0;
+	int i, batch_size = 0, hash;
+	bool release_inactive = false;
 
 	while (batch_size < MAX_STRIPE_BATCH &&
 			(sh = __get_priority_stripe(conf, group)) != NULL)
 		batch[batch_size++] = sh;
 
-	if (batch_size == 0)
-		return batch_size;
+	if (batch_size == 0) {
+		for (i = 0; i < NR_STRIPE_HASH_LOCKS; i++)
+			if (!list_empty(temp_inactive_list + i))
+				break;
+		if (i == NR_STRIPE_HASH_LOCKS)
+			return batch_size;
+		release_inactive = true;
+	}
 	spin_unlock_irq(&conf->device_lock);
 
+	release_inactive_stripe_list(conf, temp_inactive_list,
+				     NR_STRIPE_HASH_LOCKS);
+
+	if (release_inactive) {
+		spin_lock_irq(&conf->device_lock);
+		return 0;
+	}
+
 	for (i = 0; i < batch_size; i++)
 		handle_stripe(batch[i]);
 
 	cond_resched();
 
 	spin_lock_irq(&conf->device_lock);
-	for (i = 0; i < batch_size; i++)
-		__release_stripe(conf, batch[i]);
+	for (i = 0; i < batch_size; i++) {
+		hash = batch[i]->hash_lock_index;
+		__release_stripe(conf, batch[i], &temp_inactive_list[hash]);
+	}
 	return batch_size;
 }
 
@@ -4990,9 +5154,10 @@ static void raid5_do_work(struct work_st
 	while (1) {
 		int batch_size, released;
 
-		released = release_stripe_list(conf);
+		released = release_stripe_list(conf, worker->temp_inactive_list);
 
-		batch_size = handle_active_stripes(conf, group_id, worker);
+		batch_size = handle_active_stripes(conf, group_id, worker,
+						   worker->temp_inactive_list);
 		worker->working = false;
 		if (!batch_size && !released)
 			break;
@@ -5031,7 +5196,7 @@ static void raid5d(struct md_thread *thr
 		struct bio *bio;
 		int batch_size, released;
 
-		released = release_stripe_list(conf);
+		released = release_stripe_list(conf, conf->temp_inactive_list);
 
 		if (
 		    !list_empty(&conf->bitmap_list)) {
@@ -5041,7 +5206,7 @@ static void raid5d(struct md_thread *thr
 			bitmap_unplug(mddev->bitmap);
 			spin_lock_irq(&conf->device_lock);
 			conf->seq_write = conf->seq_flush;
-			activate_bit_delay(conf);
+			activate_bit_delay(conf, conf->temp_inactive_list);
 		}
 		raid5_activate_delayed(conf);
 
@@ -5055,7 +5220,8 @@ static void raid5d(struct md_thread *thr
 			handled++;
 		}
 
-		batch_size = handle_active_stripes(conf, ANY_GROUP, NULL);
+		batch_size = handle_active_stripes(conf, ANY_GROUP, NULL,
+						   conf->temp_inactive_list);
 		if (!batch_size && !released)
 			break;
 		handled += batch_size;
@@ -5091,22 +5257,28 @@ raid5_set_cache_size(struct mddev *mddev
 {
 	struct r5conf *conf = mddev->private;
 	int err;
+	int hash;
 	if (size <= 16 || size > 32768)
 		return -EINVAL;
+	size = round_up(size, NR_STRIPE_HASH_LOCKS);
+	hash = 0;
 	while (size < conf->max_nr_stripes) {
-		if (drop_one_stripe(conf))
+		if (drop_one_stripe(conf, hash))
 			conf->max_nr_stripes--;
 		else
 			break;
+		hash = (hash + 1) % NR_STRIPE_HASH_LOCKS;
 	}
 	err = md_allow_write(mddev);
 	if (err)
 		return err;
+	hash = 0;
 	while (size > conf->max_nr_stripes) {
-		if (grow_one_stripe(conf))
+		if (grow_one_stripe(conf, hash))
 			conf->max_nr_stripes++;
 		else
 			break;
+		hash = (hash + 1) % NR_STRIPE_HASH_LOCKS;
 	}
 	return 0;
 }
@@ -5257,7 +5429,7 @@ static struct attribute_group raid5_attr
 
 static int alloc_thread_groups(struct r5conf *conf, int cnt)
 {
-	int i, j;
+	int i, j, k;
 	ssize_t size;
 	struct r5worker *workers;
 
@@ -5287,8 +5459,12 @@ static int alloc_thread_groups(struct r5
 		group->workers = workers + i * cnt;
 
 		for (j = 0; j < cnt; j++) {
-			group->workers[j].group = group;
-			INIT_WORK(&group->workers[j].work, raid5_do_work);
+			struct r5worker *worker = group->workers + j;
+			worker->group = group;
+			INIT_WORK(&worker->work, raid5_do_work);
+
+			for (k = 0; k < NR_STRIPE_HASH_LOCKS; k++)
+				INIT_LIST_HEAD(worker->temp_inactive_list + k);
 		}
 	}
 
@@ -5439,6 +5615,7 @@ static struct r5conf *setup_conf(struct
 	struct md_rdev *rdev;
 	struct disk_info *disk;
 	char pers_name[6];
+	int i;
 
 	if (mddev->new_level != 5
 	    && mddev->new_level != 4
@@ -5483,7 +5660,6 @@ static struct r5conf *setup_conf(struct
 	INIT_LIST_HEAD(&conf->hold_list);
 	INIT_LIST_HEAD(&conf->delayed_list);
 	INIT_LIST_HEAD(&conf->bitmap_list);
-	INIT_LIST_HEAD(&conf->inactive_list);
 	init_llist_head(&conf->released_stripes);
 	atomic_set(&conf->active_stripes, 0);
 	atomic_set(&conf->preread_active_stripes, 0);
@@ -5509,6 +5685,15 @@ static struct r5conf *setup_conf(struct
 	if ((conf->stripe_hashtbl = kzalloc(PAGE_SIZE, GFP_KERNEL)) == NULL)
 		goto abort;
 
+	for (i = 0; i < NR_STRIPE_HASH_LOCKS; i++)
+		spin_lock_init(conf->hash_locks + i);
+
+	for (i = 0; i < NR_STRIPE_HASH_LOCKS; i++)
+		INIT_LIST_HEAD(conf->inactive_list + i);
+
+	for (i = 0; i < NR_STRIPE_HASH_LOCKS; i++)
+		INIT_LIST_HEAD(conf->temp_inactive_list + i);
+
 	conf->level = mddev->new_level;
 	if (raid5_alloc_percpu(conf) != 0)
 		goto abort;
@@ -6034,9 +6219,9 @@ static int raid5_spare_active(struct mdd
 			sysfs_notify_dirent_safe(tmp->rdev->sysfs_state);
 		}
 	}
-	spin_lock_irqsave(&conf->device_lock, flags);
+	lock_all_device_hash_locks_irqsave(conf, &flags);
 	mddev->degraded = calc_degraded(conf);
-	spin_unlock_irqrestore(&conf->device_lock, flags);
+	unlock_all_device_hash_locks_irqrestore(conf, &flags);
 	print_raid5_conf(conf);
 	return count;
 }
@@ -6347,9 +6532,9 @@ static int raid5_start_reshape(struct md
 		 * ->degraded is measured against the larger of the
 		 * pre and post number of devices.
 		 */
-		spin_lock_irqsave(&conf->device_lock, flags);
+		lock_all_device_hash_locks_irqsave(conf, &flags);
 		mddev->degraded = calc_degraded(conf);
-		spin_unlock_irqrestore(&conf->device_lock, flags);
+		unlock_all_device_hash_locks_irqrestore(conf, &flags);
 	}
 	mddev->raid_disks = conf->raid_disks;
 	mddev->reshape_position = conf->reshape_progress;
@@ -6363,14 +6548,14 @@ static int raid5_start_reshape(struct md
 						"reshape");
 	if (!mddev->sync_thread) {
 		mddev->recovery = 0;
-		spin_lock_irq(&conf->device_lock);
+		lock_all_device_hash_locks_irq(conf);
 		mddev->raid_disks = conf->raid_disks = conf->previous_raid_disks;
 		rdev_for_each(rdev, mddev)
 			rdev->new_data_offset = rdev->data_offset;
 		smp_wmb();
 		conf->reshape_progress = MaxSector;
 		mddev->reshape_position = MaxSector;
-		spin_unlock_irq(&conf->device_lock);
+		unlock_all_device_hash_locks_irq(conf);
 		return -EAGAIN;
 	}
 	conf->reshape_checkpoint = jiffies;
@@ -6388,13 +6573,13 @@ static void end_reshape(struct r5conf *c
 	if (!test_bit(MD_RECOVERY_INTR, &conf->mddev->recovery)) {
 		struct md_rdev *rdev;
 
-		spin_lock_irq(&conf->device_lock);
+		lock_all_device_hash_locks_irq(conf);
 		conf->previous_raid_disks = conf->raid_disks;
 		rdev_for_each(rdev, conf->mddev)
 			rdev->data_offset = rdev->new_data_offset;
 		smp_wmb();
 		conf->reshape_progress = MaxSector;
-		spin_unlock_irq(&conf->device_lock);
+		unlock_all_device_hash_locks_irq(conf);
 		wake_up(&conf->wait_for_overlap);
 
 		/* read-ahead size must cover two whole stripes, which is
@@ -6425,9 +6610,9 @@ static void raid5_finish_reshape(struct
 			revalidate_disk(mddev->gendisk);
 		} else {
 			int d;
-			spin_lock_irq(&conf->device_lock);
+			lock_all_device_hash_locks_irq(conf);
 			mddev->degraded = calc_degraded(conf);
-			spin_unlock_irq(&conf->device_lock);
+			unlock_all_device_hash_locks_irq(conf);
 			for (d = conf->raid_disks ;
 			     d < conf->raid_disks - mddev->delta_disks;
 			     d++) {
@@ -6457,27 +6642,28 @@ static void raid5_quiesce(struct mddev *
 		break;
 
 	case 1: /* stop all writes */
-		spin_lock_irq(&conf->device_lock);
+		lock_all_device_hash_locks_irq(conf);
 		/* '2' tells resync/reshape to pause so that all
 		 * active stripes can drain
 		 */
 		conf->quiesce = 2;
-		wait_event_lock_irq(conf->wait_for_stripe,
+		wait_event_cmd(conf->wait_for_stripe,
 				    atomic_read(&conf->active_stripes) == 0 &&
 				    atomic_read(&conf->active_aligned_reads) == 0,
-				    conf->device_lock);
+				    unlock_all_device_hash_locks_irq(conf),
+				    lock_all_device_hash_locks_irq(conf));
 		conf->quiesce = 1;
-		spin_unlock_irq(&conf->device_lock);
+		unlock_all_device_hash_locks_irq(conf);
 		/* allow reshape to continue */
 		wake_up(&conf->wait_for_overlap);
 		break;
 
 	case 0: /* re-enable writes */
-		spin_lock_irq(&conf->device_lock);
+		lock_all_device_hash_locks_irq(conf);
 		conf->quiesce = 0;
 		wake_up(&conf->wait_for_stripe);
 		wake_up(&conf->wait_for_overlap);
-		spin_unlock_irq(&conf->device_lock);
+		unlock_all_device_hash_locks_irq(conf);
 		break;
 	}
 }
Index: linux/drivers/md/raid5.h
===================================================================
--- linux.orig/drivers/md/raid5.h	2013-09-05 08:23:42.187851834 +0800
+++ linux/drivers/md/raid5.h	2013-09-05 08:30:49.090484930 +0800
@@ -205,6 +205,7 @@ struct stripe_head {
 	short			pd_idx;		/* parity disk index */
 	short			qd_idx;		/* 'Q' disk index for raid6 */
 	short			ddf_layout;/* use DDF ordering to calculate Q */
+	short			hash_lock_index;
 	unsigned long		state;		/* state flags */
 	atomic_t		count;	      /* nr of active thread/requests */
 	int			bm_seq;	/* sequence number for bitmap flushes */
@@ -367,9 +368,13 @@ struct disk_info {
 	struct md_rdev	*rdev, *replacement;
 };
 
+#define NR_STRIPE_HASH_LOCKS 8
+#define STRIPE_HASH_LOCKS_MASK (NR_STRIPE_HASH_LOCKS - 1)
+
 struct r5worker {
 	struct work_struct work;
 	struct r5worker_group *group;
+	struct list_head temp_inactive_list[NR_STRIPE_HASH_LOCKS];
 	bool working;
 };
 
@@ -382,6 +387,8 @@ struct r5worker_group {
 
 struct r5conf {
 	struct hlist_head	*stripe_hashtbl;
+	/* only protect corresponding hash list and inactive_list */
+	spinlock_t		hash_locks[NR_STRIPE_HASH_LOCKS];
 	struct mddev		*mddev;
 	int			chunk_sectors;
 	int			level, algorithm;
@@ -462,7 +469,8 @@ struct r5conf {
 	 * Free stripes pool
 	 */
 	atomic_t		active_stripes;
-	struct list_head	inactive_list;
+	struct list_head	inactive_list[NR_STRIPE_HASH_LOCKS];
+	int			max_hash_nr_stripes[NR_STRIPE_HASH_LOCKS];
 	struct llist_head	released_stripes;
 	wait_queue_head_t	wait_for_stripe;
 	wait_queue_head_t	wait_for_overlap;
@@ -477,6 +485,7 @@ struct r5conf {
 	 * the new thread here until we fully activate the array.
 	 */
 	struct md_thread	*thread;
+	struct list_head	temp_inactive_list[NR_STRIPE_HASH_LOCKS];
 	struct r5worker_group	*worker_groups;
 	int			group_cnt;
 	int			worker_cnt_per_group;
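
For readers who want to see the deferred-release idea in isolation, here is a
minimal userspace sketch of the pattern described in the commit message: a
stripe is parked on a caller-private temporary list while device_lock is
held, and only moved to its per-hash inactive list, under that hash lock,
after device_lock has been dropped. It is not the kernel code. The toy_lock
type (a C11 atomic_flag spinlock), the singly linked lists, the single
temporary list (the patch keeps one per hash in most callers) and all names
in it are assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NR_HASH_LOCKS 8

/* Toy spinlock standing in for the kernel's spinlock_t (assumption). */
typedef struct { atomic_flag held; } toy_lock;

static void toy_lock_acquire(toy_lock *l)
{
	while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
		; /* spin until the flag was previously clear */
}

static void toy_lock_release(toy_lock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Hypothetical, heavily reduced stand-ins for stripe_head and r5conf. */
struct stripe { int hash; struct stripe *next; };

struct conf {
	toy_lock device_lock;
	toy_lock hash_locks[NR_HASH_LOCKS];
	struct stripe *inactive[NR_HASH_LOCKS];
	int inactive_count[NR_HASH_LOCKS];
};

static void conf_init(struct conf *c)
{
	atomic_flag_clear(&c->device_lock.held);
	for (int i = 0; i < NR_HASH_LOCKS; i++) {
		atomic_flag_clear(&c->hash_locks[i].held);
		c->inactive[i] = NULL;
		c->inactive_count[i] = 0;
	}
}

/* Step 1: caller holds device_lock; park the stripe on a private list. */
static void do_release(struct stripe *sh, struct stripe **temp)
{
	sh->next = *temp;
	*temp = sh;
}

/*
 * Step 2: device_lock is NOT held; move each parked stripe to its inactive
 * list under the matching hash lock. Returns the number of stripes moved.
 */
static int flush_temp(struct conf *c, struct stripe **temp)
{
	int moved = 0;

	while (*temp) {
		struct stripe *sh = *temp;

		assert(sh->hash >= 0 && sh->hash < NR_HASH_LOCKS);
		*temp = sh->next;
		toy_lock_acquire(&c->hash_locks[sh->hash]);
		sh->next = c->inactive[sh->hash];
		c->inactive[sh->hash] = sh;
		c->inactive_count[sh->hash]++;
		toy_lock_release(&c->hash_locks[sh->hash]);
		moved++;
	}
	return moved;
}

/* The two steps combined; the two locks are never held at the same time. */
static int release_stripe(struct conf *c, struct stripe *sh)
{
	struct stripe *temp = NULL;

	toy_lock_acquire(&c->device_lock);
	do_release(sh, &temp);
	toy_lock_release(&c->device_lock);
	return flush_temp(c, &temp);
}
```

Because the release path in this sketch never holds device_lock and a hash
lock together, it cannot invert the "hash lock first, then device_lock" order
used by lock_device_hash_lock(). The sketch pushes at the head of each list
for brevity, whereas the patch uses list_splice_tail_init().
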