Re: [patch v3 3/5] raid5: offload stripe handle to workqueue

On Wed, 28 Aug 2013 14:30:16 +0800 Shaohua Li <shli@xxxxxxxxxx> wrote:

> On Wed, Aug 28, 2013 at 01:53:05PM +1000, NeilBrown wrote:
> > On Tue, 27 Aug 2013 17:50:41 +0800 Shaohua Li <shli@xxxxxxxxxx> wrote:
> > 
> > > -static struct stripe_head *__get_priority_stripe(struct r5conf *conf)
> > > +static struct stripe_head *__get_priority_stripe(struct r5conf *conf, int group)
> > >  {
> > > -	struct stripe_head *sh;
> > > +	struct stripe_head *sh = NULL, *tmp;
> > > +	struct list_head *handle_list = NULL;
> > > +
> > > +	if (conf->worker_cnt_per_group == 0) {
> > > +		handle_list = &conf->handle_list;
> > > +		if (list_empty(handle_list))
> > > +			handle_list = NULL;
> > > +	} else if (group != ANY_GROUP) {
> > > +		handle_list = &conf->worker_groups[group].handle_list;
> > > +		if (list_empty(handle_list))
> > > +			handle_list = NULL;
> > > +	} else {
> > > +		int i;
> > > +		for (i = 0; i < conf->group_cnt; i++) {
> > > +			handle_list = &conf->worker_groups[i].handle_list;
> > > +			if (!list_empty(handle_list))
> > > +				break;
> > > +		}
> > > +		if (i == conf->group_cnt)
> > > +			handle_list = NULL;
> > > +	}
> > 
> > Sorry, I meant to mention this last time...
> > 
> > The setting of handle_list to NULL seems unnecessary.  It is sufficient to
> > test if it is empty.
> > The only interesting case is the last 'else' above.  That is only reached if
> > worker_cnt_per_group != 0 so handle_list will get set to some list_head.
> > If all list_heads are empty, handle_list will be set to the last list_head
> > so a list_empty(handle_list) test is perfectly safe.
> > 
> > Also with the current code:
> > >  
> > >  	pr_debug("%s: handle: %s hold: %s full_writes: %d bypass_count: %d\n",
> > >  		  __func__,
> > > -		  list_empty(&conf->handle_list) ? "empty" : "busy",
> > > +		  list_empty(handle_list) ? "empty" : "busy",
> >                   ^^^^^^^^^^^^^^^^^^^^^^^
> > This will crash.
> 
> Fixed. Other patches can still apply with hunks.
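> 
> (For reference, a condensed sketch of the fixed selection logic in
> __get_priority_stripe(), as it appears in the hunk below: no NULL
> sentinel is needed, because even when every list is empty, handle_list
> ends up pointing at a valid, empty list_head that list_empty() can
> test safely.)
> 
> 	struct list_head *handle_list = NULL;
> 
> 	if (conf->worker_cnt_per_group == 0) {
> 		handle_list = &conf->handle_list;
> 	} else if (group != ANY_GROUP) {
> 		handle_list = &conf->worker_groups[group].handle_list;
> 	} else {
> 		int i;
> 		for (i = 0; i < conf->group_cnt; i++) {
> 			handle_list = &conf->worker_groups[i].handle_list;
> 			if (!list_empty(handle_list))
> 				break;
> 		}
> 	}
> 	/* handle_list may be empty here, but is never NULL */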
> 
> 
> Subject: raid5: offload stripe handle to workqueue
> 
> This is another attempt to create multiple threads to handle raid5 stripes.
> This time I use a workqueue.
> 
> raid5 handles requests (especially writes) in stripe units. A stripe is one
> page in size, page aligned, and spans all disks. For a write to any disk
> sector, raid5 runs a state machine for the corresponding stripe, which may
> include reading some disks of the stripe, calculating parity, and writing
> some disks of the stripe. The state machine currently runs in the single
> raid5d thread, so it doesn't scale well for high speed storage. An obvious
> solution is multi-threading.
> 
> To get better performance, we have some requirements:
> a. Locality. A stripe corresponding to a request submitted from one CPU is
> best handled by a thread on the same CPU or the same NUMA node. The local
> CPU is preferred, but it can become a bottleneck, for example when parity
> calculation is too heavy; running on the local node is more widely
> applicable.
> b. Configurability. Different raid5 array setups might need different
> configurations, especially the thread count. More threads don't always mean
> better performance because of lock contention.
> 
> My original implementation created dedicated kernel threads, with
> interfaces to control which CPUs' stripes each thread should handle, and
> userspace could set the threads' affinity. This provides the most
> flexibility and configurability, but it's hard to use, and a new thread
> pool implementation is clearly disfavored.
> 
> The recent workqueue improvements are quite promising. Unbound workqueue
> workers are bound to NUMA nodes, and if the workqueue is created with
> WQ_SYSFS there are sysfs attributes for affinity tuning; for example, we
> can restrict the affinity to one HT sibling per core. Work items are
> non-reentrant by default, and we can control the number of running threads
> by limiting the number of dispatched work_structs.
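> 
> (For illustration, a condensed sketch of the workqueue setup this relies
> on; the same alloc_workqueue() call appears in raid5_init() in the patch
> below. The sysfs path is not part of this patch; on typical kernels the
> WQ_SYSFS attributes, including cpumask, show up under
> /sys/devices/virtual/workqueue/<name>/.)
> 
> 	/* Unbound, so worker threads follow the NUMA topology instead of
> 	 * the submitting CPU; WQ_SYSFS exposes attributes such as cpumask
> 	 * so userspace can tune affinity. */
> 	raid5_wq = alloc_workqueue("raid5wq",
> 		WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_CPU_INTENSIVE|WQ_SYSFS, 0);
> 	if (!raid5_wq)
> 		return -ENOMEM;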
> 
> In this patch, I create several stripe worker groups, one per NUMA node.
> Stripes from the CPUs of one node are added to that group's list, and
> workqueue threads on that node only handle stripes of that node's worker
> group. In this way stripe handling has NUMA node locality, and as said
> above, we control the thread count by limiting the number of dispatched
> work_structs.
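> 
> (A minimal sketch of the per-node routing, condensed from
> raid5_wakeup_stripe_thread() in the patch below: the stripe remembers the
> submitting CPU, the CPU maps to a NUMA-node group, and that group's
> workers are queued on the same node.)
> 
> #define cpu_to_group(cpu) cpu_to_node(cpu)
> 
> 	/* queue the stripe on its node's group list ... */
> 	group = conf->worker_groups + cpu_to_group(sh->cpu);
> 	list_add_tail(&sh->lru, &group->handle_list);
> 
> 	/* ... and kick that group's workers on the same node */
> 	for (i = 0; i < conf->worker_cnt_per_group; i++)
> 		queue_work_on(sh->cpu, raid5_wq, &group->workers[i].work);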
> 
> The work_struct callback function handles several stripes in one run (see
> the sketch after this list). Typical workqueue usage runs one unit per
> work_struct; in the raid5 case the unit would be a stripe, but we can't do
> that:
> a. Although handling a stripe needs no lock (because of reference
> accounting, and the stripe isn't on any list), queuing one work_struct per
> stripe would contend very heavily on the workqueue lock.
> b. blk_start_plug()/blk_finish_plug() should surround stripe handling, as
> we might dispatch requests. If each work_struct handled only one stripe,
> such a block plug would be meaningless.
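> 
> (A sketch of the worker callback, condensed from raid5_do_work() in the
> patch below: each invocation drains batches of stripes from its group
> under a single block plug, so the per-work_struct overhead is amortized
> and dispatched requests can still be merged.)
> 
> 	struct blk_plug plug;
> 	int handled = 0;
> 
> 	blk_start_plug(&plug);
> 	spin_lock_irq(&conf->device_lock);
> 	while (1) {
> 		int batch_size, released;
> 
> 		released = release_stripe_list(conf);
> 		/* handles up to MAX_STRIPE_BATCH stripes of this group */
> 		batch_size = handle_active_stripes(conf, group_id);
> 		if (!batch_size && !released)
> 			break;
> 		handled += batch_size;
> 	}
> 	spin_unlock_irq(&conf->device_lock);
> 	blk_finish_plug(&plug);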
> 
> This implementation can't do very fine-grained configuration, but NUMA
> binding is the most popular usage model and should be enough for most
> workloads.
> 
> Note: since we have only one stripe queue, switching to multi-threading
> might decrease the request size dispatched down to the lower layers. The
> impact depends on the thread count, RAID configuration and workload, so
> multi-threaded raid5 might not be appropriate for all setups.
> 
> Changes V1 -> V2:
> 1. remove WQ_NON_REENTRANT
> 2. disable multi-threading by default
> 3. Add more descriptions in changelog
> 
> Signed-off-by: Shaohua Li <shli@xxxxxxxxxxxx>
> ---
>  drivers/md/raid5.c |  186 ++++++++++++++++++++++++++++++++++++++++++++++++-----
>  drivers/md/raid5.h |   15 ++++
>  2 files changed, 186 insertions(+), 15 deletions(-)
> 
> Index: linux/drivers/md/raid5.c
> ===================================================================
> --- linux.orig/drivers/md/raid5.c	2013-08-28 13:54:17.260930690 +0800
> +++ linux/drivers/md/raid5.c	2013-08-28 13:59:13.225210586 +0800
> @@ -53,6 +53,7 @@
>  #include <linux/cpu.h>
>  #include <linux/slab.h>
>  #include <linux/ratelimit.h>
> +#include <linux/nodemask.h>
>  #include <trace/events/block.h>
>  
>  #include "md.h"
> @@ -60,6 +61,10 @@
>  #include "raid0.h"
>  #include "bitmap.h"
>  
> +#define cpu_to_group(cpu) cpu_to_node(cpu)
> +#define ANY_GROUP NUMA_NO_NODE
> +
> +static struct workqueue_struct *raid5_wq;
>  /*
>   * Stripe cache
>   */
> @@ -200,6 +205,34 @@ static int stripe_operations_active(stru
>  	       test_bit(STRIPE_COMPUTE_RUN, &sh->state);
>  }
>  
> +static void raid5_wakeup_stripe_thread(struct stripe_head *sh)
> +{
> +	struct r5conf *conf = sh->raid_conf;
> +	struct r5worker_group *group;
> +	int i, cpu = sh->cpu;
> +
> +	if (!cpu_online(cpu)) {
> +		cpu = cpumask_any(cpu_online_mask);
> +		sh->cpu = cpu;
> +	}
> +
> +	if (list_empty(&sh->lru)) {
> +		struct r5worker_group *group;
> +		group = conf->worker_groups + cpu_to_group(cpu);
> +		list_add_tail(&sh->lru, &group->handle_list);
> +	}
> +
> +	if (conf->worker_cnt_per_group == 0) {
> +		md_wakeup_thread(conf->mddev->thread);
> +		return;
> +	}
> +
> +	group = conf->worker_groups + cpu_to_group(sh->cpu);
> +
> +	for (i = 0; i < conf->worker_cnt_per_group; i++)
> +		queue_work_on(sh->cpu, raid5_wq, &group->workers[i].work);
> +}
> +
>  static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh)
>  {
>  	BUG_ON(!list_empty(&sh->lru));
> @@ -214,7 +247,12 @@ static void do_release_stripe(struct r5c
>  		else {
>  			clear_bit(STRIPE_DELAYED, &sh->state);
>  			clear_bit(STRIPE_BIT_DELAY, &sh->state);
> -			list_add_tail(&sh->lru, &conf->handle_list);
> +			if (conf->worker_cnt_per_group == 0) {
> +				list_add_tail(&sh->lru, &conf->handle_list);
> +			} else {
> +				raid5_wakeup_stripe_thread(sh);
> +				return;
> +			}
>  		}
>  		md_wakeup_thread(conf->mddev->thread);
>  	} else {
> @@ -409,6 +447,7 @@ static void init_stripe(struct stripe_he
>  		raid5_build_block(sh, i, previous);
>  	}
>  	insert_hash(conf, sh);
> +	sh->cpu = smp_processor_id();
>  }
>  
>  static struct stripe_head *__find_stripe(struct r5conf *conf, sector_t sector,
> @@ -3830,6 +3869,7 @@ static void raid5_activate_delayed(struc
>  			if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
>  				atomic_inc(&conf->preread_active_stripes);
>  			list_add_tail(&sh->lru, &conf->hold_list);
> +			raid5_wakeup_stripe_thread(sh);
>  		}
>  	}
>  }
> @@ -4109,18 +4149,32 @@ static int chunk_aligned_read(struct mdd
>   * head of the hold_list has changed, i.e. the head was promoted to the
>   * handle_list.
>   */
> -static struct stripe_head *__get_priority_stripe(struct r5conf *conf)
> +static struct stripe_head *__get_priority_stripe(struct r5conf *conf, int group)
>  {
> -	struct stripe_head *sh;
> +	struct stripe_head *sh = NULL, *tmp;
> +	struct list_head *handle_list = NULL;
> +
> +	if (conf->worker_cnt_per_group == 0) {
> +		handle_list = &conf->handle_list;
> +	} else if (group != ANY_GROUP) {
> +		handle_list = &conf->worker_groups[group].handle_list;
> +	} else {
> +		int i;
> +		for (i = 0; i < conf->group_cnt; i++) {
> +			handle_list = &conf->worker_groups[i].handle_list;
> +			if (!list_empty(handle_list))
> +				break;
> +		}
> +	}
>  
>  	pr_debug("%s: handle: %s hold: %s full_writes: %d bypass_count: %d\n",
>  		  __func__,
> -		  list_empty(&conf->handle_list) ? "empty" : "busy",
> +		  list_empty(handle_list) ? "empty" : "busy",
>  		  list_empty(&conf->hold_list) ? "empty" : "busy",
>  		  atomic_read(&conf->pending_full_writes), conf->bypass_count);
>  
> -	if (!list_empty(&conf->handle_list)) {
> -		sh = list_entry(conf->handle_list.next, typeof(*sh), lru);
> +	if (!list_empty(handle_list)) {
> +		sh = list_entry(handle_list->next, typeof(*sh), lru);
>  
>  		if (list_empty(&conf->hold_list))
>  			conf->bypass_count = 0;
> @@ -4138,12 +4192,25 @@ static struct stripe_head *__get_priorit
>  		   ((conf->bypass_threshold &&
>  		     conf->bypass_count > conf->bypass_threshold) ||
>  		    atomic_read(&conf->pending_full_writes) == 0)) {
> -		sh = list_entry(conf->hold_list.next,
> -				typeof(*sh), lru);
> -		conf->bypass_count -= conf->bypass_threshold;
> -		if (conf->bypass_count < 0)
> -			conf->bypass_count = 0;
> -	} else
> +
> +		list_for_each_entry(tmp, &conf->hold_list,  lru) {
> +			if (conf->worker_cnt_per_group == 0 ||
> +			    group == ANY_GROUP ||
> +			    !cpu_online(tmp->cpu) ||
> +			    cpu_to_group(tmp->cpu) == group) {
> +				sh = tmp;
> +				break;
> +			}
> +		}
> +
> +		if (sh) {
> +			conf->bypass_count -= conf->bypass_threshold;
> +			if (conf->bypass_count < 0)
> +				conf->bypass_count = 0;
> +		}
> +	}
> +
> +	if (!sh)
>  		return NULL;
>  
>  	list_del_init(&sh->lru);
> @@ -4853,13 +4920,13 @@ static int  retry_aligned_read(struct r5
>  }
>  
>  #define MAX_STRIPE_BATCH 8
> -static int handle_active_stripes(struct r5conf *conf)
> +static int handle_active_stripes(struct r5conf *conf, int group)
>  {
>  	struct stripe_head *batch[MAX_STRIPE_BATCH], *sh;
>  	int i, batch_size = 0;
>  
>  	while (batch_size < MAX_STRIPE_BATCH &&
> -			(sh = __get_priority_stripe(conf)) != NULL)
> +			(sh = __get_priority_stripe(conf, group)) != NULL)
>  		batch[batch_size++] = sh;
>  
>  	if (batch_size == 0)
> @@ -4877,6 +4944,38 @@ static int handle_active_stripes(struct
>  	return batch_size;
>  }
>  
> +static void raid5_do_work(struct work_struct *work)
> +{
> +	struct r5worker *worker = container_of(work, struct r5worker, work);
> +	struct r5worker_group *group = worker->group;
> +	struct r5conf *conf = group->conf;
> +	int group_id = group - conf->worker_groups;
> +	int handled;
> +	struct blk_plug plug;
> +
> +	pr_debug("+++ raid5worker active\n");
> +
> +	blk_start_plug(&plug);
> +	handled = 0;
> +	spin_lock_irq(&conf->device_lock);
> +	while (1) {
> +		int batch_size, released;
> +
> +		released = release_stripe_list(conf);
> +
> +		batch_size = handle_active_stripes(conf, group_id);
> +		if (!batch_size && !released)
> +			break;
> +		handled += batch_size;
> +	}
> +	pr_debug("%d stripes handled\n", handled);
> +
> +	spin_unlock_irq(&conf->device_lock);
> +	blk_finish_plug(&plug);
> +
> +	pr_debug("--- raid5worker inactive\n");
> +}
> +
>  /*
>   * This is our raid5 kernel thread.
>   *
> @@ -4926,7 +5025,7 @@ static void raid5d(struct md_thread *thr
>  			handled++;
>  		}
>  
> -		batch_size = handle_active_stripes(conf);
> +		batch_size = handle_active_stripes(conf, ANY_GROUP);
>  		if (!batch_size && !released)
>  			break;
>  		handled += batch_size;
> @@ -5066,6 +5165,54 @@ static struct attribute_group raid5_attr
>  	.attrs = raid5_attrs,
>  };
>  
> +static int alloc_thread_groups(struct r5conf *conf, int cnt)
> +{
> +	int i, j;
> +	ssize_t size;
> +	struct r5worker *workers;
> +
> +	conf->worker_cnt_per_group = cnt;
> +	if (cnt == 0) {
> +		conf->worker_groups = NULL;
> +		return 0;
> +	}
> +	conf->group_cnt = num_possible_nodes();
> +	size = sizeof(struct r5worker) * cnt;
> +	workers = kzalloc(size * conf->group_cnt, GFP_NOIO);
> +	conf->worker_groups = kzalloc(sizeof(struct r5worker_group) *
> +				conf->group_cnt, GFP_NOIO);
> +	if (!conf->worker_groups || !workers) {
> +		kfree(workers);
> +		kfree(conf->worker_groups);
> +		conf->worker_groups = NULL;
> +		return -ENOMEM;
> +	}
> +
> +	for (i = 0; i < conf->group_cnt; i++) {
> +		struct r5worker_group *group;
> +
> +		group = &conf->worker_groups[i];
> +		INIT_LIST_HEAD(&group->handle_list);
> +		group->conf = conf;
> +		group->workers = workers + i * cnt;
> +
> +		for (j = 0; j < cnt; j++) {
> +			group->workers[j].group = group;
> +			INIT_WORK(&group->workers[j].work, raid5_do_work);
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static void free_thread_groups(struct r5conf *conf)
> +{
> +	if (conf->worker_groups)
> +		kfree(conf->worker_groups[0].workers);
> +	kfree(conf->worker_groups);
> +	conf->worker_groups = NULL;
> +}
> +
>  static sector_t
>  raid5_size(struct mddev *mddev, sector_t sectors, int raid_disks)
>  {
> @@ -5106,6 +5253,7 @@ static void raid5_free_percpu(struct r5c
>  
>  static void free_conf(struct r5conf *conf)
>  {
> +	free_thread_groups(conf);
>  	shrink_stripes(conf);
>  	raid5_free_percpu(conf);
>  	kfree(conf->disks);
> @@ -5234,6 +5382,9 @@ static struct r5conf *setup_conf(struct
>  	conf = kzalloc(sizeof(struct r5conf), GFP_KERNEL);
>  	if (conf == NULL)
>  		goto abort;
> +	/* Don't enable multi-threading by default*/
> +	if (alloc_thread_groups(conf, 0))
> +		goto abort;
>  	spin_lock_init(&conf->device_lock);
>  	seqcount_init(&conf->gen_lock);
>  	init_waitqueue_head(&conf->wait_for_stripe);
> @@ -6549,6 +6700,10 @@ static struct md_personality raid4_perso
>  
>  static int __init raid5_init(void)
>  {
> +	raid5_wq = alloc_workqueue("raid5wq",
> +		WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_CPU_INTENSIVE|WQ_SYSFS, 0);
> +	if (!raid5_wq)
> +		return -ENOMEM;
>  	register_md_personality(&raid6_personality);
>  	register_md_personality(&raid5_personality);
>  	register_md_personality(&raid4_personality);
> @@ -6560,6 +6715,7 @@ static void raid5_exit(void)
>  	unregister_md_personality(&raid6_personality);
>  	unregister_md_personality(&raid5_personality);
>  	unregister_md_personality(&raid4_personality);
> +	destroy_workqueue(raid5_wq);
>  }
>  
>  module_init(raid5_init);
> Index: linux/drivers/md/raid5.h
> ===================================================================
> --- linux.orig/drivers/md/raid5.h	2013-08-28 13:54:17.260930690 +0800
> +++ linux/drivers/md/raid5.h	2013-08-28 13:54:17.260930690 +0800
> @@ -212,6 +212,7 @@ struct stripe_head {
>  	enum check_states	check_state;
>  	enum reconstruct_states reconstruct_state;
>  	spinlock_t		stripe_lock;
> +	int			cpu;
>  	/**
>  	 * struct stripe_operations
>  	 * @target - STRIPE_OP_COMPUTE_BLK target
> @@ -365,6 +366,17 @@ struct disk_info {
>  	struct md_rdev	*rdev, *replacement;
>  };
>  
> +struct r5worker {
> +	struct work_struct work;
> +	struct r5worker_group *group;
> +};
> +
> +struct r5worker_group {
> +	struct list_head handle_list;
> +	struct r5conf *conf;
> +	struct r5worker *workers;
> +};
> +
>  struct r5conf {
>  	struct hlist_head	*stripe_hashtbl;
>  	struct mddev		*mddev;
> @@ -462,6 +474,9 @@ struct r5conf {
>  	 * the new thread here until we fully activate the array.
>  	 */
>  	struct md_thread	*thread;
> +	struct r5worker_group	*worker_groups;
> +	int			group_cnt;
> +	int			worker_cnt_per_group;
>  };
>  
>  /*


Applied, thanks.

NeilBrown
