Re: [PATCH RFC 01/14] block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler

Paolo Valente <paolo.valente@xxxxxxxxxx> · Tue, 14 Mar 2017 15:18:08 +0100

> Il giorno 06 mar 2017, alle ore 20:40, Bart Van Assche <bart.vanassche@xxxxxxxxxxx> ha scritto:
> 

Hi Bart,
thanks for such an accurate review.  I'm addressing the issues you
raised, and I'll get back in touch as soon as I have finished.

Paolo

> On 03/04/2017 08:01 AM, Paolo Valente wrote:
>> BFQ is a proportional-share I/O scheduler, whose general structure,
>> plus a lot of code, are borrowed from CFQ.
>> [ ... ]
> 
> This description is very useful. However, since it is identical to the
> description this patch adds to Documentation/block/bfq-iosched.txt I
> propose to leave it out from the patch description.
> 
> What seems missing to me is an overview of the limitations of BFQ. Does
> BFQ e.g. support multiple hardware queues?
> 
>> +3. What are BFQ's tunable?
>> +==========================
>> +[ ... ]
> 
> A thorough knowledge of BFQ is required to tune it properly. Users don't
> want to tune I/O schedulers. Has it been considered to invent algorithms
> to tune these parameters automatically?
> 
>> + * Licensed under GPL-2.
> 
> The COPYING file at the top of the tree mentions that GPL-v2 licensing
> should be specified as follows close to the start of each source file:
> 
>    This program is free software; you can redistribute it and/or modify
>    it under the terms of the GNU General Public License as published by
>    the Free Software Foundation; either version 2 of the License, or
>    (at your option) any later version.
> 
>    This program is distributed in the hope that it will be useful,
>    but WITHOUT ANY WARRANTY; without even the implied warranty of
>    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>    GNU General Public License for more details.
> 
>    You should have received a copy of the GNU General Public License
>    along with this program; if not, write to the Free Software
>    Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
>    02110-1301 USA
> 
>> + * BFQ is a proportional-share I/O scheduler, with some extra
>> + * low-latency capabilities. BFQ also supports full hierarchical
>> + * scheduling through cgroups. Next paragraphs provide an introduction
>> + * on BFQ inner workings. Details on BFQ benefits and usage can be
>> + * found in Documentation/block/bfq-iosched.txt.
> 
> That reference should be sufficient - please do not duplicate
> Documentation/block/bfq-iosched.txt in block/bfq-iosched.c.
> 
>> +/**
>> + * struct bfq_service_tree - per ioprio_class service tree.
>> + *
>> + * Each service tree represents a B-WF2Q+ scheduler on its own.  Each
>> + * ioprio_class has its own independent scheduler, and so its own
>> + * bfq_service_tree.  All the fields are protected by the queue lock
>> + * of the containing bfqd.
>> + */
>> +struct bfq_service_tree {
>> +	/* tree for active entities (i.e., those backlogged) */
>> +	struct rb_root active;
>> +	/* tree for idle entities (i.e., not backlogged, with V <= F_i)*/
>> +	struct rb_root idle;
>> +
>> +	struct bfq_entity *first_idle;	/* idle entity with minimum F_i */
>> +	struct bfq_entity *last_idle;	/* idle entity with maximum F_i */
>> +
>> +	u64 vtime; /* scheduler virtual time */
>> +	/* scheduler weight sum; active and idle entities contribute to it */
>> +	unsigned long wsum;
>> +};
> 
> Inline comments next to structure members are ugly and make the
> structure definition hard to read. Please follow the instructions in
> Documentation/kernel-doc-nano-HOWTO.txt for documenting structure members.
> 
>> +	u64 finish; /* B-WF2Q+ finish timestamp (aka F_i) */
>> +	u64 start;  /* B-WF2Q+ start timestamp (aka S_i) */
> 
> For all times and timestamps, please document the time unit (e.g. s, ms,
> us, ns, jiffies, ...).
> 
>> +enum bfq_device_speed {
>> +	BFQ_BFQD_FAST,
>> +	BFQ_BFQD_SLOW,
>> +};
> 
> What is the meaning of "fast" and "slow" devices in this context?
> Anyway, since the first patch that uses this enum is patch 6, please
> defer introduction of this enum until patch 6.
> 
>> +
>> +/**
>> + * struct bfq_data - per-device data structure.
>> + *
>> + * All the fields are protected by @lock.
>> + */
>> +struct bfq_data {
>> +	/* device request queue */
>> +	struct request_queue *queue;
>> +	[ ... ]
>> +
>> +	/* on-disk position of the last served request */
>> +	sector_t last_position;
> 
> What is the relevance of last_position if there are multiple hardware
> queues? Will the BFQ algorithm fail to realize its guarantees in that case?
> 
> What is the relevance of this structure member for block devices that
> have multiple spindles, e.g. arrays of hard disks?
> 
>> +enum bfqq_state_flags {
>> +	BFQ_BFQQ_FLAG_busy = 0,		/* has requests or is in service */
>> +	BFQ_BFQQ_FLAG_wait_request,	/* waiting for a request */
>> +	BFQ_BFQQ_FLAG_non_blocking_wait_rq, /*
>> +					     * waiting for a request
>> +					     * without idling the device
>> +					     */
>> +	BFQ_BFQQ_FLAG_fifo_expire,	/* FIFO checked in this slice */
>> +	BFQ_BFQQ_FLAG_idle_window,	/* slice idling enabled */
>> +	BFQ_BFQQ_FLAG_sync,		/* synchronous queue */
>> +	BFQ_BFQQ_FLAG_budget_new,	/* no completion with this budget */
>> +	BFQ_BFQQ_FLAG_IO_bound,		/*
>> +					 * bfqq has timed-out at least once
>> +					 * having consumed at most 2/10 of
>> +					 * its budget
>> +					 */
>> +};
> 
> The "BFQ_BFQQ_FLAG_" prefix looks silly and too long to me. How about
> e.g. using the prefix "BFQQF_" instead?
> 
>> +#define BFQ_BFQQ_FNS(name)						\
>> +static void bfq_mark_bfqq_##name(struct bfq_queue *bfqq)		\
>> +{									\
>> +	(bfqq)->flags |= (1 << BFQ_BFQQ_FLAG_##name);			\
>> +}									\
>> +static void bfq_clear_bfqq_##name(struct bfq_queue *bfqq)		\
>> +{									\
>> +	(bfqq)->flags &= ~(1 << BFQ_BFQQ_FLAG_##name);			\
>> +}									\
>> +static int bfq_bfqq_##name(const struct bfq_queue *bfqq)		\
>> +{									\
>> +	return ((bfqq)->flags & (1 << BFQ_BFQQ_FLAG_##name)) != 0;	\
>> +}
> 
> Are the bodies of the above functions duplicates of __set_bit(),
> __clear_bit() and test_bit()?
> 
>> +/* Expiration reasons. */
>> +enum bfqq_expiration {
>> +	BFQ_BFQQ_TOO_IDLE = 0,		/*
>> +					 * queue has been idling for
>> +					 * too long
>> +					 */
>> +	BFQ_BFQQ_BUDGET_TIMEOUT,	/* budget took too long to be used */
>> +	BFQ_BFQQ_BUDGET_EXHAUSTED,	/* budget consumed */
>> +	BFQ_BFQQ_NO_MORE_REQUESTS,	/* the queue has no more requests */
>> +	BFQ_BFQQ_PREEMPTED		/* preemption in progress */
>> +};
> 
> The prefix of these constants refers twice to "BFQ" and does not make it
> clear that these constants are about expiration. How about using the
> "BFQQE_" prefix instead?
> 
>> +/* Maximum backwards seek, in KiB. */
>> +static const int bfq_back_max = 16 * 1024;
> 
> Where does this constant come from? Should it depend on geometry data
> like e.g. the number of sectors in a cylinder?
> 
>> +#define for_each_entity(entity)	\
>> +	for (; entity ; entity = NULL)
> 
> Why has this confusing #define been introduced? Shouldn't all
> occurrences of this macro be changed into the equivalent "if (entity)"?
> We don't want silly macros like this in the Linux kernel.
> 
>> +#define for_each_entity_safe(entity, parent) \
>> +	for (parent = NULL; entity ; entity = parent)
> 
> Same question here - why has this macro been introduced and how has its
> name been chosen? Since this macro is used only once and since no value
> is assigned to 'parent' in the code controlled by this construct, please
> remove this macro and use something that is less confusing than a "for"
> loop for something that is not a loop.
> 
>> +/**
>> + * bfq_weight_to_ioprio - calc an ioprio from a weight.
>> + * @weight: the weight value to convert.
>> + *
>> + * To preserve as much as possible the old only-ioprio user interface,
>> + * 0 is used as an escape ioprio value for weights (numerically) equal or
>> + * larger than IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF.
>> + */
>> +static unsigned short bfq_weight_to_ioprio(int weight)
>> +{
>> +	return IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF - weight < 0 ?
>> +		0 : IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF - weight;
>> +}
> 
> Please consider using max() or max_t() to make this function less verbose.
> 
>> +
>> +/**
>> + * bfq_active_extract - remove an entity from the active tree.
>> + * @st: the service_tree containing the tree.
>> + * @entity: the entity being removed.
>> + */
>> +static void bfq_active_extract(struct bfq_service_tree *st,
>> +			       struct bfq_entity *entity)
>> +{
>> +	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
>> +	struct rb_node *node;
>> +
>> +	node = bfq_find_deepest(&entity->rb_node);
>> +	bfq_extract(&st->active, entity);
>> +
>> +	if (node)
>> +		bfq_update_active_tree(node);
>> +
>> +	if (bfqq)
>> +		list_del(&bfqq->bfqq_list);
>> +}
> 
> Which locks protect the data structures manipulated by this and other
> functions? Have you considered to use lockdep_assert_held() to document
> these assumptions?
> 
>> +	case (BFQ_RQ1_WRAP|BFQ_RQ2_WRAP): /* both rqs wrapped */
> 
> Please don't use parentheses if no confusion is possible. Additionally,
> checkpatch should have requested you to insert a space before and after
> the logical or operator.
> 
>> +static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
>> +				       struct bfq_queue *bfqq)
>> +{
>> +	if (bfqq) {
>> +		bfq_mark_bfqq_budget_new(bfqq);
>> +		bfq_clear_bfqq_fifo_expire(bfqq);
>> +
>> +		bfqd->budgets_assigned = (bfqd->budgets_assigned*7 + 256) / 8;
> 
> Checkpatch should have asked you to insert spaces around the
> multiplication operator.
> 
>> +/*
>> + * bfq_default_budget - return the default budget for @bfqq on @bfqd.
>> + * @bfqd: the device descriptor.
>> + * @bfqq: the queue to consider.
>> + *
>> + * We use 3/4 of the @bfqd maximum budget as the default value
>> + * for the max_budget field of the queues.  This lets the feedback
>> + * mechanism to start from some middle ground, then the behavior
>> + * of the process will drive the heuristics towards high values, if
>> + * it behaves as a greedy sequential reader, or towards small values
>> + * if it shows a more intermittent behavior.
>> + */
>> +static unsigned long bfq_default_budget(struct bfq_data *bfqd,
>> +					struct bfq_queue *bfqq)
>> +{
>> +	unsigned long budget;
>> +
>> +	/*
>> +	 * When we need an estimate of the peak rate we need to avoid
>> +	 * to give budgets that are too short due to previous measurements.
>> +	 * So, in the first 10 assignments use a ``safe'' budget value.
>> +	 */
>> +	if (bfqd->budgets_assigned < 194 && bfqd->bfq_user_max_budget == 0)
>> +		budget = bfq_default_max_budget;
>> +	else
>> +		budget = bfqd->bfq_max_budget;
>> +
>> +	return budget - budget / 4;
>> +}
> 
> Where does the magic constant "194" come from?
> 
> 
>> +	} else
>> +		/*
>> +		 * Async queues get always the maximum possible
>> +		 * budget, as for them we do not care about latency
>> +		 * (in addition, their ability to dispatch is limited
>> +		 * by the charging factor).
>> +		 */
>> +		budget = bfqd->bfq_max_budget;
>> +
> 
> Please balance braces. Checkpatch should have warned about the use of "}
> else" instead of "} else {".
> 
>> +static unsigned long bfq_calc_max_budget(u64 peak_rate, u64 timeout)
>> +{
>> +	unsigned long max_budget;
>> +
>> +	/*
>> +	 * The max_budget calculated when autotuning is equal to the
>> +	 * amount of sectors transferred in timeout at the
>> +	 * estimated peak rate.
>> +	 */
>> +	max_budget = (unsigned long)(peak_rate * 1000 *
>> +				     timeout >> BFQ_RATE_SHIFT);
>> +
>> +	return max_budget;
>> +}
> 
> Where does the constant 1000 come from? What are the units of peak_rate
> and timeout? What is the maximum value of peak_rate? Can the
> multiplication overflow?
> 
>> +/*
>> + * In addition to updating the peak rate, checks whether the process
>> + * is "slow", and returns 1 if so. This slow flag is used, in addition
>> + * to the budget timeout, to reduce the amount of service provided to
>> + * seeky processes, and hence reduce their chances to lower the
>> + * throughput. See the code for more details.
>> + */
>> +static bool bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue *bfqq,
>> +				 bool compensate)
>> +{
>> +	u64 bw, usecs, expected, timeout;
>> +	ktime_t delta;
>> +	int update = 0;
>> +
>> +	if (!bfq_bfqq_sync(bfqq) || bfq_bfqq_budget_new(bfqq))
>> +		return false;
>> +
>> +	if (compensate)
>> +		delta = bfqd->last_idling_start;
>> +	else
>> +		delta = ktime_get();
>> +	delta = ktime_sub(delta, bfqd->last_budget_start);
>> +	usecs = ktime_to_us(delta);
>> +
>> +	/* Don't trust short/unrealistic values. */
>> +	if (usecs < 100 || usecs >= LONG_MAX)
>> +		return false;
> 
> If usecs >= LONG_MAX that indicates a kernel bug. Please consider
> triggering a kernel warning in that case.
> 
>> +/*
>> + * Budget timeout is not implemented through a dedicated timer, but
>> + * just checked on request arrivals and completions, as well as on
>> + * idle timer expirations.
>> + */
>> +static bool bfq_bfqq_budget_timeout(struct bfq_queue *bfqq)
>> +{
>> +	if (bfq_bfqq_budget_new(bfqq) ||
>> +	    time_before(jiffies, bfqq->budget_timeout))
>> +		return false;
>> +	return true;
>> +}
> 
> Have you considered to use time_is_after_jiffies() instead of
> time_before(jiffies, ...)?
> 
>> +static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
>> +			  struct bfq_io_cq *bic, pid_t pid, int is_sync)
>> +{
>> +	RB_CLEAR_NODE(&bfqq->entity.rb_node);
>> +	INIT_LIST_HEAD(&bfqq->fifo);
>> +
>> +	bfqq->ref = 0;
>> +	bfqq->bfqd = bfqd;
>> +
>> +	if (bic)
>> +		bfq_set_next_ioprio_data(bfqq, bic);
>> +
>> +	if (is_sync) {
>> +		if (!bfq_class_idle(bfqq))
>> +			bfq_mark_bfqq_idle_window(bfqq);
>> +		bfq_mark_bfqq_sync(bfqq);
>> +	} else
>> +		bfq_clear_bfqq_sync(bfqq);
>> +
>> +	bfqq->ttime.last_end_request = ktime_get_ns() - (1ULL<<32);
>> +
>> +	bfq_mark_bfqq_IO_bound(bfqq);
>> +
>> +	bfqq->pid = pid;
>> +
>> +	/* Tentative initial value to trade off between thr and lat */
>> +	bfqq->max_budget = bfq_default_budget(bfqd, bfqq);
>> +	bfqq->budget_timeout = bfq_smallest_from_now();
>> +	bfqq->pid = pid;
>> +
>> +	/* first request is almost certainly seeky */
>> +	bfqq->seek_history = 1;
>> +}
> 
> What is the meaning of the 1ULL << 32 constant?
> 
>> +static int __init bfq_init(void)
>> +{
>> +	int ret;
>> +	char msg[50] = "BFQ I/O-scheduler: v0";
> 
> Please leave out "[50]" and use "static const char" instead of "char".
> 
>> diff --git a/block/elevator.c b/block/elevator.c
>> index 01139f5..786fdcd 100644
>> --- a/block/elevator.c
>> +++ b/block/elevator.c
>> @@ -221,14 +221,20 @@ int elevator_init(struct request_queue *q, char *name)
>> 
>> 	if (!e) {
>> 		/*
>> -		 * For blk-mq devices, we default to using mq-deadline,
>> -		 * if available, for single queue devices. If deadline
>> -		 * isn't available OR we have multiple queues, default
>> -		 * to "none".
>> +		 * For blk-mq devices, we default to using bfq, if
>> +		 * available, for single queue devices. If bfq isn't
>> +		 * available, we try mq-deadline. If neither is
>> +		 * available, OR we have multiple queues, default to
>> +		 * "none".
>> 		 */
>> 		if (q->mq_ops) {
>> +			if (q->nr_hw_queues == 1) {
>> +				e = elevator_get("bfq", false);
>> +				if (!e)
>> +					e = elevator_get("mq-deadline", false);
>> +			}
>> 			if (q->nr_hw_queues == 1)
>> -				e = elevator_get("mq-deadline", false);
>> +				e = elevator_get("bfq", false);
>> 			if (!e)
>> 				return 0;
>> 		} else
>> 
> 
> As Jens wrote, it's way too early to make BFQ the default scheduler.
> 
> Bart.