> Il giorno 06 mar 2017, alle ore 20:40, Bart Van Assche <bart.vanassche@xxxxxxxxxxx> ha scritto: > Hi Bart, thanks for such an accurate review. I'm addressing the issues you raised, and I'll get back in touch as soon as I have finished. Paolo > On 03/04/2017 08:01 AM, Paolo Valente wrote: >> BFQ is a proportional-share I/O scheduler, whose general structure, >> plus a lot of code, are borrowed from CFQ. >> [ ... ] > > This description is very useful. However, since it is identical to the > description this patch adds to Documentation/block/bfq-iosched.txt I > propose to leave it out from the patch description. > > What seems missing to me is an overview of the limitations of BFQ. Does > BFQ e.g. support multiple hardware queues? > >> +3. What are BFQ's tunable? >> +========================== >> +[ ... ] > > A thorough knowledge of BFQ is required to tune it properly. Users don't > want to tune I/O schedulers. Has it been considered to invent algorithms > to tune these parameters automatically? > >> + * Licensed under GPL-2. > > The COPYING file at the top of the tree mentions that GPL-v2 licensing > should be specified as follows close to the start of each source file: > > This program is free software; you can redistribute it and/or modify > it under the terms of the GNU General Public License as published by > the Free Software Foundation; either version 2 of the License, or > (at your option) any later version. > > This program is distributed in the hope that it will be useful, > but WITHOUT ANY WARRANTY; without even the implied warranty of > MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > GNU General Public License for more details. > > You should have received a copy of the GNU General Public License > along with this program; if not, write to the Free Software > Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA > 02110-1301 USA > >> + * BFQ is a proportional-share I/O scheduler, with some extra >> + * low-latency capabilities. BFQ also supports full hierarchical >> + * scheduling through cgroups. Next paragraphs provide an introduction >> + * on BFQ inner workings. Details on BFQ benefits and usage can be >> + * found in Documentation/block/bfq-iosched.txt. > > That reference should be sufficient - please do not duplicate > Documentation/block/bfq-iosched.txt in block/bfq-iosched.c. > >> +/** >> + * struct bfq_service_tree - per ioprio_class service tree. >> + * >> + * Each service tree represents a B-WF2Q+ scheduler on its own. Each >> + * ioprio_class has its own independent scheduler, and so its own >> + * bfq_service_tree. All the fields are protected by the queue lock >> + * of the containing bfqd. >> + */ >> +struct bfq_service_tree { >> + /* tree for active entities (i.e., those backlogged) */ >> + struct rb_root active; >> + /* tree for idle entities (i.e., not backlogged, with V <= F_i)*/ >> + struct rb_root idle; >> + >> + struct bfq_entity *first_idle; /* idle entity with minimum F_i */ >> + struct bfq_entity *last_idle; /* idle entity with maximum F_i */ >> + >> + u64 vtime; /* scheduler virtual time */ >> + /* scheduler weight sum; active and idle entities contribute to it */ >> + unsigned long wsum; >> +}; > > Inline comments next to structure members are ugly and make the > structure definition hard to read. Please follow the instructions in > Documentation/kernel-doc-nano-HOWTO.txt for documenting structure members. > >> + u64 finish; /* B-WF2Q+ finish timestamp (aka F_i) */ >> + u64 start; /* B-WF2Q+ start timestamp (aka S_i) */ > > For all times and timestamps, please document the time unit (e.g. s, ms, > us, ns, jiffies, ...). > >> +enum bfq_device_speed { >> + BFQ_BFQD_FAST, >> + BFQ_BFQD_SLOW, >> +}; > > What is the meaning of "fast" and "slow" devices in this context? > Anyway, since the first patch that uses this enum is patch 6, please > defer introduction of this enum until patch 6. > >> + >> +/** >> + * struct bfq_data - per-device data structure. >> + * >> + * All the fields are protected by @lock. >> + */ >> +struct bfq_data { >> + /* device request queue */ >> + struct request_queue *queue; >> + [ ... ] >> + >> + /* on-disk position of the last served request */ >> + sector_t last_position; > > What is the relevance of last_position if there are multiple hardware > queues? Will the BFQ algorithm fail to realize its guarantees in that case? > > What is the relevance of this structure member for block devices that > have multiple spindles, e.g. arrays of hard disks? > >> +enum bfqq_state_flags { >> + BFQ_BFQQ_FLAG_busy = 0, /* has requests or is in service */ >> + BFQ_BFQQ_FLAG_wait_request, /* waiting for a request */ >> + BFQ_BFQQ_FLAG_non_blocking_wait_rq, /* >> + * waiting for a request >> + * without idling the device >> + */ >> + BFQ_BFQQ_FLAG_fifo_expire, /* FIFO checked in this slice */ >> + BFQ_BFQQ_FLAG_idle_window, /* slice idling enabled */ >> + BFQ_BFQQ_FLAG_sync, /* synchronous queue */ >> + BFQ_BFQQ_FLAG_budget_new, /* no completion with this budget */ >> + BFQ_BFQQ_FLAG_IO_bound, /* >> + * bfqq has timed-out at least once >> + * having consumed at most 2/10 of >> + * its budget >> + */ >> +}; > > The "BFQ_BFQQ_FLAG_" prefix looks silly and too long to me. How about > e.g. using the prefix "BFQQF_" instead? > >> +#define BFQ_BFQQ_FNS(name) \ >> +static void bfq_mark_bfqq_##name(struct bfq_queue *bfqq) \ >> +{ \ >> + (bfqq)->flags |= (1 << BFQ_BFQQ_FLAG_##name); \ >> +} \ >> +static void bfq_clear_bfqq_##name(struct bfq_queue *bfqq) \ >> +{ \ >> + (bfqq)->flags &= ~(1 << BFQ_BFQQ_FLAG_##name); \ >> +} \ >> +static int bfq_bfqq_##name(const struct bfq_queue *bfqq) \ >> +{ \ >> + return ((bfqq)->flags & (1 << BFQ_BFQQ_FLAG_##name)) != 0; \ >> +} > > Are the bodies of the above functions duplicates of __set_bit(), > __clear_bit() and test_bit()? > >> +/* Expiration reasons. */ >> +enum bfqq_expiration { >> + BFQ_BFQQ_TOO_IDLE = 0, /* >> + * queue has been idling for >> + * too long >> + */ >> + BFQ_BFQQ_BUDGET_TIMEOUT, /* budget took too long to be used */ >> + BFQ_BFQQ_BUDGET_EXHAUSTED, /* budget consumed */ >> + BFQ_BFQQ_NO_MORE_REQUESTS, /* the queue has no more requests */ >> + BFQ_BFQQ_PREEMPTED /* preemption in progress */ >> +}; > > The prefix of these constants refers twice to "BFQ" and does not make it > clear that these constants are about expiration. How about using the > "BFQQE_" prefix instead? > >> +/* Maximum backwards seek, in KiB. */ >> +static const int bfq_back_max = 16 * 1024; > > Where does this constant come from? Should it depend on geometry data > like e.g. the number of sectors in a cylinder? > >> +#define for_each_entity(entity) \ >> + for (; entity ; entity = NULL) > > Why has this confusing #define been introduced? Shouldn't all > occurrences of this macro be changed into the equivalent "if (entity)"? > We don't want silly macros like this in the Linux kernel. > >> +#define for_each_entity_safe(entity, parent) \ >> + for (parent = NULL; entity ; entity = parent) > > Same question here - why has this macro been introduced and how has its > name been chosen? Since this macro is used only once and since no value > is assigned to 'parent' in the code controlled by this construct, please > remove this macro and use something that is less confusing than a "for" > loop for something that is not a loop. > >> +/** >> + * bfq_weight_to_ioprio - calc an ioprio from a weight. >> + * @weight: the weight value to convert. >> + * >> + * To preserve as much as possible the old only-ioprio user interface, >> + * 0 is used as an escape ioprio value for weights (numerically) equal or >> + * larger than IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF. >> + */ >> +static unsigned short bfq_weight_to_ioprio(int weight) >> +{ >> + return IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF - weight < 0 ? >> + 0 : IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF - weight; >> +} > > Please consider using max() or max_t() to make this function less verbose. > >> + >> +/** >> + * bfq_active_extract - remove an entity from the active tree. >> + * @st: the service_tree containing the tree. >> + * @entity: the entity being removed. >> + */ >> +static void bfq_active_extract(struct bfq_service_tree *st, >> + struct bfq_entity *entity) >> +{ >> + struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); >> + struct rb_node *node; >> + >> + node = bfq_find_deepest(&entity->rb_node); >> + bfq_extract(&st->active, entity); >> + >> + if (node) >> + bfq_update_active_tree(node); >> + >> + if (bfqq) >> + list_del(&bfqq->bfqq_list); >> +} > > Which locks protect the data structures manipulated by this and other > functions? Have you considered to use lockdep_assert_held() to document > these assumptions? > >> + case (BFQ_RQ1_WRAP|BFQ_RQ2_WRAP): /* both rqs wrapped */ > > Please don't use parentheses if no confusion is possible. Additionally, > checkpatch should have requested you to insert a space before and after > the logical or operator. > >> +static void __bfq_set_in_service_queue(struct bfq_data *bfqd, >> + struct bfq_queue *bfqq) >> +{ >> + if (bfqq) { >> + bfq_mark_bfqq_budget_new(bfqq); >> + bfq_clear_bfqq_fifo_expire(bfqq); >> + >> + bfqd->budgets_assigned = (bfqd->budgets_assigned*7 + 256) / 8; > > Checkpatch should have asked you to insert spaces around the > multiplication operator. > >> +/* >> + * bfq_default_budget - return the default budget for @bfqq on @bfqd. >> + * @bfqd: the device descriptor. >> + * @bfqq: the queue to consider. >> + * >> + * We use 3/4 of the @bfqd maximum budget as the default value >> + * for the max_budget field of the queues. This lets the feedback >> + * mechanism to start from some middle ground, then the behavior >> + * of the process will drive the heuristics towards high values, if >> + * it behaves as a greedy sequential reader, or towards small values >> + * if it shows a more intermittent behavior. >> + */ >> +static unsigned long bfq_default_budget(struct bfq_data *bfqd, >> + struct bfq_queue *bfqq) >> +{ >> + unsigned long budget; >> + >> + /* >> + * When we need an estimate of the peak rate we need to avoid >> + * to give budgets that are too short due to previous measurements. >> + * So, in the first 10 assignments use a ``safe'' budget value. >> + */ >> + if (bfqd->budgets_assigned < 194 && bfqd->bfq_user_max_budget == 0) >> + budget = bfq_default_max_budget; >> + else >> + budget = bfqd->bfq_max_budget; >> + >> + return budget - budget / 4; >> +} > > Where does the magic constant "194" come from? > > >> + } else >> + /* >> + * Async queues get always the maximum possible >> + * budget, as for them we do not care about latency >> + * (in addition, their ability to dispatch is limited >> + * by the charging factor). >> + */ >> + budget = bfqd->bfq_max_budget; >> + > > Please balance braces. Checkpatch should have warned about the use of "} > else" instead of "} else {". > >> +static unsigned long bfq_calc_max_budget(u64 peak_rate, u64 timeout) >> +{ >> + unsigned long max_budget; >> + >> + /* >> + * The max_budget calculated when autotuning is equal to the >> + * amount of sectors transferred in timeout at the >> + * estimated peak rate. >> + */ >> + max_budget = (unsigned long)(peak_rate * 1000 * >> + timeout >> BFQ_RATE_SHIFT); >> + >> + return max_budget; >> +} > > Where does the constant 1000 come from? What are the units of peak_rate > and timeout? What is the maximum value of peak_rate? Can the > multiplication overflow? > >> +/* >> + * In addition to updating the peak rate, checks whether the process >> + * is "slow", and returns 1 if so. This slow flag is used, in addition >> + * to the budget timeout, to reduce the amount of service provided to >> + * seeky processes, and hence reduce their chances to lower the >> + * throughput. See the code for more details. >> + */ >> +static bool bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue *bfqq, >> + bool compensate) >> +{ >> + u64 bw, usecs, expected, timeout; >> + ktime_t delta; >> + int update = 0; >> + >> + if (!bfq_bfqq_sync(bfqq) || bfq_bfqq_budget_new(bfqq)) >> + return false; >> + >> + if (compensate) >> + delta = bfqd->last_idling_start; >> + else >> + delta = ktime_get(); >> + delta = ktime_sub(delta, bfqd->last_budget_start); >> + usecs = ktime_to_us(delta); >> + >> + /* Don't trust short/unrealistic values. */ >> + if (usecs < 100 || usecs >= LONG_MAX) >> + return false; > > If usecs >= LONG_MAX that indicates a kernel bug. Please consider > triggering a kernel warning in that case. > >> +/* >> + * Budget timeout is not implemented through a dedicated timer, but >> + * just checked on request arrivals and completions, as well as on >> + * idle timer expirations. >> + */ >> +static bool bfq_bfqq_budget_timeout(struct bfq_queue *bfqq) >> +{ >> + if (bfq_bfqq_budget_new(bfqq) || >> + time_before(jiffies, bfqq->budget_timeout)) >> + return false; >> + return true; >> +} > > Have you considered to use time_is_after_jiffies() instead of > time_before(jiffies, ...)? > >> +static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq, >> + struct bfq_io_cq *bic, pid_t pid, int is_sync) >> +{ >> + RB_CLEAR_NODE(&bfqq->entity.rb_node); >> + INIT_LIST_HEAD(&bfqq->fifo); >> + >> + bfqq->ref = 0; >> + bfqq->bfqd = bfqd; >> + >> + if (bic) >> + bfq_set_next_ioprio_data(bfqq, bic); >> + >> + if (is_sync) { >> + if (!bfq_class_idle(bfqq)) >> + bfq_mark_bfqq_idle_window(bfqq); >> + bfq_mark_bfqq_sync(bfqq); >> + } else >> + bfq_clear_bfqq_sync(bfqq); >> + >> + bfqq->ttime.last_end_request = ktime_get_ns() - (1ULL<<32); >> + >> + bfq_mark_bfqq_IO_bound(bfqq); >> + >> + bfqq->pid = pid; >> + >> + /* Tentative initial value to trade off between thr and lat */ >> + bfqq->max_budget = bfq_default_budget(bfqd, bfqq); >> + bfqq->budget_timeout = bfq_smallest_from_now(); >> + bfqq->pid = pid; >> + >> + /* first request is almost certainly seeky */ >> + bfqq->seek_history = 1; >> +} > > What is the meaning of the 1ULL << 32 constant? > >> +static int __init bfq_init(void) >> +{ >> + int ret; >> + char msg[50] = "BFQ I/O-scheduler: v0"; > > Please leave out "[50]" and use "static const char" instead of "char". > >> diff --git a/block/elevator.c b/block/elevator.c >> index 01139f5..786fdcd 100644 >> --- a/block/elevator.c >> +++ b/block/elevator.c >> @@ -221,14 +221,20 @@ int elevator_init(struct request_queue *q, char *name) >> >> if (!e) { >> /* >> - * For blk-mq devices, we default to using mq-deadline, >> - * if available, for single queue devices. If deadline >> - * isn't available OR we have multiple queues, default >> - * to "none". >> + * For blk-mq devices, we default to using bfq, if >> + * available, for single queue devices. If bfq isn't >> + * available, we try mq-deadline. If neither is >> + * available, OR we have multiple queues, default to >> + * "none". >> */ >> if (q->mq_ops) { >> + if (q->nr_hw_queues == 1) { >> + e = elevator_get("bfq", false); >> + if (!e) >> + e = elevator_get("mq-deadline", false); >> + } >> if (q->nr_hw_queues == 1) >> - e = elevator_get("mq-deadline", false); >> + e = elevator_get("bfq", false); >> if (!e) >> return 0; >> } else >> > > As Jens wrote, it's way too early to make BFQ the default scheduler. > > Bart.