On 2022/12/20 18:50, Paolo Valente wrote: > Single-LUN multi-actuator SCSI drives, as well as all multi-actuator > SATA drives appear as a single device to the I/O subsystem [1]. Yet > they address commands to different actuators internally, as a function > of Logical Block Addressing (LBAs). A given sector is reachable by > only one of the actuators. For example, Seagate’s Serial Advanced > Technology Attachment (SATA) version contains two actuators and maps > the lower half of the SATA LBA space to the lower actuator and the > upper half to the upper actuator. > > Evidently, to fully utilize actuators, no actuator must be left idle > or underutilized while there is pending I/O for it. The block layer > must somehow control the load of each actuator individually. This > commit lays the ground for allowing BFQ to provide such a per-actuator > control. > > BFQ associates an I/O-request sync bfq_queue with each process doing > synchronous I/O, or with a group of processes, in case of queue > merging. Then BFQ serves one bfq_queue at a time. While in service, a > bfq_queue is emptied in request-position order. Yet the same process, > or group of processes, may generate I/O for different actuators. In > this case, different streams of I/O (each for a different actuator) > get all inserted into the same sync bfq_queue. So there is basically > no individual control on when each stream is served, i.e., on when the > I/O requests of the stream are picked from the bfq_queue and > dispatched to the drive. > > This commit enables BFQ to control the service of each actuator > individually for synchronous I/O, by simply splitting each sync > bfq_queue into N queues, one for each actuator. In other words, a sync > bfq_queue is now associated to a pair (process, actuator). As a > consequence of this split, the per-queue proportional-share policy > implemented by BFQ will guarantee that the sync I/O generated for each > actuator, by each process, receives its fair share of service. > > This is just a preparatory patch. If the I/O of the same process > happens to be sent to different queues, then each of these queues may > undergo queue merging. To handle this event, the bfq_io_cq data > structure must be properly extended. In addition, stable merging must > be disabled to avoid loss of control on individual actuators. Finally, > also async queues must be split. These issues are described in detail > and addressed in next commits. As for this commit, although multiple > per-process bfq_queues are provided, the I/O of each process or group > of processes is still sent to only one queue, regardless of the > actuator the I/O is for. The forwarding to distinct bfq_queues will be > enabled after addressing the above issues. > > [1] https://www.linaro.org/blog/budget-fair-queueing-bfq-linux-io-scheduler-optimizations-for-multi-actuator-sata-hard-drives/ > > Signed-off-by: Gabriele Felici <felicigb@xxxxxxxxx> > Signed-off-by: Carmine Zaccagnino <carmine@xxxxxxxxxxxxxxx> > Signed-off-by: Paolo Valente <paolo.valente@xxxxxxxxxx> > --- [...] If you want me to review, can you please add me to the CC list of this series ? I am only getting the patches that I already reviewed, which is not very useful... > @@ -672,9 +682,9 @@ static void bfq_limit_depth(blk_opf_t opf, struct blk_mq_alloc_data *data) > { > struct bfq_data *bfqd = data->q->elevator->elevator_data; > struct bfq_io_cq *bic = bfq_bic_lookup(data->q); > - struct bfq_queue *bfqq = bic ? bic_to_bfqq(bic, op_is_sync(opf)) : NULL; > int depth; > unsigned limit = data->q->nr_requests; > + unsigned int act_idx; > > /* Sync reads have full depth available */ > if (op_is_sync(opf) && !op_is_write(opf)) { > @@ -684,14 +694,21 @@ static void bfq_limit_depth(blk_opf_t opf, struct blk_mq_alloc_data *data) > limit = (limit * depth) >> bfqd->full_depth_shift; > } > > - /* > - * Does queue (or any parent entity) exceed number of requests that > - * should be available to it? Heavily limit depth so that it cannot > - * consume more available requests and thus starve other entities. > - */ > - if (bfqq && bfqq_request_over_limit(bfqq, limit)) > - depth = 1; > + for (act_idx = 0; act_idx < bfqd->num_actuators; act_idx++) { > + struct bfq_queue *bfqq = > + bic ? bic_to_bfqq(bic, op_is_sync(opf), act_idx) : NULL; Ignored my comment again... Oh well. If you prefer the code this way... I do not find it pretty nor solid as is though. > > + /* > + * Does queue (or any parent entity) exceed number of > + * requests that should be available to it? Heavily > + * limit depth so that it cannot consume more > + * available requests and thus starve other entities. > + */ > + if (bfqq && bfqq_request_over_limit(bfqq, limit)) { > + depth = 1; > + break; > + } > + } > bfq_log(bfqd, "[%s] wr_busy %d sync %d depth %u", > __func__, bfqd->wr_busy_queues, op_is_sync(opf), depth); > if (depth) [...] > -static void bfq_exit_icq_bfqq(struct bfq_io_cq *bic, bool is_sync) > +static void bfq_exit_icq_bfqq(struct bfq_io_cq *bic, bool is_sync, > + unsigned int actuator_idx) > { > - struct bfq_queue *bfqq = bic_to_bfqq(bic, is_sync); > + struct bfq_queue *bfqq = bic_to_bfqq(bic, is_sync, actuator_idx); > struct bfq_data *bfqd; > > if (bfqq) With your current bic_to_bfqq() implementation, you will *never* get NULL as a return value. So why is this if necessary ? > bfqd = bfqq->bfqd; /* NULL if scheduler already exited */ > > if (bfqq && bfqd) { > - unsigned long flags; > - > - spin_lock_irqsave(&bfqd->lock, flags); > bfqq->bic = NULL; > bfq_exit_bfqq(bfqd, bfqq); > - bic_set_bfqq(bic, NULL, is_sync); > - spin_unlock_irqrestore(&bfqd->lock, flags); > + bic_set_bfqq(bic, NULL, is_sync, actuator_idx); > } > } > > static void bfq_exit_icq(struct io_cq *icq) > { > struct bfq_io_cq *bic = icq_to_bic(icq); > + struct bfq_data *bfqd = bic_to_bfqd(bic); > + unsigned long flags; > + unsigned int act_idx; > + /* > + * If bfqd and thus bfqd->num_actuators is not available any > + * longer, then cycle over all possible per-actuator bfqqs in > + * next loop. We rely on bic being zeroed on creation, and > + * therefore on its unused per-actuator fields being NULL. > + */ > + unsigned int num_actuators = BFQ_MAX_ACTUATORS; > > - if (bic->stable_merge_bfqq) { > - struct bfq_data *bfqd = bic->stable_merge_bfqq->bfqd; > + /* > + * bfqd is NULL if scheduler already exited, and in that case > + * this is the last time these queues are accessed. > + */ > + if (bfqd) { Same here. bfqd can never be NULL. Or I am really missing something... Lots of other places like this where checking bic_to_bfqd() seems unnecessary. > + spin_lock_irqsave(&bfqd->lock, flags); > + num_actuators = bfqd->num_actuators; > + } > > - /* > - * bfqd is NULL if scheduler already exited, and in > - * that case this is the last time bfqq is accessed. > - */ > - if (bfqd) { > - unsigned long flags; > + if (bic->stable_merge_bfqq) > + bfq_put_stable_ref(bic->stable_merge_bfqq); > > - spin_lock_irqsave(&bfqd->lock, flags); > - bfq_put_stable_ref(bic->stable_merge_bfqq); > - spin_unlock_irqrestore(&bfqd->lock, flags); > - } else { > - bfq_put_stable_ref(bic->stable_merge_bfqq); > - } > + for (act_idx = 0; act_idx < num_actuators; act_idx++) { > + bfq_exit_icq_bfqq(bic, true, act_idx); > + bfq_exit_icq_bfqq(bic, false, act_idx); > } > > - bfq_exit_icq_bfqq(bic, true); > - bfq_exit_icq_bfqq(bic, false); > + if (bfqd) > + spin_unlock_irqrestore(&bfqd->lock, flags); > } -- Damien Le Moal Western Digital Research