Re: [PATCH v3 16/16] block/mq-deadline: Prioritize high-priority requests

On Thu, Jun 17, 2021 at 05:44:56PM -0700, Bart Van Assche wrote:
> While one or more requests with a certain I/O priority are pending, do not
> dispatch lower priority requests. Dispatch lower priority requests anyway
> after the "aging" time has expired.
> 
> This patch has been tested as follows:
> 
> modprobe scsi_debug ndelay=1000000 max_queue=16 &&
> sd='' &&
> while [ -z "$sd" ]; do
>   sd=/dev/$(basename /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/*)
> done &&
> echo $((100*1000)) > /sys/block/$sd/queue/iosched/aging_expire &&
> cd /sys/fs/cgroup/blkio/ &&
> echo $$ >cgroup.procs &&
> echo restrict-to-be >blkio.prio.class &&
> mkdir -p hipri &&
> cd hipri &&
> echo none-to-rt >blkio.prio.class &&
> { max-iops -a1 -d32 -j1 -e mq-deadline $sd >& ~/low-pri.txt & } &&
> echo $$ >cgroup.procs &&
> max-iops -a1 -d32 -j1 -e mq-deadline $sd >& ~/hi-pri.txt
> 
> Result:
> * 11000 IOPS for the high-priority job
> *    40 IOPS for the low-priority job
> 
> If the aging expiry time is changed from 100s into 0, the IOPS results change
> into 6712 and 6796 IOPS.
> 
> The max-iops script is a script that runs fio with the following arguments:
> --bs=4K --gtod_reduce=1 --ioengine=libaio --ioscheduler=${arg_e} --runtime=60
> --norandommap --rw=read --thread --buffered=0 --numjobs=${arg_j}
> --iodepth=${arg_d} --iodepth_batch_submit=${arg_a}
> --iodepth_batch_complete=$((arg_d / 2)) --name=${positional_argument_1}
> --filename=${positional_argument_1}
> 
> Reviewed-by: Damien Le Moal <damien.lemoal@xxxxxxx>
> Cc: Hannes Reinecke <hare@xxxxxxx>
> Cc: Christoph Hellwig <hch@xxxxxx>
> Cc: Ming Lei <ming.lei@xxxxxxxxxx>
> Cc: Johannes Thumshirn <johannes.thumshirn@xxxxxxx>
> Cc: Himanshu Madhani <himanshu.madhani@xxxxxxxxxx>
> Signed-off-by: Bart Van Assche <bvanassche@xxxxxxx>
> ---

(snip)

> @@ -484,15 +495,32 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
>  static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
>  {
>  	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
> -	struct request *rq;
> +	const u64 now_ns = ktime_get_ns();
> +	struct request *rq = NULL;
>  	enum dd_prio prio;
>  
>  	spin_lock(&dd->lock);
> -	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
> -		rq = __dd_dispatch_request(dd, &dd->per_prio[prio]);
> +	/*
> +	 * Start with dispatching requests whose deadline expired more than
> +	 * aging_expire jiffies ago.
> +	 */
> +	for (prio = DD_BE_PRIO; prio <= DD_PRIO_MAX; prio++) {
> +		rq = __dd_dispatch_request(dd, &dd->per_prio[prio], now_ns -
> +					   jiffies_to_nsecs(dd->aging_expire));
>  		if (rq)
> +			goto unlock;
> +	}
> +	/*
> +	 * Next, dispatch requests in priority order. Ignore lower priority
> +	 * requests if any higher priority requests are pending.
> +	 */
> +	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
> +		rq = __dd_dispatch_request(dd, &dd->per_prio[prio], now_ns);
> +		if (rq || dd_queued(dd, prio))
>  			break;
>  	}

Hello Bart,

I'm seeing some unwanted behavior since this patch was introduced.

I have two processes submitting I/O, one with I/O class RT
and one with I/O class IDLE.

What happens is that in the "Next, dispatch requests in priority order."
for-loop, even when there are no RT requests left to dispatch and none in
flight, the IDLE requests are still not dispatched, because dd_queued()
reports a bogus number of queued BE requests and the loop therefore never
reaches the IDLE priority level.

Here are some trace_printk's that I've added:

             fio-1493    [029] ...1   254.606782: dd_insert_requests: dd prio: 0 prio class: 1
First RT req inserted.

             fio-1493    [029] ...3   254.606786: dd_dispatch_request: checking dd_prio: 0
             fio-1493    [029] ...3   254.606786: __dd_dispatch_request: dd prio: 0 prio class: 1
The first RT req gets dispatched immediately.

             fio-1493    [029] ...3   254.606791: dd_dispatch_request: rq ? ffff888144ff6800 queued ? 1
             fio-1493    [029] ...1   254.607005: dd_insert_requests: dd prio: 2 prio class: 3
First IDLE req inserted. Will be put on hold (for a max of aging_expire).

             fio-1493    [029] ...3   254.607007: dd_dispatch_request: checking dd_prio: 0
             fio-1493    [029] ...3   254.607008: dd_dispatch_request: rq ? 0000000000000000 queued ? 1
             fio-1493    [029] ...1   254.607132: dd_insert_requests: dd prio: 2 prio class: 3
             fio-1493    [029] ...3   254.607134: dd_dispatch_request: checking dd_prio: 0
             fio-1493    [029] ...3   254.607134: dd_dispatch_request: rq ? 0000000000000000 queued ? 1
             fio-1493    [029] ...1   254.607261: dd_insert_requests: dd prio: 0 prio class: 1
Second RT req is inserted.

             fio-1493    [029] ...3   254.607262: dd_dispatch_request: checking dd_prio: 0
             fio-1493    [029] ...3   254.607263: __dd_dispatch_request: dd prio: 0 prio class: 1
Second RT req gets dispatched immediately.

             fio-1493    [029] ...3   254.607263: dd_dispatch_request: rq ? ffff888144ff71c0 queued ? 2
             fio-1493    [029] ...3   254.607264: dd_dispatch_request: checking dd_prio: 0
             fio-1493    [029] ...3   254.607265: dd_dispatch_request: rq ? 0000000000000000 queued ? 2
             fio-1493    [029] ...3   254.607274: dd_dispatch_request: checking dd_prio: 0
             fio-1493    [029] ...3   254.607274: dd_dispatch_request: rq ? 0000000000000000 queued ? 2
             fio-1493    [029] ...1   254.607439: dd_insert_requests: dd prio: 2 prio class: 3
             fio-1493    [029] ...3   254.607440: dd_dispatch_request: checking dd_prio: 0
             fio-1493    [029] ...3   254.607441: dd_dispatch_request: rq ? 0000000000000000 queued ? 2
             fio-1493    [029] ...1   254.607618: dd_insert_requests: dd prio: 2 prio class: 3
             fio-1493    [029] ...3   254.607619: dd_dispatch_request: checking dd_prio: 0
             fio-1493    [029] ...3   254.607620: dd_dispatch_request: rq ? 0000000000000000 queued ? 2
             fio-1493    [029] ...1   254.607789: dd_insert_requests: dd prio: 2 prio class: 3
             fio-1493    [029] ...3   254.607790: dd_dispatch_request: checking dd_prio: 0
             fio-1493    [029] ...3   254.607791: dd_dispatch_request: rq ? 0000000000000000 queued ? 2
             fio-1493    [029] ..s.   254.607860: dd_finish_request: dd prio: 0 prio class: 1
First RT req is finished. IDLE requests are still on hold, since the second RT req hasn't finished yet.

             fio-1493    [029] ...1   254.607978: dd_insert_requests: dd prio: 2 prio class: 3
             fio-1493    [029] ...3   254.607980: dd_dispatch_request: checking dd_prio: 0
             fio-1493    [029] ...3   254.607980: dd_dispatch_request: rq ? 0000000000000000 queued ? 1
             fio-1493    [029] ...1   254.608021: dd_insert_requests: dd prio: 2 prio class: 3
             fio-1493    [029] ...3   254.608022: dd_dispatch_request: checking dd_prio: 0
             fio-1493    [029] ...3   254.608023: dd_dispatch_request: rq ? 0000000000000000 queued ? 1
          <idle>-0       [029] .Ns1   254.608657: dd_finish_request: dd prio: 0 prio class: 1
Second RT req is finished.

             fio-1493    [029] ...1   254.608672: dd_insert_requests: dd prio: 2 prio class: 3
             fio-1493    [029] ...3   254.608674: dd_dispatch_request: checking dd_prio: 0
There are no RT reqs waiting to get dispatched, and no RT reqs in flight.

             fio-1493    [029] ...3   254.608675: dd_dispatch_request: checking dd_prio: 1
Therefore the loop continues to examine the next dd_prio (BE).

             fio-1493    [029] ...3   254.608675: dd_dispatch_request: rq ? 0000000000000000 queued ? 4294967100
There are no BE reqs waiting to get dispatched; however, dd_queued() claims that there are 4294967100 BE reqs in flight.

Since dd_queued() says that there are BE reqs in flight, we never continue the loop to examine the IDLE reqs.
(This number is obviously bogus.)
If there hadn't been any BE reqs reported as in flight, the loop would have moved on and dispatched
the IDLE reqs at this point in time, since there aren't any RT reqs left to handle.

   kworker/29:1H-464     [029] ...2   254.612489: dd_dispatch_request: checking dd_prio: 0
   kworker/29:1H-464     [029] ...2   254.612490: dd_dispatch_request: checking dd_prio: 1
   kworker/29:1H-464     [029] ...2   254.612491: dd_dispatch_request: rq ? 0000000000000000 queued ? 4294967100
   kworker/29:1H-464     [029] ...2   254.616483: dd_dispatch_request: checking dd_prio: 0
   kworker/29:1H-464     [029] ...2   254.616484: dd_dispatch_request: checking dd_prio: 1
   kworker/29:1H-464     [029] ...2   254.616485: dd_dispatch_request: rq ? 0000000000000000 queued ? 4294967100
   kworker/29:1H-464     [029] ...2   254.620479: dd_dispatch_request: checking dd_prio: 0
   kworker/29:1H-464     [029] ...2   254.620480: dd_dispatch_request: checking dd_prio: 1
   kworker/29:1H-464     [029] ...2   254.620481: dd_dispatch_request: rq ? 0000000000000000 queued ? 4294967100
   kworker/29:1H-464     [029] ...2   254.624481: dd_dispatch_request: checking dd_prio: 0
   kworker/29:1H-464     [029] ...2   254.624482: dd_dispatch_request: checking dd_prio: 1
   kworker/29:1H-464     [029] ...2   254.624482: dd_dispatch_request: rq ? 0000000000000000 queued ? 4294967100
   kworker/29:1H-464     [029] ...2   254.628478: dd_dispatch_request: checking dd_prio: 0
   kworker/29:1H-464     [029] ...2   254.628479: dd_dispatch_request: checking dd_prio: 1
   kworker/29:1H-464     [029] ...2   254.628480: dd_dispatch_request: rq ? 0000000000000000 queued ? 4294967100
   kworker/29:1H-464     [029] ...2   254.632478: dd_dispatch_request: checking dd_prio: 0
   kworker/29:1H-464     [029] ...2   254.632479: dd_dispatch_request: checking dd_prio: 1
   kworker/29:1H-464     [029] ...2   254.632480: dd_dispatch_request: rq ? 0000000000000000 queued ? 4294967100
   kworker/29:1H-464     [029] ...2   254.636477: dd_dispatch_request: checking dd_prio: 0
   kworker/29:1H-464     [029] ...2   254.636477: dd_dispatch_request: checking dd_prio: 1

...

   kworker/29:1H-464     [029] ...2   264.607254: __dd_dispatch_request: dd prio: 2 prio class: 3
Instead, because of the broken accounting of BE reqs, we have to wait 10 seconds,
until the aging_expire timeout occurs, before the first IDLE request is finally dispatched.
If it weren't for the broken BE accounting, this first IDLE req would have been dispatched
as soon as the second RT req finished.



So, what is wrong here?

1) The "Next, dispatch requests in priority order." for-loop does:

 if (rq || dd_queued(dd, prio))

to decide whether it should move on to the next priority class or not.

dd_queued() calls dd_sum() which has this comment:

/*                                                                                          
 * Returns the total number of dd_count(dd, event_type, prio) calls across all              
 * CPUs. No locking or barriers since it is fine if the returned sum is slightly            
 * outdated.                                                                                
 */

Perhaps it is not such a good idea to use accounting that is not accurate to
determine whether we should process I/Os belonging to a certain priority class
or not.

Perhaps we could use e.g. atomics instead of unlocked per-CPU counters?
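
Just to illustrate what I mean, here is a rough, untested sketch; the stats
struct and field names below are made up by me, only dd_queued() and the
priority levels come from your series:

struct dd_prio_stats {
	atomic_t inserted;
	atomic_t completed;
};

struct deadline_data {
	/* ... existing members ... */
	struct dd_prio_stats stats[DD_PRIO_MAX + 1];
};

/* in dd_insert_request(): */
	atomic_inc(&dd->stats[prio].inserted);

/* in dd_finish_request(): */
	atomic_inc(&dd->stats[prio].completed);

/* Exact number of requests owned by the scheduler for this prio. */
static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
{
	return atomic_read(&dd->stats[prio].inserted) -
	       atomic_read(&dd->stats[prio].completed);
}

A single atomic per event type would make dd_queued() exact, at the cost of
some cache line bouncing in the hot path, which I assume is why the per-CPU
counters were chosen in the first place.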


2) I added even more trace_printk's and I saw this:

#                                _-----=> irqs-off
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| /     delay
#           TASK-PID     CPU#  ||||   TIMESTAMP  FUNCTION
#              | |         |   ||||      |         |
  kworker/u64:11-628     [026] ....    13.650123: dd_finish_request: dd prio: 1 prio class: 0
  kworker/u64:11-628     [026] ....    13.650125: dd_queued_print: ins: 0 comp: 1 queued: 4294967295
   kworker/u64:3-289     [023] ....    13.650217: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:3-289     [023] ....    13.650219: dd_queued_print: ins: 0 comp: 1 queued: 4294967295
   kworker/u64:0-7       [030] ....    13.650226: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:0-7       [030] ....    13.650229: dd_queued_print: ins: 0 comp: 1 queued: 4294967295
   kworker/u64:2-288     [014] ....    13.650365: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:2-288     [014] ....    13.650367: dd_queued_print: ins: 0 comp: 1 queued: 4294967295
   kworker/u64:8-294     [024] ....    13.650405: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:8-294     [024] ....    13.650407: dd_queued_print: ins: 0 comp: 1 queued: 4294967295
   kworker/u64:9-296     [018] ....    13.650418: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:9-296     [018] ....    13.650420: dd_queued_print: ins: 0 comp: 1 queued: 4294967295
   kworker/u64:7-293     [031] ....    13.650444: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:7-293     [031] ....    13.650446: dd_queued_print: ins: 0 comp: 1 queued: 4294967295
   kworker/u64:4-290     [001] ....    13.650491: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:4-290     [001] ....    13.650493: dd_queued_print: ins: 0 comp: 1 queued: 4294967295
   kworker/u64:8-294     [024] ....    13.650588: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:8-294     [024] ....    13.650588: dd_queued_print: ins: 0 comp: 2 queued: 4294967294
   kworker/u64:2-288     [014] ....    13.650621: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:2-288     [014] ....    13.650622: dd_queued_print: ins: 0 comp: 2 queued: 4294967294
   kworker/u64:7-293     [031] ....    13.650642: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:7-293     [031] ....    13.650643: dd_queued_print: ins: 0 comp: 2 queued: 4294967294
   kworker/u64:9-296     [018] ....    13.650653: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:9-296     [018] ....    13.650654: dd_queued_print: ins: 0 comp: 2 queued: 4294967294
   kworker/u64:4-290     [001] ....    13.650690: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:4-290     [001] ....    13.650691: dd_queued_print: ins: 0 comp: 2 queued: 4294967294
   kworker/u64:8-294     [024] ....    13.650764: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:8-294     [024] ....    13.650765: dd_queued_print: ins: 0 comp: 3 queued: 4294967293
   kworker/u64:2-288     [014] ....    13.650774: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:2-288     [014] ....    13.650774: dd_queued_print: ins: 0 comp: 3 queued: 4294967293
   kworker/u64:9-296     [018] ....    13.650800: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:9-296     [018] ....    13.650801: dd_queued_print: ins: 0 comp: 3 queued: 4294967293
   kworker/u64:7-293     [031] ....    13.650833: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:7-293     [031] ....    13.650834: dd_queued_print: ins: 0 comp: 3 queued: 4294967293
   kworker/u64:8-294     [024] ....    13.650865: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:8-294     [024] ....    13.650866: dd_queued_print: ins: 0 comp: 4 queued: 4294967292
...
   kworker/u64:2-288     [014] ....    13.655333: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:2-288     [014] ....    13.655334: dd_queued_print: ins: 0 comp: 19 queued: 4294967277
   kworker/u64:9-296     [018] ....    13.655382: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:9-296     [018] ....    13.655383: dd_queued_print: ins: 0 comp: 19 queued: 4294967277
   kworker/u64:4-290     [001] ...1    13.655564: dd_insert_requests: dd prio: 1 prio class: 0
Here comes the first insert...

   kworker/u64:4-290     [001] ...1    13.655565: dd_queued_print: ins: 1 comp: 12 queued: 4294967285
   kworker/u64:9-296     [018] ...1    13.655565: dd_insert_requests: dd prio: 1 prio class: 0
   kworker/u64:2-288     [014] ...1    13.655565: dd_insert_requests: dd prio: 1 prio class: 0
   kworker/u64:9-296     [018] ...1    13.655566: dd_queued_print: ins: 1 comp: 19 queued: 4294967278
   kworker/u64:2-288     [014] ...1    13.655566: dd_queued_print: ins: 1 comp: 19 queued: 4294967278
   kworker/u64:8-294     [024] ....    13.655662: dd_finish_request: dd prio: 1 prio class: 0

What appears to be happening here is that dd_finish_request() gets called a bunch of times
without any preceding dd_insert_requests() call.

Reading the comment above dd_finish_request():

 * Callback from inside blk_mq_free_request().

Could it be that this callback is invoked for certain requests that were never
sent down to mq-deadline?
Perhaps blk_mq_request_bypass_insert() or blk_mq_try_issue_directly() was
called, so dd_insert_requests() was never called for some of the
earliest requests in the system, but since e->type->ops.finish_request() is
set, dd_finish_request() gets called on free anyway.
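
If that is what is going on, maybe the completion accounting could simply be
skipped for requests that never went through the scheduler. An untested idea
follows; the marker in elv.priv[0] is my assumption (any per-request flag set
at insert time would do), and dd_count()/dd_rq_ioclass()/ioprio_class_to_prio[]
are the helpers from this patch series as far as I can tell:

/* in dd_insert_request(), where the inserted counter is bumped: */
	rq->elv.priv[0] = dd;	/* mark: this rq was inserted into mq-deadline */
	dd_count(dd, inserted, prio);

/* in dd_finish_request(): */
static void dd_finish_request(struct request *rq)
{
	struct deadline_data *dd = rq->q->elevator->elevator_data;
	const enum dd_prio prio = ioprio_class_to_prio[dd_rq_ioclass(rq)];

	/* Skip requests that bypassed dd_insert_request(). */
	if (rq->elv.priv[0] != dd)
		return;

	dd_count(dd, completed, prio);
}

That would at least keep the inserted and completed counters balanced for
each priority level.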

Since dd_queued() is defined as:
	return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
And since we can see that we have several calls to dd_finish_request()
that have increased the completed counter without any matching inserts, dd_queued() returns a
very high value, since 0 - 19 wraps around to 4294967277 in unsigned arithmetic.
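
Just to spell out that wraparound with a trivial userspace snippet:

#include <stdio.h>

int main(void)
{
	unsigned int inserted = 0, completed = 19;

	/* unsigned subtraction wraps modulo 2^32 (32-bit int assumed) */
	printf("%u\n", inserted - completed);	/* prints 4294967277 */
	return 0;
}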

This is probably the bug that causes the bogus accounting of BE reqs.
However, looking at the comment for dd_sum(), it also doesn't feel good
to rely on something that is "slightly outdated" to determine whether we
should process a whole I/O class or not.
Letting requests wait for 10 seconds when there are no other outstanding
requests in the scheduler doesn't seem like the right thing to do.

Kind regards,
Niklas


