Re: slow IO on USB media

Ming Lei <ming.lei@xxxxxxxxxx> · Wed, 1 Jan 2020 21:53:31 +0800

On Wed, Jan 01, 2020 at 03:43:10PM +0800, Hillf Danton wrote:
> 
> On Thu, 26 Dec 2019 16:37:06 +0800 Ming Lei wrote:
> > On Wed, Dec 25, 2019 at 10:30:57PM -0500, Theodore Y. Ts'o wrote:
> > > On Thu, Dec 26, 2019 at 10:27:02AM +0800, Ming Lei wrote:
> > > > Maybe we need to be careful for HDD., since the request count in scheduler
> > > > queue is double of in-flight request count, and in theory NCQ should only
> > > > cover all in-flight 32 requests. I will find a sata HDD., and see if
> > > > performance drop can be observed in the similar 'cp' test.
> > >
> > > Please try to measure it, but I'd be really surprised if it's
> > > significant with with modern HDD's.
> > 
> > Just find one machine with AHCI SATA, and run the following xfs
> > overwrite test:
> > 
> > #!/bin/bash
> > DIR=$1
> > echo 3 > /proc/sys/vm/drop_caches
> > fio --readwrite=write --filesize=5g --overwrite=1 --filename=$DIR/fiofile \
> >         --runtime=60s --time_based --ioengine=psync --direct=0 --bs=4k
> > 		--iodepth=128 --numjobs=2 --group_reporting=1 --name=overwrite
> > 
> > FS is xfs, and disk is LVM over AHCI SATA with NCQ(depth 32), because the
> > machine is picked up from RH beaker, and it is the only disk in the box.
> > 
> > #lsblk
> > NAME                            MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
> > sda                               8:0    0 931.5G  0 disk
> > =E2=94=9C=E2=94=80sda1                            8:1    0     1G  0 part /boot
> > =E2=94=94=E2=94=80sda2                            8:2    0 930.5G  0 part
> >   =E2=94=9C=E2=94=80rhel_hpe--ml10gen9--01-root 253:0    0    50G  0 lvm  /
> >   =E2=94=9C=E2=94=80rhel_hpe--ml10gen9--01-swap 253:1    0   3.9G  0 lvm  [SWAP]
> >   =E2=94=94=E2=94=80rhel_hpe--ml10gen9--01-home 253:2    0 876.6G  0 lvm  /home
> > 
> > 
> > kernel: 3a7ea2c483a53fc("scsi: provide mq_ops->busy() hook") which is
> > the previous commit of f664a3cc17b7 ("scsi: kill off the legacy IO path").
> > 
> >             |scsi_mod.use_blk_mq=N |scsi_mod.use_blk_mq=Y |
> > -----------------------------------------------------------
> > throughput: |244MB/s               |169MB/s               |
> > -----------------------------------------------------------
> > 
> > Similar result can be observed on v5.4 kernel(184MB/s) with same test
> > steps.
> 
> 
> The simple diff makes direct issue of requests take pending requests
> also into account and goes the nornal enqueue-and-dequeue path if any
> pending requests exist.
> 
> Then it sorts requests regardless of the number of hard queues in a
> bid to make requests as sequencial as they are. Let's see if the
> added sorting cost can make any sense.
> 
> --->8---
> 
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -410,6 +410,11 @@ run:
>  		blk_mq_run_hw_queue(hctx, async);
>  }
>  
> +static inline bool blk_mq_sched_hctx_has_pending_rq(struct blk_mq_hw_ctx *hctx)
> +{
> +	return sbitmap_any_bit_set(&hctx->ctx_map);
> +}
> +
>  void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx,
>  				  struct blk_mq_ctx *ctx,
>  				  struct list_head *list, bool run_queue_async)
> @@ -433,7 +438,8 @@ void blk_mq_sched_insert_requests(struct
>  		 * busy in case of 'none' scheduler, and this way may save
>  		 * us one extra enqueue & dequeue to sw queue.
>  		 */
> -		if (!hctx->dispatch_busy && !e && !run_queue_async) {
> +		if (!hctx->dispatch_busy && !e && !run_queue_async &&
> +		    !blk_mq_sched_hctx_has_pending_rq(hctx)) {
>  			blk_mq_try_issue_list_directly(hctx, list);
>  			if (list_empty(list))
>  				goto out;
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1692,7 +1692,7 @@ void blk_mq_flush_plug_list(struct blk_p
>  
>  	list_splice_init(&plug->mq_list, &list);
>  
> -	if (plug->rq_count > 2 && plug->multiple_queues)
> +	if (plug->rq_count > 1)
>  		list_sort(NULL, &list, plug_rq_cmp);
>  
>  	plug->rq_count = 0;

I guess you may not understand the reason, and the issue is related
with neither MQ nor plug.

AHCI/SATA is single queue drive, and for HDD. IO throughput is very
sensitive with IO order in case of sequential IO.

Legacy IO path supports ioc batching and BDI queue congestion. When
there are more than one writeback IO paths, there may be only one
active IO submission path, meantime others are blocked attributed to
ioc batching, so writeback IO is still dispatched to disk in strict
IO order.

But ioc batching and BDI queue congestion is killed when converting to
blk-mq.

Please see the following IO trace with legacy IO request path:

https://lore.kernel.org/linux-scsi/f82fd5129e3dcacae703a689be60b20a7fedadf6.camel@xxxxxxxx/2-log_ming_20191128_182751.zip

Thanks,
Ming