Re: [PATCH 0/4] blk-mq: support to use hw tag for scheduling

Ming Lei <ming.lei@xxxxxxxxxx> · Wed, 3 May 2017 23:38:09 +0800



On Wed, May 03, 2017 at 09:08:34AM -0600, Jens Axboe wrote:
> On 05/03/2017 09:03 AM, Ming Lei wrote:
> > On Wed, May 03, 2017 at 08:10:58AM -0600, Jens Axboe wrote:
> >> On 05/03/2017 08:08 AM, Jens Axboe wrote:
> >>> On 05/02/2017 10:03 PM, Ming Lei wrote:
> >>>> On Fri, Apr 28, 2017 at 02:29:18PM -0600, Jens Axboe wrote:
> >>>>> On 04/28/2017 09:15 AM, Ming Lei wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> This patchset introduces the BLK_MQ_F_SCHED_USE_HW_TAG flag and
> >>>>>> allows hardware tags to be used directly for I/O scheduling if the
> >>>>>> queue depth is big enough. This avoids allocating extra tags and a
> >>>>>> request pool for the I/O scheduler, and saves the scheduler tag
> >>>>>> allocation/release in the I/O submit path.
> >>>>>
> >>>>> Ming, I like this approach, it's pretty clean. It'd be nice to have a
> >>>>> bit of performance data to back up that it's useful to add this code,
> >>>>> though.  Have you run anything on eg kyber on nvme that shows a
> >>>>> reduction in overhead when getting rid of separate scheduler tags?
> >>>>
> >>>> I can observe a small improvement in the following tests:
> >>>>
> >>>> 1) fio script
> >>>> # io scheduler: kyber
> >>>>
> >>>> RWS="randread read randwrite write"
> >>>> for RW in $RWS; do
> >>>>         echo "Running test $RW"
> >>>>         # the redirect must run as root too, so pipe through "sudo tee"
> >>>>         echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
> >>>>         sudo fio --direct=1 --size=128G --bsrange=4k-4k --runtime=20 --numjobs=1 --ioengine=libaio --iodepth=10240 --group_reporting=1 --filename=$DISK --name=$DISK-test-$RW --rw=$RW --output-format=json
> >>>> done
> >>>>
> >>>> 2) results
> >>>>
> >>>> -----------------------------------------------------------------
> >>>>           | sched tag (iops/lat) | use hw tag to sched (iops/lat)
> >>>> -----------------------------------------------------------------
> >>>> randread  | 188940/54107         | 193865/52734
> >>>> -----------------------------------------------------------------
> >>>> read      | 192646/53069         | 199738/51188
> >>>> -----------------------------------------------------------------
> >>>> randwrite | 171048/59777         | 179038/57112
> >>>> -----------------------------------------------------------------
> >>>> write     | 171886/59492         | 181029/56491
> >>>> -----------------------------------------------------------------
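
For reference, the relative IOPS gain can be worked out from the table
above. This is only a minimal sketch, assuming a POSIX awk is available;
the helper name iops_gain is made up for illustration and is not part of
the patchset or of fio.

```shell
# Print the relative IOPS gain of "use hw tag to sched" over "sched tag".
# Input columns: workload, sched-tag IOPS, hw-tag IOPS (copied from the table).
# iops_gain is a hypothetical helper name used only for this sketch.
iops_gain() {
    awk '{ printf "%-10s %+.2f%%\n", $1, ($3 - $2) * 100 / $2 }'
}

iops_gain <<'EOF'
randread  188940 193865
read      192646 199738
randwrite 171048 179038
write     171886 181029
EOF
```

With these numbers the gain comes out at roughly 2.6% (randread) to 5.3%
(write), which matches the "small improvement" described above.
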
> >>>>
> >>>> I guess the difference may be a bit more obvious when running the test
> >>>> on a slow NVMe device; I will try to find one and run the test again.
> >>>
> >>> Thanks for running that. As I said in my original reply, I think this
> >>> is a good optimization, and the implementation is clean. I'm fine with
> >>> the current limitations of when to enable it, and it's not like we
> >>> can't extend this later, if we want.
> >>>
> >>> I do agree with Bart that patch 1+4 should be combined. I'll do that.
> >>
> >> Actually, can you do that when reposting? Looks like you needed to
> >> do that anyway.
> > 
> > Yeah, I will do that in V1.
> 
> V2? :-)
> 
> Sounds good. I just wanted to check the numbers here, but the series
> applied on top of for-linus crashes when switching to kyber. A few hunks

Yeah, I saw that too; it has been fixed in my local tree. :-)

> threw fuzz, but it looked fine to me. But I bet I fat fingered
> something.  So it'd be great if you could respin against my for-linus
> branch.

Actually, that is exactly what I am doing, :-)

Thanks,
Ming