On Wed, Apr 14, 2021 at 01:51:01PM +0530, Kashyap Desai wrote:
> > On Wed, Apr 07, 2021 at 09:04:30AM +0100, John Garry wrote:
> > > Reviewed-by: John Garry <john.garry@xxxxxxxxxx>
> > >
> > > > On Tue, Apr 06, 2021 at 11:25:08PM +0100, John Garry wrote:
> > > > > On 06/04/2021 04:19, Ming Lei wrote:
> > > > >
> > > > > Hi Ming,
> > > > >
> > > > > > Yanhui found that write performance is degraded a lot after
> > > > > > applying the hctx shared tagset on one test machine with
> > > > > > megaraid_sas. It turns out the regression is caused by the
> > > > > > none scheduler, which becomes the default elevator because
> > > > > > of the hctx shared tagset patchset.
> > > > > >
> > > > > > Given that more SCSI HBAs will apply the hctx shared tagset,
> > > > > > a similar performance drop will exist for them too.
> > > > > >
> > > > > > So keep the previous behavior by still using mq-deadline as
> > > > > > the default for queues which apply the hctx shared tagset,
> > > > > > just like before.
> > > > > I think that there are some SCSI HBAs which have nr_hw_queues > 1
> > > > > and don't use a shared sbitmap - do you think that they would
> > > > > want this as well (without knowing it)?
>
> John - I have noted this and am discussing it internally.
> This patch fixing the shared host tag behavior is good (and required to
> keep the earlier behavior intact), but for <mpi3mr>, which is a true
> multi-hardware-queue interface, I will update later.
> In general, most OS vendors recommend <mq-deadline> for rotational
> media and <none> for non-rotational media. We would like to go with
> this approach in the <mpi3mr> driver.
>
> > > > I don't know, but none has been used for them since the beginning,
> > > > so that is not a regression of the shared tagset, while this one
> > > > really is.
> > >
> > > It seems fine to revert to the previous behavior when host_tagset is
> > > set. I didn't check the results for this recently, but for the
> > > original shared tagset patchset [0] I had:
> > >
> > > none sched: 2132K IOPS
> > > mq-deadline sched: 2145K IOPS
>
> On my local setup I also did not see much difference.
>
> > BTW, Yanhui reported that sequential write on virtio-scsi drops by
> > 40~70% in a VM, and the virtio-scsi is backed by a file image on XFS
> > over megaraid_sas. And the disk is actually an SSD, not an HDD. It
> > could be worse in the case of a megaraid_sas HDD.
>
> Ming - If we have the old megaraid_sas driver (without the host tag set
> patch), does just toggling the io-scheduler from <none> to <mq-deadline>
> (through sysfs) also give a similar performance drop?

The default io scheduler for the old megaraid_sas is mq-deadline, which
performs very well in Yanhui's virt workloads. With none, IO performance
drops a lot with the new driver (shared tags). The disk is an INTEL
SSDSC2CT06.

> I think the performance drop with the <none> io scheduler might be due
> to bio merging being missing compared to mq-deadline. It may not be
> linked to the shared host tag IO path.
> Usually bio merging does not help sequential workloads if the back-end
> is an enterprise SSD/NVMe, but that is not always true. It is difficult
> for every setup and workload to benefit from one io-scheduler.

BTW, with mq-deadline & shared tags, CPU utilization is increased by ~20%
in some VM fio tests.

Thanks,
Ming
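
For context, the change being discussed amounts to treating hostwide
shared tags like the single hw queue case when the default elevator is
chosen in block/elevator.c. Below is a rough sketch of that idea only;
it is not the actual patch, and the helper names used here
(elevator_get_default(), blk_mq_is_sbitmap_shared(), elevator_get())
are recalled from the ~5.12-era block layer and may not match the
merged change exactly.

/*
 * Sketch, not the actual patch: by default blk-mq only picks
 * mq-deadline for devices with a single hw queue, so HBAs that
 * register multiple hw queues fall back to "none".  Hostwide
 * shared-tag HBAs share one tag space across all hw queues, so
 * treating them like the single-queue case restores the old
 * mq-deadline default that megaraid_sas used to get.
 */
static struct elevator_type *elevator_get_default(struct request_queue *q)
{
	/* multiple hw queues without shared tags: keep "none" by default */
	if (q->nr_hw_queues != 1 &&
	    !blk_mq_is_sbitmap_shared(q->tag_set->flags))
		return NULL;

	/* single hw queue, or hostwide shared tags: default to mq-deadline */
	return elevator_get(q, "mq-deadline", false);
}

Either way, the default only matters when the queue is set up; the
scheduler can still be switched per device at runtime through
/sys/block/<dev>/queue/scheduler, which is the sysfs toggle Kashyap
refers to above.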