On Tue, Jul 11, 2017 at 12:25:16PM -0600, Jens Axboe wrote:
> On 07/11/2017 12:21 PM, Ming Lei wrote:
> > When .queue_rq() returns BLK_STS_RESOURCE(BUSY), we can
> > consider that there is congestion in either low level
> > driver or hardware.
> >
> > This patch uses EWMA to estimate this congestion threshold,
> > then this threshold can be used to detect/avoid congestion.
>
> This whole patch set lacks some sort of reasoning why this
> makes sense. I'm assuming you want to reduce unnecessary
> restarts of the queue?

When a low level driver returns BLK_STS_RESOURCE, it means either the
driver or the hardware can't handle the incoming requests. This is
really a congestion control problem, IMO.

Drivers currently deal with the issue in a couple of ways:

1) virtio-blk and xen-blkfront stop the queue in this case, then
restart the queue once one request completes.

2) virtio-scsi chooses not to stop the queue at all.

To reach good IO performance, we usually do the following two things:

- keep the queue pipeline as full as possible by increasing queue depth
- run several tasks doing IO at the same time

So solution 1) can't be efficient: by the time one request completes
and restarts the queue, lots of requests have queued up, and
dispatching them just returns lots of BLK_STS_RESOURCE again. The
dispatch isn't cheap at all; we need to acquire the sw queue lock,
scheduler lock, hw queue lock and low level queue lock along the whole
path, so we should try to avoid dispatching to a busy queue.

Solution 2) simply never stops the queue, so the unnecessary
dispatching to a busy queue just wastes CPU, and it has the same issue
as 1) too.

As you can see in the commit log of patch 5, a big improvement is
observed with this simple implementation:

	sequential read test (libaio, bs:4k, direct io, queue depth:64,
	8 jobs) on virtio-scsi shows that:

	- CPU utilization decreases ~20%
	- IOPS increases by ~10%

EWMA is used to estimate the average queue depth at which the queue
becomes busy; in this patchset that value is called the congestion
threshold. I chose EWMA because it is a common way to compute a
weighted average, and it is already used in several parts of the
kernel (wifi rate control, sequential I/O detection in bcache,
networking, ...). A small sketch of the idea is appended at the end
of this mail.

If there is a better way to do this, I am happy to take it; in the
future we may even support more than one approach for addressing the
issue. Another motivation is to unexport the start/stop queue APIs.

> I would much rather ensure that we only
> start when we absolutely have to to begin with, I'm pretty sure
> we have a number of cases where that is not so.

Could you share those cases with us? I believe we can integrate
different congestion control approaches to address different
requirements.

> What happens with fluid congestion boundaries, with shared tags?

The approach in this patch should still work, but the threshold may
not be accurate in that case. One simple way is to feed the average
per-queue tag usage into the EWMA, like this:

	sbitmap_weight() / hctx->tags->active_queues

--
Ming
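
PS: below is a standalone sketch of the EWMA update described above.
The names, the 1/8 weight and the sample values are only illustrative
(not the actual patch code); it just shows how the estimated threshold
converges toward the depth at which .queue_rq() starts returning
BLK_STS_RESOURCE.

	#include <stdio.h>

	#define EWMA_SHIFT	3	/* each new sample contributes 1/8 */

	static unsigned int ewma_update(unsigned int avg, unsigned int sample)
	{
		if (!avg)		/* first sample seeds the average */
			return sample;
		/* avg = avg - avg/8 + sample/8, in integer arithmetic */
		return avg - (avg >> EWMA_SHIFT) + (sample >> EWMA_SHIFT);
	}

	int main(void)
	{
		/* queue depths observed when BLK_STS_RESOURCE was returned */
		unsigned int busy_depth[] = { 64, 48, 56, 40, 52, 44 };
		unsigned int threshold = 0;
		unsigned int i;

		for (i = 0; i < sizeof(busy_depth) / sizeof(busy_depth[0]); i++) {
			threshold = ewma_update(threshold, busy_depth[i]);
			printf("sample %2u -> threshold %u\n",
			       busy_depth[i], threshold);
		}
		return 0;
	}

For the shared-tags case, the same update would simply be fed with the
per-queue share of the in-flight tags instead of the raw depth, as in
the sbitmap_weight()/active_queues line above.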