Re: [PATCH 1/1] [RFC] blk-mq: fix queue stalling on shared hctx restart

Roman Penyaev <roman.penyaev@xxxxxxxxxxxxxxxx> · Fri, 20 Oct 2017 11:39:24 +0200

Hi Bart,

On Thu, Oct 19, 2017 at 7:47 PM, Bart Van Assche <Bart.VanAssche@xxxxxxx> wrote:
> On Wed, 2017-10-18 at 12:22 +0200, Roman Pen wrote:
>> the patch below fixes queue stalling when shared hctx marked for restart
>> (BLK_MQ_S_SCHED_RESTART bit) but q->shared_hctx_restart stays zero.  The
>> root cause is that hctxs are shared between queues, but 'shared_hctx_restart'
>> belongs to the particular queue, which in fact may not need to be restarted,
>> thus we return from blk_mq_sched_restart() and leave shared hctx of another
>> queue never restarted.
>>
>> The fix is to make shared_hctx_restart counter belong not to the queue, but
>> to tags, thereby counter will reflect real number of shared hctx needed to
>> be restarted.
>
> Hello Roman,
>
> The patch you posted looks fine to me but seeing this patch and the patch
> description makes me wonder why this had not been noticed before.

This is a good question, which I could not answer.  I tried to simulate the
same behaviour (completion timings, completion pinning, number of submission
queues, shared tags, etc) on null block.  but what I see is that
*_sched_restart()
never observes 'shared_hctx_restart',  literally never (I made a counter when
we take a path and start looking for a hctx to restart, and a counter stays 0).

That makes me nervous and then I gave up.  After some time I want return to
that and try to reproduce the problem on something else, say nvme.

> Are you perhaps using a block driver that returns BLK_STS_RESOURCE more
> often than other block drivers? Did you perhaps run into this with the
> Infiniband network block device (IBNBD) driver?

Yep, this is IBNBD, but in these tests I tested with mq scheduler, shared tags
and 1 hctx for each queue (blk device),  thus I never run out of internal tags
and never return BLK_STS_RESOURCE.

Indeed, not modified IBNBD does internal tags management.  This was needed
because each queue (block device) was created with hctx number (nr_hw_queues)
equal to number of cpus on the system, but blk-mq tags set is shared only
between hctx, not globally, which led to need to return BLK_STS_RESOURCE
and queues restarts.

But, with mq scheduler situation changed: 1 hctx with shared tags can be
specified for all hundreds of devices without any performance impact.

Testing this configuration (1 hctx, shared tags, mq-deadline) immediately
shows these two problems: request stalling and slow loops inside
blk_mq_sched_restart().

> No matter what driver triggered this, I think this bug should be fixed.

Yes, queue stalling can be easily fixed.  I can resend current patch with
shorter description which targets only this particular bug, if no one else
has objections/comments etc.

But what bothers me is these looong loops inside blk_mq_sched_restart(),
and since you are the author of the original 6d8c6c0f97ad ("blk-mq: Restart
a single queue if tag sets are shared") I want to ask what was the original
problem which you attempted to fix?  Likely I am missing some test scenario
which would be great to know about.

--
Roman