Re: [PATCH] blk-mq: Fix several SCSI request queue lockups

Bart Van Assche <Bart.VanAssche@xxxxxxx> · Mon, 4 Dec 2017 23:32:27 +0000

On Tue, 2017-12-05 at 07:01 +0800, Ming Lei wrote:
> On Mon, Dec 04, 2017 at 10:48:18PM +0000, Bart Van Assche wrote:
> > On Tue, 2017-12-05 at 06:42 +0800, Ming Lei wrote:
> > > On Mon, Dec 04, 2017 at 09:30:32AM -0800, Bart Van Assche wrote:
> > > > * A systematic lockup for SCSI queues with queue depth 1. The
> > > >   following test reproduces that bug systematically:
> > > >   - Change the SRP initiator such that SCSI target queue depth is
> > > >     limited to 1.
> > > >   - Run the following command:
> > > >       srp-test/run_tests -f xfs -d -e none -r 60 -t 01
> > > >   See also "[PATCH 4/7] blk-mq: Avoid that request processing
> > > >   stalls when sharing tags"
> > > >   (https://marc.info/?l=linux-block&m=151208695316857). Note:
> > > >   reverting commit 0df21c86bdbf also fixes a sporadic SCSI request
> > > >   queue lockup while inserting a blk_mq_sched_mark_restart_hctx()
> > > >   before all blk_mq_dispatch_rq_list() calls only fixes the
> > > >   systematic lockup for queue depth 1.
> > > 
> > > You are the only reproducer [ ... ]
> > 
> > That's not correct. I'm pretty sure if you try to reproduce this that
> > you will see the same hang I ran into. Does this mean that you have not
> > yet tried to reproduce the hang I reported?
> 
> Do you mean every kernel developer has to own one SRP/IB hardware?

When I have the time I will make it possible to run this test on any system
equipped with at least one Ethernet port. But the fact that the test I
mentioned requires IB hardware should not prevent you from running this test
since you have run this test software before.

> I don't have your hardware to reproduce that,

That's not true. Your employer definitely owns IB hardware. E.g. the
following message shows that you have run the srp-test yourself on IB hardware
only four weeks ago:

https://www.spinics.net/lists/linux-block/msg19511.html

> Otherwise, there should have be such similar reports from others, not from
> only you.

That's not correct either. How long was it ago that kernel v4.15-rc1 was
released? One week? How many SRP users do you think have tried to trigger a
queue full condition with that kernel version?

> More importantly I don't understand why you can't share the kernel
> log/debugfs log when IO hang happens?
> 
> Without any kernel log, how can we confirm that it is a valid report?

It's really unfortunate that you are focussing on denying that the bug I
reported exists instead of trying to fix the bugs introduced by commit
b347689ffbca. BTW, I have more than enough experience to decide myself what
a valid report is and what not. I can easily send you several MB of kernel
logs. The reason I have not yet done this is because I'm 99.9% sure that
these won't help to root cause the reported issue. But I can tell you what
I learned from analyzing the information under /sys/kernel/debug/block:
every time a hang occurred I noticed that no requests were "busy", that two
requests occurred in rq_lists and one request occurred in a hctx dispatch
list. This is enough to conclude that a queue run was missing. And I think
that the patch at the start of this e-mail thread not only shows that the
root cause is in the block layer but also that this bug was introduced by
commit b347689ffbca.

> > > You said that your patch fixes 'commit b347689ffbca ("blk-mq-sched:
> > > improve dispatching from sw queue")', but you don't mention any issue
> > > about that commit.
> > 
> > That's not correct either. From the commit message "A systematic lockup
> > for SCSI queues with queue depth 1."
> 
> I mean you mentioned your patch can fix 'commit b347689ffbca
> ("blk-mq-sched: improve dispatching from sw queue")', but you never
> point where the commit b347689ffbca is wrong, how your patch fixes
> the mistake of that commit.

You should know that it is not required to perform a root cause analysis
before posting a revert. Having performed a bisect is sufficient.

BTW, it seems like you forgot that last Friday I explained to you that there
is an obvious bug in commit b347689ffbca, namely that a blk_mq_sched_mark_restart_hctx()
call is missing in blk_mq_sched_dispatch_requests() before the blk_mq_do_dispatch_ctx()
call. See also https://marc.info/?l=linux-block&m=151215794224401.

Bart.