Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

Bart Van Assche <Bart.VanAssche@xxxxxxx> · Tue, 10 Apr 2018 14:30:26 +0000

On Tue, 2018-04-10 at 07:20 -0700, Tejun Heo wrote:
> On Mon, Apr 09, 2018 at 06:34:55PM -0700, Bart Van Assche wrote:
> > Since the request state can be updated from two different contexts,
> > namely regular completion and request timeout, this race cannot be
> > fixed with RCU synchronization only. Fix this race as follows:
> 
> Well, it can be and the patches have been posted months ago.

That's not correct. I have explained to you in detail that the two patches
you posted do not fix all the races fixed by the patch at the start of this
e-mail thread.

> Switching to another model might be better but let's please do that
> with the right rationales.  A good portion of this seems to be built
> on misunderstandings.

Which misunderstandings? I'm not aware of any misunderstandings on my side.
Additionally, tests with two different block drivers (the NVMeOF initiator
and the SRP initiator driver) have shown that the current blk-mq timeout
implementation, with or without your two patches applied, results in subtle
and hard-to-debug crashes and/or memory corruption. That is not the case
with the patch at the start of this thread. The latest report of a crash
that I ran into myself, and that is fixed by the patch at the start of this
thread, is available here:
https://www.spinics.net/lists/linux-rdma/msg63240.html

Please also keep in mind that accepting this patch does not prevent it from
being replaced with an RCU-based solution later on. If anyone comes up with
a reliably working RCU-based solution at any time, I will be happy to accept
a revert of this patch and I will help review that RCU-based solution.

Bart.