Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling


 



On Wed, 2018-02-07 at 09:06 -0800, Tejun Heo wrote:
> On Tue, Feb 06, 2018 at 05:11:33PM -0800, Bart Van Assche wrote:
> > The following race can occur between the code that resets the timer
> > and completion handling:
> > - The code that handles BLK_EH_RESET_TIMER resets aborted_gstate.
> > - A completion occurs and blk_mq_complete_request() calls
> >   __blk_mq_complete_request().
> > - The timeout code calls blk_add_timer() and that function sets the
> >   request deadline and adjusts the timer.
> > - __blk_mq_complete_request() frees the request tag.
> > - The timer fires and the timeout handler gets called for a freed
> >   request.
> 
> Can you see whether by any chance the following patch fixes the issue?
> If not, can you share the repro case?
> 
> Thanks.
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index df93102..651d18c 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -836,8 +836,8 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
>  		 * ->aborted_gstate is set, this may lead to ignored
>  		 * completions and further spurious timeouts.
>  		 */
> -		blk_mq_rq_update_aborted_gstate(req, 0);
>  		blk_add_timer(req);
> +		blk_mq_rq_update_aborted_gstate(req, 0);
>  		break;
>  	case BLK_EH_NOT_HANDLED:
>  		break;

Hello Tejun,

Even with the above change I think that there is still a race between the
code that handles timer resets and the completion handler. Anyway, the test
with which I triggered these races is as follows:
- Start from what will become kernel v4.16-rc1 and apply the patch that adds
  SRP over RoCE support to the ib_srpt driver. See also the "[PATCH v2 00/14]
  IB/srpt: Add RDMA/CM support" patch series
  (https://www.spinics.net/lists/linux-rdma/msg59589.html).
- Apply my patch series that fixes a race between the SCSI error handler and
  SCSI transport recovery.
- Apply my patch series that improves the stability of the SCSI target core
  (LIO).
- Build and install that kernel.
- Clone the following repository: https://github.com/bvanassche/srp-test.
- Run the following test:
  while true; do srp-test/run_tests -c -t 02-mq; done
- While the test is running, check whether anything goes wrong. Sometimes
  scsi_times_out() crashes, and sometimes a soft lockup is reported inside
  blk_mq_do_dispatch_ctx().
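
The check in the last step could be automated roughly as sketched below. This
is only a sketch under assumptions: the function names has_symptom and
run_until_symptom are hypothetical, the grep patterns are guesses at how the
two symptoms show up in the kernel log, and clearing the log with dmesg needs
root. If the crash takes the machine down, nothing will be printed and a
serial console or similar is needed instead.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeed if the kernel log text on stdin mentions
# either symptom described above. The patterns are assumptions about the
# log wording, not an exhaustive list.
has_symptom() {
    grep -E -q 'soft lockup|scsi_times_out'
}

# Hypothetical wrapper around the test loop: run the test repeatedly and
# stop as soon as one of the symptoms shows up in the kernel log.
run_until_symptom() {
    # Drop old messages so that stale matches do not end the loop early.
    dmesg --clear
    while true; do
        srp-test/run_tests -c -t 02-mq
        if dmesg | has_symptom; then
            # Show the matching lines for inspection, then stop.
            dmesg | grep -E 'soft lockup|scsi_times_out'
            break
        fi
    done
}
```

After sourcing this file, run_until_symptom has to be called from the
directory that contains the srp-test clone.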

If you want, I can share the GitHub tree that I use for my tests.

Thanks,

Bart.



