On Wed, 2018-02-07 at 12:09 -0800, tj@xxxxxxxxxx wrote:
> Hello,
>
> On Wed, Feb 07, 2018 at 07:03:56PM +0000, Bart Van Assche wrote:
> > I tried the above patch but already during the first iteration of the test I
> > noticed that the test hung, probably due to the following request that got stuck:
> >
> > $ (cd /sys/kernel/debug/block && grep -aH . */*/*/rq_list)
> > 00000000a98cff60 {.op=SCSI_IN, .cmd_flags=, .rq_flags=MQ_INFLIGHT|PREEMPT|QUIET|IO_STAT|PM,
> >  .state=idle, .tag=22, .internal_tag=-1, .cmd=Synchronize Cache(10) 35 00 00 00 00 00, .retries=0,
> >  .result = 0x0, .flags=TAGGED, .timeout=60.000, allocated 872.690 s ago}
>
> I'm wonder how this happened, so we can lose a completion when it
> races against BLK_EH_RESET_TIMER; however, the command should timeout
> later cuz the timer is running again now. Maybe we actually had the
> memory barrier race that you pointed out in the other message?

Hello Tejun,

The patch that I used in my test had an smp_wmb() call (see also below). Anyway,
I will see whether I can extract more state information through debugfs.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ef4f6df0f1df..8eb2105d82b7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -827,13 +827,9 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		__blk_mq_complete_request(req);
 		break;
 	case BLK_EH_RESET_TIMER:
-		/*
-		 * As nothing prevents from completion happening while
-		 * ->aborted_gstate is set, this may lead to ignored
-		 * completions and further spurious timeouts.
-		 */
-		blk_mq_rq_update_aborted_gstate(req, 0);
 		blk_add_timer(req);
+		smp_wmb();
+		blk_mq_rq_update_aborted_gstate(req, 0);
 		break;
 	case BLK_EH_NOT_HANDLED:
 		break;
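
For illustration only, here is a rough user-space sketch of the ordering the smp_wmb() in the BLK_EH_RESET_TIMER case is meant to enforce: re-arm the timer first, and only then clear the aborted state. It is not kernel code. The struct and function names (fake_rq, fake_reset_timer, fake_try_complete) are hypothetical. C11 fences stand in for the kernel barriers, and a release fence is somewhat stronger than smp_wmb(). The acquire fence on the completion side is an assumption about how a reader would have to pair with it.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two pieces of request state involved. */
struct fake_rq {
	_Atomic uint64_t deadline;       /* models the timer armed by blk_add_timer() */
	_Atomic uint64_t aborted_gstate; /* non-zero: completions are ignored */
};

/* Timeout-handler side: re-arm first, then publish "no longer aborted". */
static void fake_reset_timer(struct fake_rq *rq, uint64_t new_deadline)
{
	atomic_store_explicit(&rq->deadline, new_deadline, memory_order_relaxed);
	/* Plays the role of smp_wmb(): the store above is ordered before the one below. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&rq->aborted_gstate, 0, memory_order_relaxed);
}

/*
 * Completion side: returns false while the request is marked aborted.
 * Otherwise stores the deadline it observes in *deadline and returns true.
 */
static bool fake_try_complete(struct fake_rq *rq, uint64_t *deadline)
{
	if (atomic_load_explicit(&rq->aborted_gstate, memory_order_relaxed) != 0)
		return false; /* this completion would be ignored */
	/* Pairs with the release fence in fake_reset_timer(). */
	atomic_thread_fence(memory_order_acquire);
	*deadline = atomic_load_explicit(&rq->deadline, memory_order_relaxed);
	return true;
}
```

With this ordering, a completion path that observes aborted_gstate == 0 as written by fake_reset_timer() is also guaranteed to observe the re-armed deadline, so a completion that is accepted never sees a stale timer.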