On 4/10/18 3:55 AM, Christoph Hellwig wrote: > On Mon, Apr 09, 2018 at 06:34:55PM -0700, Bart Van Assche wrote: >> If a completion occurs after blk_mq_rq_timed_out() has reset >> rq->aborted_gstate and the request is again in flight when the timeout >> expires then a request will be completed twice: a first time by the >> timeout handler and a second time when the regular completion occurs. >> >> Additionally, the blk-mq timeout handling code ignores completions that >> occur after blk_mq_check_expired() has been called and before >> blk_mq_rq_timed_out() has reset rq->aborted_gstate. If a block driver >> timeout handler always returns BLK_EH_RESET_TIMER then the result will >> be that the request never terminates. >> >> Since the request state can be updated from two different contexts, >> namely regular completion and request timeout, this race cannot be >> fixed with RCU synchronization only. Fix this race as follows: >> - Split __deadline in two variables, namely lq_deadline for legacy >> request queues and mq_deadline for blk-mq request queues. Use atomic >> operations to update mq_deadline. > > I don't think we need the atomic_long_cmpxchg, and can do with a plain > cmpxhg. Also unterminated cmpxchg loops are a bad idea, but I think > both callers are protected from other changes so we can turn that > into a WARN_ON(). That plus some minor cleanups in the version below, > only boot tested: This version looks reasonable, let me throw it through the testing machinery. -- Jens Axboe