Re: [PATCH v2] block: I/O error occurs during SATA disk stress test

On 8/25/22 00:09, Gu Mi wrote:
> The problem occurs between two asynchronous paths. One is a new IO
> calling blk_mq_start_request() to start sending. The other is the
> block layer timer calling blk_mq_req_expired() to check whether an IO
> has timed out.
>
> When instructions execute out of order between blk_add_timer() and
> WRITE_ONCE(rq->state, MQ_RQ_IN_FLIGHT) in blk_mq_start_request(), and
> the block timer checks the new IO for a timeout at that moment, the
> req state has already been set to MQ_RQ_IN_FLIGHT while req->deadline
> is still 0, so the new IO is misjudged as timed out.
>
> Our fix is to not treat a deadline of 0 as a timeout. At the same
> time, because jiffies on a 32-bit system wraps around shortly after
> boot, we add 1 jiffy to the deadline in that case.

Hi Gu,

With which kernel version has this race been observed? Since commit
2e315dc07df0 ("blk-mq: grab rq->refcount before calling ->fn in
blk_mq_tagset_busy_iter") the request reference count is increased
before the timeout handler (blk_mq_check_expired()) is called. Do you
agree that since then it is no longer possible that
blk_mq_start_request() is called while blk_mq_check_expired() is in
progress?

Thanks,

Bart.

---

Hi Bart,

This problem occurs on kernel version 5.10, and I have read the commit
you mentioned. The problem I observed is not the request-reuse problem
fixed by commit 2e315dc07df0, but a different one.

The specific scenario is this: a new IO has called
blk_mq_start_request() to start sending, and out-of-order execution
occurs between blk_add_timer() and
WRITE_ONCE(rq->state, MQ_RQ_IN_FLIGHT) in blk_mq_start_request(), so
req->state is already MQ_RQ_IN_FLIGHT while req->deadline is still 0.
At this very moment the timeout handler, blk_mq_check_expired(), checks
whether this new IO has timed out. The condition
time_after_eq(jiffies, deadline) in blk_mq_req_expired(), which is
called by blk_mq_check_expired(), evaluates to true. The end result is
that this new IO is considered to have timed out.

I looked at the latest kernel code and the problem persists. Do you
agree with my analysis?

Thanks,

Gu Mi.
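
For readers following the thread, the interleaving described above can be sketched in user space. This is only an illustrative model, not kernel code and not the patch itself: every `model_*` name is hypothetical, the struct holds just the two fields discussed, and the reordering is represented simply by constructing the state the timer would observe (state already in flight, deadline still 0). The "fixed" check and the deadline helper are one possible reading of the repair plan quoted above, which may differ from what the actual patch does.

```c
#include <assert.h>
#include <stdbool.h>

/* User-space stand-ins for the kernel definitions; all names are hypothetical. */
enum model_rq_state { MODEL_RQ_IDLE, MODEL_RQ_IN_FLIGHT };

struct model_request {
    enum model_rq_state state;
    unsigned long deadline;   /* stays 0 until the timer store is visible */
};

/* Wrap-safe "a is at or after b", the same idea as the kernel's
 * time_after_eq(): compare the signed difference against zero. */
static bool model_time_after_eq(unsigned long a, unsigned long b)
{
    return (long)(a - b) >= 0;
}

/* Expiry check as described in the thread: in flight and deadline reached. */
static bool model_req_expired(const struct model_request *rq, unsigned long now)
{
    if (rq->state != MODEL_RQ_IN_FLIGHT)
        return false;
    return model_time_after_eq(now, rq->deadline);
}

/* Assumed reading of the proposed repair: a deadline of 0 means the timer
 * has not been armed yet, so it is never reported as a timeout. */
static bool model_req_expired_fixed(const struct model_request *rq,
                                    unsigned long now)
{
    if (rq->state != MODEL_RQ_IN_FLIGHT)
        return false;
    if (rq->deadline == 0)
        return false;
    return model_time_after_eq(now, rq->deadline);
}

/* Assumed companion change: never produce 0 as a real deadline, so the
 * value 0 can be reserved for "not armed". */
static unsigned long model_compute_deadline(unsigned long now,
                                            unsigned long timeout)
{
    unsigned long deadline = now + timeout;   /* unsigned, may wrap */

    if (deadline == 0)
        deadline = 1;   /* add one jiffy when the sum wraps to exactly 0 */
    return deadline;
}

/* Walks through the interleaving from the thread. */
static void model_demo(void)
{
    /* What the timer sees: state store visible, deadline store not yet. */
    struct model_request rq = { MODEL_RQ_IN_FLIGHT, 0 };
    unsigned long now = 1000;

    assert(model_req_expired(&rq, now));          /* misjudged as timed out */
    assert(!model_req_expired_fixed(&rq, now));   /* zero deadline ignored */

    rq.deadline = model_compute_deadline(now, 30); /* timer store arrives */
    assert(!model_req_expired_fixed(&rq, now));      /* not yet due */
    assert(model_req_expired_fixed(&rq, now + 30));  /* due at the deadline */
}
```

Calling `model_demo()` exercises both checks on the same observed state. The sketch deliberately leaves out memory ordering: it shows why a visible MQ_RQ_IN_FLIGHT combined with a zero deadline satisfies the comparison, not how the two stores come to be observed in that order.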