On Sat, Mar 16, 2019 at 12:15 AM Keith Busch <kbusch@xxxxxxxxxx> wrote:
>
> On Fri, Mar 15, 2019 at 01:57:38AM -0700, Jianchao Wang wrote:
> > tags->rqs[] will not be cleaned when the driver tag is freed, and
> > there is a window between getting the driver tag and writing
> > tags->rqs[], so we may see a stale rq in tags->rqs[] which may have
> > been freed, as in the following case:
> >
> > blk_mq_get_request             blk_mq_queue_tag_busy_iter
> >   -> blk_mq_get_tag
> >                                -> bt_for_each
> >                                  -> bt_iter
> >                                    -> rq = tags->rqs[]
> >                                    -> rq->q
> >   -> blk_mq_rq_ctx_init
> >     -> data->hctx->tags->rqs[rq->tag] = rq;
> >
> > To fix this, this patch changes blk_mq_queue_tag_busy_iter to use
> > tags->static_rqs[] instead of tags->rqs[]. We have to identify
> > whether an io scheduler is attached to decide whether to use
> > hctx->tags or hctx->sched_tags, and we try to get a non-zero
> > q_usage_counter before that, so it is safe to access them. Add an
> > 'inflight' parameter to determine whether to iterate in-flight
> > requests or just busy tags. A correction here is that
> > part_in_flight should count the busy tags instead of the rqs that
> > have obtained driver tags.
> >
> > Moreover, get rid of the 'hctx' parameter in the iter function, as
> > we can get it from rq->mq_hctx, and export this interface for
> > drivers.
> >
> > Signed-off-by: Jianchao Wang <jianchao.w.wang@xxxxxxxxxx>
> > ---
> >  block/blk-mq-tag.c     | 76 +++++++++++++++++++++++++++++++++-----------------
> >  block/blk-mq-tag.h     |  2 --
> >  block/blk-mq.c         | 31 ++++++++------------
> >  include/linux/blk-mq.h |  5 ++--
> >  4 files changed, 64 insertions(+), 50 deletions(-)
> >
> > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> > index b792537..cdec2cd 100644
> > --- a/block/blk-mq-tag.c
> > +++ b/block/blk-mq-tag.c
> > @@ -216,26 +216,38 @@ struct bt_iter_data {
> >         busy_iter_fn *fn;
> >         void *data;
> >         bool reserved;
> > +       bool inflight;
> >  };
> >
> >  static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
> >  {
> >         struct bt_iter_data *iter_data = data;
> >         struct blk_mq_hw_ctx *hctx = iter_data->hctx;
> > -       struct blk_mq_tags *tags = hctx->tags;
> >         bool reserved = iter_data->reserved;
> > +       struct blk_mq_tags *tags;
> >         struct request *rq;
> >
> > +       tags = hctx->sched_tags ? hctx->sched_tags : hctx->tags;
> > +
> >         if (!reserved)
> >                 bitnr += tags->nr_reserved_tags;
> > -       rq = tags->rqs[bitnr];
> > +       /*
> > +        * Because tags->rqs[] will not been cleaned when free driver tag
> > +        * and there is a window between get driver tag and write tags->rqs[],
> > +        * so we may see stale rq in tags->rqs[] which may have been freed.
> > +        * Using static_rqs[] is safer.
> > +        */
> > +       rq = tags->static_rqs[bitnr];
> >
> >         /*
> > -        * We can hit rq == NULL here, because the tagging functions
> > -        * test and set the bit before assigning ->rqs[].
> > +        * There is a small window between get tag and blk_mq_rq_ctx_init,
> > +        * so rq->q and rq->mq_hctx maybe different.
> >          */
> > -       if (rq && rq->q == hctx->queue)
> > -               return iter_data->fn(hctx, rq, iter_data->data, reserved);
> > +       if (rq && rq->q == hctx->queue &&
> > +           rq->mq_hctx == hctx &&
> > +           (!iter_data->inflight ||
> > +            blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT))
> > +               return iter_data->fn(rq, iter_data->data, reserved);
>
> There is still a window where the check may succeed, but the request is
> being assigned to a completely different request_queue. The callback
> then operates on a request it doesn't own.

The situation above is only related to the 'none' io scheduler. However,
this patch doesn't change the behavior for 'none', given that .rqs[tag]
is the same as .static_rqs[tag], so I guess your concern may not be
related to this patch.

> Tag iteration from a driver can be safe only if the driver initiates a
> freeze and quiesces all queues sharing the same tagset first. The nvme
> driver does that, but I think there may need to be an additional grace
> period to wait for the allocation's queue_enter to call queue_exit to
> ensure the request is initialized through blk_mq_rq_ctx_init().

blk_mq_queue_tag_busy_iter() is used by the blk-mq core code only for io
accounting or timeout handling on the specified request_queue, while
NVMe's uses actually iterate over the whole set of driver tags.

Thanks,
Ming Lei
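For readers following along, the "freeze and quiesce first" constraint Keith describes can be sketched as pseudocode. blk_mq_quiesce_queue(), blk_mq_unquiesce_queue() and blk_mq_tagset_busy_iter() are real blk-mq helpers; the tag_list walk, 'set', 'dev' and the cancel callback are illustrative stand-ins for whatever a particular driver (e.g. nvme) actually does.

```
/*
 * Pseudocode: tagset-wide iteration from a driver is safe only after
 * every queue sharing the tagset has stopped dispatching.
 */
struct request_queue *q;

list_for_each_entry(q, &set->tag_list, tag_set_list)
        blk_mq_quiesce_queue(q);            /* stop new dispatch on each queue */

/*
 * The additional grace period Keith suggests would go here: wait until
 * every allocation that already passed queue_enter has finished
 * blk_mq_rq_ctx_init(), so the iterator never sees a half-initialized
 * request.
 */

blk_mq_tagset_busy_iter(set, cancel_rq_fn, dev);    /* walk the driver tags */

list_for_each_entry(q, &set->tag_list, tag_set_list)
        blk_mq_unquiesce_queue(q);
```

This also illustrates the distinction in the last paragraph: blk_mq_tagset_busy_iter() walks the driver tags of the whole tagset, while blk_mq_queue_tag_busy_iter() is scoped to one request_queue.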