On 11/3/21 9:16 AM, Ming Lei wrote:
> On Wed, Nov 03, 2021 at 09:10:17AM -0600, Jens Axboe wrote:
>> On 11/3/21 9:03 AM, Jens Axboe wrote:
>>> On 11/3/21 8:57 AM, Ming Lei wrote:
>>>> On Wed, Nov 03, 2021 at 09:59:02PM +0800, Yi Zhang wrote:
>>>>> On Wed, Nov 3, 2021 at 7:59 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
>>>>>>
>>>>>> On 11/2/21 9:54 PM, Jens Axboe wrote:
>>>>>>> On Nov 2, 2021, at 9:52 PM, Ming Lei <ming.lei@xxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> On Tue, Nov 02, 2021 at 09:21:10PM -0600, Jens Axboe wrote:
>>>>>>>>>> On 11/2/21 8:21 PM, Yi Zhang wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Can either one of you try with this patch? Won't fix anything, but it'll
>>>>>>>>>>>> hopefully shine a bit of light on the issue.
>>>>>>>>>>>>
>>>>>>>>>> Hi Jens
>>>>>>>>>>
>>>>>>>>>> Here is the full log:
>>>>>>>>>
>>>>>>>>> Thanks! I think I see what it could be - can you try this one as well?
>>>>>>>>> I would like to confirm that the condition I think is triggering is
>>>>>>>>> what is triggering.
>>>>>>>>>
>>>>>>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>>>>>>> index 07eb1412760b..81dede885231 100644
>>>>>>>>> --- a/block/blk-mq.c
>>>>>>>>> +++ b/block/blk-mq.c
>>>>>>>>> @@ -2515,6 +2515,8 @@ void blk_mq_submit_bio(struct bio *bio)
>>>>>>>>>  	if (plug && plug->cached_rq) {
>>>>>>>>>  		rq = rq_list_pop(&plug->cached_rq);
>>>>>>>>>  		INIT_LIST_HEAD(&rq->queuelist);
>>>>>>>>> +		WARN_ON_ONCE(q->elevator && !(rq->rq_flags & RQF_ELV));
>>>>>>>>> +		WARN_ON_ONCE(!q->elevator && (rq->rq_flags & RQF_ELV));
>>>>>>>>>  	} else {
>>>>>>>>>  		struct blk_mq_alloc_data data = {
>>>>>>>>>  			.q		= q,
>>>>>>>>> @@ -2535,6 +2537,8 @@ void blk_mq_submit_bio(struct bio *bio)
>>>>>>>>>  				bio_wouldblock_error(bio);
>>>>>>>>>  			goto queue_exit;
>>>>>>>>>  		}
>>>>>>>>> +		WARN_ON_ONCE(q->elevator && !(rq->rq_flags & RQF_ELV));
>>>>>>>>> +		WARN_ON_ONCE(!q->elevator && (rq->rq_flags & RQF_ELV));
>>>>>>>>
>>>>>>>> Hello Jens,
>>>>>>>>
>>>>>>>> I guess the issue could be that the following code runs without
>>>>>>>> grabbing ->q_usage_counter in blk_mq_alloc_request() and
>>>>>>>> blk_mq_alloc_request_hctx():
>>>>>>>>
>>>>>>>> 	.rq_flags	= q->elevator ? RQF_ELV : 0,
>>>>>>>>
>>>>>>>> The elevator is then switched from none to a real one, and the check
>>>>>>>> on q->elevator becomes inconsistent.
>>>>>>>
>>>>>>> Indeed, that’s where I was going with this. I have a patch, testing it
>>>>>>> locally, but it’s getting late. Will send it out tomorrow. The nice
>>>>>>> benefit is that it allows dropping the weird ref get on plug flush,
>>>>>>> and batches getting the refs as well.
>>>>>>
>>>>>> Yi/Steffen, can you try pulling this into your test kernel:
>>>>>>
>>>>>> git://git.kernel.dk/linux-block for-next
>>>>>>
>>>>>> and see if it fixes the issue for you. Thanks!
>>>>>
>>>>> It can still be reproduced with the latest linux-block/for-next; here
>>>>> is the log:
>>>>>
>>>>> fab2914e46eb (HEAD, new/for-next) Merge branch 'for-5.16/drivers' into for-next
>>>>
>>>> Hi Yi,
>>>>
>>>> Please try the following change:
>>>>
>>>>
>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>> index e1e64964a31b..eb634a9c61ff 100644
>>>> --- a/block/blk-mq.c
>>>> +++ b/block/blk-mq.c
>>>> @@ -494,7 +494,6 @@ struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
>>>>  		.q		= q,
>>>>  		.flags		= flags,
>>>>  		.cmd_flags	= op,
>>>> -		.rq_flags	= q->elevator ? RQF_ELV : 0,
>>>>  		.nr_tags	= 1,
>>>>  	};
>>>>  	struct request *rq;
>>>> @@ -504,6 +503,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
>>>>  	if (ret)
>>>>  		return ERR_PTR(ret);
>>>>
>>>> +	data.rq_flags = q->elevator ? RQF_ELV : 0,
>>>>  	rq = __blk_mq_alloc_requests(&data);
>>>>  	if (!rq)
>>>>  		goto out_queue_exit;
>>>> @@ -524,7 +524,6 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>>>>  		.q		= q,
>>>>  		.flags		= flags,
>>>>  		.cmd_flags	= op,
>>>> -		.rq_flags	= q->elevator ? RQF_ELV : 0,
>>>>  		.nr_tags	= 1,
>>>>  	};
>>>>  	u64 alloc_time_ns = 0;
>>>> @@ -551,6 +550,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>>>>  	ret = blk_queue_enter(q, flags);
>>>>  	if (ret)
>>>>  		return ERR_PTR(ret);
>>>> +	data.rq_flags = q->elevator ? RQF_ELV : 0,
>>>
>>> Don't think that will compile, but I guess the point is that we can't do
>>> this assignment before queue enter, in case we're in the midst of
>>> switching schedulers. Which is indeed a valid concern.
>>
>> Something like the below, maybe? On top of the for-next that was already
>> pulled in.
>>
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index b01e05e02277..121f1898d529 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -433,9 +433,11 @@ static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
>>  	if (data->cmd_flags & REQ_NOWAIT)
>>  		data->flags |= BLK_MQ_REQ_NOWAIT;
>>
>> -	if (data->rq_flags & RQF_ELV) {
>> +	if (q->elevator) {
>>  		struct elevator_queue *e = q->elevator;
>>
>> +		data->rq_flags |= RQF_ELV;
>> +
>>  		/*
>>  		 * Flush/passthrough requests are special and go directly to the
>>  		 * dispatch list. Don't include reserved tags in the
>> @@ -494,7 +496,6 @@ struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
>>  		.q		= q,
>>  		.flags		= flags,
>>  		.cmd_flags	= op,
>> -		.rq_flags	= q->elevator ? RQF_ELV : 0,
>>  		.nr_tags	= 1,
>>  	};
>>  	struct request *rq;
>> @@ -524,7 +525,6 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>>  		.q		= q,
>>  		.flags		= flags,
>>  		.cmd_flags	= op,
>> -		.rq_flags	= q->elevator ? RQF_ELV : 0,
>>  		.nr_tags	= 1,
>>  	};
>>  	u64 alloc_time_ns = 0;
>> @@ -565,6 +565,8 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>>
>>  	if (!q->elevator)
>>  		blk_mq_tag_busy(data.hctx);
>> +	else
>> +		data.rq_flags |= RQF_ELV;
>>
>>  	ret = -EWOULDBLOCK;
>>  	tag = blk_mq_get_tag(&data);
>> @@ -2560,7 +2562,6 @@ void blk_mq_submit_bio(struct bio *bio)
>>  		.q		= q,
>>  		.nr_tags	= 1,
>>  		.cmd_flags	= bio->bi_opf,
>> -		.rq_flags	= q->elevator ? RQF_ELV : 0,
>>  	};
>
> The above patch looks fine.
>
> BTW, 9ede85cb670c ("block: move queue enter logic into
> blk_mq_submit_bio()") moves the queue enter into blk_mq_submit_bio(),
> which seems dangerous, especially since blk_mq_sched_bio_merge() needs
> hctx/ctx, which requires q_usage_counter to be grabbed.

I think the best solution is to enter just for that as well, and just
retain that enter state. I'll update the patch; there are some real fixes
in there too for the batched alloc. Will post them later today.

-- 
Jens Axboe
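
[To make the ordering issue discussed in this thread concrete, here is a
minimal userspace sketch. It is not kernel code: fake_queue, queue_enter()
and queue_exit() are hypothetical stand-ins for struct request_queue,
blk_queue_enter() and blk_queue_exit(), and no concurrent elevator switch
is actually simulated. It only contrasts the two orderings the patches
above change: deciding RQF_ELV before the usage counter is held, versus
deciding it only after queue enter.]

/*
 * Minimal userspace analogue of the ordering discussed above.
 * All names are illustrative stand-ins, not the kernel API.
 * Build with: cc -std=c11 elv_order.c   (hypothetical file name)
 */
#include <stdatomic.h>
#include <stdio.h>

#define RQF_ELV	(1u << 0)

struct fake_queue {
	_Atomic(void *) elevator;	/* NULL means "none" */
	atomic_int usage_counter;	/* stands in for q_usage_counter */
};

/* Stand-in for blk_queue_enter(): in the kernel, this is the point at
 * which a concurrent elevator switch is excluded, so q->elevator is
 * stable for as long as the reference is held. */
static void queue_enter(struct fake_queue *q)
{
	atomic_fetch_add(&q->usage_counter, 1);
}

static void queue_exit(struct fake_queue *q)
{
	atomic_fetch_sub(&q->usage_counter, 1);
}

/* Old ordering: rq_flags is decided before the reference is taken, so
 * an elevator switch can complete in between and the flag goes stale. */
static unsigned int rq_flags_racy(struct fake_queue *q)
{
	unsigned int rq_flags = atomic_load(&q->elevator) ? RQF_ELV : 0;

	queue_enter(q);		/* q->elevator may no longer match rq_flags */
	queue_exit(q);
	return rq_flags;
}

/* Patched ordering: enter first, then sample q->elevator, so the flag
 * and the elevator state are decided under the same reference. */
static unsigned int rq_flags_fixed(struct fake_queue *q)
{
	unsigned int rq_flags;

	queue_enter(q);
	rq_flags = atomic_load(&q->elevator) ? RQF_ELV : 0;
	queue_exit(q);
	return rq_flags;
}

int main(void)
{
	struct fake_queue q = { .elevator = NULL };

	printf("racy:  %#x\n", rq_flags_racy(&q));
	printf("fixed: %#x\n", rq_flags_fixed(&q));
	return 0;
}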