On 12/19/18 10:03 PM, jianchao.wang wrote: > > > On 12/20/18 12:48 PM, Jens Axboe wrote: >> On 12/19/18 9:32 PM, jianchao.wang wrote: >>> >>> >>> On 12/20/18 12:19 PM, Jens Axboe wrote: >>>> On 12/19/18 8:24 PM, jianchao.wang wrote: >>>>> >>>>> >>>>> On 12/20/18 11:17 AM, Jens Axboe wrote: >>>>>> On 12/19/18 5:16 PM, Bart Van Assche wrote: >>>>>>> On Wed, 2018-12-19 at 16:27 -0700, Jens Axboe wrote: >>>>>>>> On 12/19/18 4:24 PM, Bart Van Assche wrote: >>>>>>>>> Hello, >>>>>>>>> >>>>>>>>> If I run the srp blktests in a loop then I see the below call stack appearing >>>>>>>>> sporadically. I have not yet had the time to analyze this but I'm reporting >>>>>>>>> this here in case someone else would already have had a look at this. >>>>>>>>> >>>>>>>>> Bart. >>>>>>>>> >>>>>>>>> ================================================================== >>>>>>>>> BUG: KASAN: use-after-free in bt_iter+0x86/0xf0 >>>>>>>>> Read of size 8 at addr ffff88803b335240 by task fio/21412 >>>>>>>>> >>>>>>>>> CPU: 0 PID: 21412 Comm: fio Tainted: G W 4.20.0-rc6-dbg+ #3 >>>>>>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 >>>>>>>>> Call Trace: >>>>>>>>> dump_stack+0x86/0xca >>>>>>>>> print_address_description+0x71/0x239 >>>>>>>>> kasan_report.cold.5+0x242/0x301 >>>>>>>>> __asan_load8+0x54/0x90 >>>>>>>>> bt_iter+0x86/0xf0 >>>>>>>>> blk_mq_queue_tag_busy_iter+0x373/0x5e0 >>>>>>>>> blk_mq_in_flight+0x96/0xb0 >>>>>>>>> part_in_flight+0x40/0x140 >>>>>>>>> part_round_stats+0x18e/0x370 >>>>>>>>> blk_account_io_start+0x3d7/0x670 >>>>>>>>> blk_mq_bio_to_request+0x19c/0x3a0 >>>>>>>>> blk_mq_make_request+0x7a9/0xcb0 >>>>>>>>> generic_make_request+0x41d/0x960 >>>>>>>>> submit_bio+0x9b/0x250 >>>>>>>>> do_blockdev_direct_IO+0x435c/0x4c70 >>>>>>>>> __blockdev_direct_IO+0x79/0x88 >>>>>>>>> ext4_direct_IO+0x46c/0xc00 >>>>>>>>> generic_file_direct_write+0x119/0x210 >>>>>>>>> __generic_file_write_iter+0x11c/0x280 >>>>>>>>> ext4_file_write_iter+0x1b8/0x6f0 >>>>>>>>> aio_write+0x204/0x310 >>>>>>>>> io_submit_one+0x9d3/0xe80 >>>>>>>>> __x64_sys_io_submit+0x115/0x340 >>>>>>>>> do_syscall_64+0x71/0x210 >>>>>>>>> entry_SYSCALL_64_after_hwframe+0x49/0xbe >>>>>>>>> RIP: 0033:0x7f02cf043219 >>>>>>>> >>>>>>>> I've seen this one before as well, it's not a new thing. As far as I can >>>>>>>> tell, it's a false positive. There should be no possibility for a >>>>>>>> use-after-free iterating the static tags/requests. >>>>>>> >>>>>>> Are you sure this is a false positive? >>>>>> >>>>>> No I'm not, but the few times I have seen it, I haven't been able to >>>>>> make much sense of it. It goes back quite a bit. >>>>>> >>>>>> I have not yet encountered any false >>>>>>> positive KASAN complaints. According to the following gdb output this complaint >>>>>>> refers to reading rq->q: >>>>>>> >>>>>>> (gdb) list *(bt_iter+0x86) >>>>>>> 0xffffffff816b9346 is in bt_iter (block/blk-mq-tag.c:237). >>>>>>> 232 >>>>>>> 233 /* >>>>>>> 234 * We can hit rq == NULL here, because the tagging functions >>>>>>> 235 * test and set the bit before assigning ->rqs[]. >>>>>>> 236 */ >>>>>>> 237 if (rq && rq->q == hctx->queue) >>>>>>> 238 iter_data->fn(hctx, rq, iter_data->data, reserved); >>>>>>> 239 return true; >>>>>>> 240 } >>>>>>> 241 >>>>>>> >>>>>>> From the disassembly output: >>>>>>> >>>>>>> 232 >>>>>>> 233 /* >>>>>>> 234 * We can hit rq == NULL here, because the tagging functions >>>>>>> 235 * test and set the bit before assigning ->rqs[]. >>>>>>> 236 */ >>>>>>> 237 if (rq && rq->q == hctx->queue) >>>>>>> 0xffffffff816b9339 <+121>: test %r12,%r12 >>>>>>> 0xffffffff816b933c <+124>: je 0xffffffff816b935f <bt_iter+159> >>>>>>> 0xffffffff816b933e <+126>: mov %r12,%rdi >>>>>>> 0xffffffff816b9341 <+129>: callq 0xffffffff813bd3e0 <__asan_load8> >>>>>>> 0xffffffff816b9346 <+134>: lea 0x138(%r13),%rdi >>>>>>> 0xffffffff816b934d <+141>: mov (%r12),%r14 >>>>>>> 0xffffffff816b9351 <+145>: callq 0xffffffff813bd3e0 <__asan_load8> >>>>>>> 0xffffffff816b9356 <+150>: cmp 0x138(%r13),%r14 >>>>>>> 0xffffffff816b935d <+157>: je 0xffffffff816b936f <bt_iter+175> >>>>>>> >>>>>>> BTW, rq may but does not have to refer to tags->static_rqs[...]. It may also >>>>>>> refer to hctx->fq.flush_rq. >>>>>> >>>>>> But even those are persistent for the lifetime of the queue... But since >>>>>> kasan complains it belongs to a specific page, I'm guessing it's one >>>>>> of the regular requests since those are out of a chopped up page. Which >>>>>> means it makes even less sense. >>>>>> >>>>>> Is this happening while devices are being actively torn down? And >>>>>> are you using shared tags? That's the only way I could see this >>>>>> triggering. >>>>>> >>>>> >>>>> Or could it be caused by the stale request in hctx->tags->rqs[] slot ? >>>>> We don't clear it after free the requests. >>>>> >>>>> And there could be a scenario like, >>>>> There used to be a io scheduler attached. >>>>> After some workload, the io scheduler is detached. >>>>> So there could be rqs allocated by the io scheduler left in hctx->tags->rqs. >>>>> >>>>> blk_mq_get_request blk_mq_queue_tag_busy_iter >>>>> -> blk_mq_get_tag >>>>> -> bt_for_each >>>>> -> bt_iter >>>>> -> rq = taags->rqs[] >>>>> -> rq->q >>>>> -> blk_mq_rq_ctx_init >>>>> -> data->hctx->tags->rqs[rq->tag] = rq; >>>>> >>>>> If the scenario is possible, maybe we could fix it as following. >>>> >>>> Ah yes, good point, I bet that's what it is. But we just had this exact >>>> discussion in another thread, and my point there was that we should >>>> clear these when they go away, not inline. So how about clearing entries >>>> when the sched tags go away? >>>> >>> I guess it should be OK. :) >> >> Something like this. Totally untested... And I wonder if there's a more >> efficient way to do this, not that it matters THAT much. But still. >> >> >> diff --git a/block/blk-mq.c b/block/blk-mq.c >> index 2de972857496..341cb8b9cfb7 100644 >> --- a/block/blk-mq.c >> +++ b/block/blk-mq.c >> @@ -2025,16 +2025,21 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, >> { >> struct page *page; >> >> - if (tags->rqs && set->ops->exit_request) { >> - int i; >> + if (tags->rqs) { >> + int i, j; >> >> for (i = 0; i < tags->nr_tags; i++) { >> struct request *rq = tags->static_rqs[i]; >> >> if (!rq) >> continue; >> - set->ops->exit_request(set, rq, hctx_idx); >> + if (set->ops->exit_request) >> + set->ops->exit_request(set, rq, hctx_idx); >> tags->static_rqs[i] = NULL; >> + >> + for (j = 0; j < tags->nr_tags; j++) >> + if (tags->rqs[j] == rq) >> + tags->rqs[j] = NULL; >> } >> } >> >> > > I'm afraid this cannot work. > > The 'tags' here could be the hctx->sched_tags, but what we need to > clear is hctx->tags->rqs[]. You are right, of course, a bit too quick on the trigger. This one should work better, and also avoids that silly quadratic loop. I don't think we need the tag == -1 check, but probably best to be safe. diff --git a/block/blk-mq.c b/block/blk-mq.c index 2de972857496..151891eb6fbd 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2025,16 +2025,22 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, { struct page *page; - if (tags->rqs && set->ops->exit_request) { - int i; + if (tags->rqs) { + int i, j; for (i = 0; i < tags->nr_tags; i++) { struct request *rq = tags->static_rqs[i]; if (!rq) continue; - set->ops->exit_request(set, rq, hctx_idx); + if (set->ops->exit_request) + set->ops->exit_request(set, rq, hctx_idx); tags->static_rqs[i] = NULL; + + if (rq->tag == -1) + continue; + if (set->tags[hctx_idx]->rqs[rq->tag] == rq) + set->tags[hctx_idx]->rqs[rq->tag] = NULL; } } -- Jens Axboe