On 4/21/21 7:25 PM, Ming Lei wrote:
> On Tue, Apr 20, 2021 at 05:02:33PM -0700, Bart Van Assche wrote:
>> When multiple request queues share a tag set and when switching the I/O
>> scheduler for one of the request queues associated with that tag set, the
>> following race can happen:
>> * blk_mq_tagset_busy_iter() calls bt_tags_iter() and bt_tags_iter() assigns
>>   a pointer to a scheduler request to the local variable 'rq'.
>> * blk_mq_sched_free_requests() is called to free hctx->sched_tags.
>> * blk_mq_tagset_busy_iter() dereferences 'rq' and triggers a use-after-free.
>>
>> Fix this race as follows:
>> * Use rcu_assign_pointer() and rcu_dereference() to access hctx->tags->rqs[].
>>   The performance impact of the assignments added to the hot path is minimal
>>   (about 1% according to one particular test).
>> * Protect hctx->tags->rqs[] reads with an RCU read-side lock or with a
>>   semaphore. Which mechanism is used depends on whether or not it is allowed
>>   to sleep and also on whether or not the callback function may sleep.
>> * Wait for all preexisting readers to finish before freeing scheduler
>>   requests.
>>
>> Another race is as follows:
>> * blk_mq_sched_free_requests() is called to free hctx->sched_tags.
>> * blk_mq_queue_tag_busy_iter() iterates over the same tag set but for another
>>   request queue than the queue for which scheduler tags are being freed.
>> * bt_iter() examines rq->q after *rq has been freed.
>>
>> Fix this race by protecting the rq->q read in bt_iter() with an RCU read
>> lock and by calling synchronize_rcu() before freeing scheduler tags.
>>
>> Multiple users have reported use-after-free complaints similar to the
>> following (from https://lore.kernel.org/linux-block/1545261885.185366.488.camel@xxxxxxx/ ):
>>
>> BUG: KASAN: use-after-free in bt_iter+0x86/0xf0
>> Read of size 8 at addr ffff88803b335240 by task fio/21412
>>
>> CPU: 0 PID: 21412 Comm: fio Tainted: G W 4.20.0-rc6-dbg+ #3
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
>> Call Trace:
>>  dump_stack+0x86/0xca
>>  print_address_description+0x71/0x239
>>  kasan_report.cold.5+0x242/0x301
>>  __asan_load8+0x54/0x90
>>  bt_iter+0x86/0xf0
>>  blk_mq_queue_tag_busy_iter+0x373/0x5e0
>>  blk_mq_in_flight+0x96/0xb0
>>  part_in_flight+0x40/0x140
>>  part_round_stats+0x18e/0x370
>>  blk_account_io_start+0x3d7/0x670
>>  blk_mq_bio_to_request+0x19c/0x3a0
>>  blk_mq_make_request+0x7a9/0xcb0
>>  generic_make_request+0x41d/0x960
>>  submit_bio+0x9b/0x250
>>  do_blockdev_direct_IO+0x435c/0x4c70
>>  __blockdev_direct_IO+0x79/0x88
>>  ext4_direct_IO+0x46c/0xc00
>>  generic_file_direct_write+0x119/0x210
>>  __generic_file_write_iter+0x11c/0x280
>>  ext4_file_write_iter+0x1b8/0x6f0
>>  aio_write+0x204/0x310
>>  io_submit_one+0x9d3/0xe80
>>  __x64_sys_io_submit+0x115/0x340
>>  do_syscall_64+0x71/0x210
>>
>> Reviewed-by: Khazhismel Kumykov <khazhy@xxxxxxxxxx>
>> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@xxxxxxx>
>> Cc: Christoph Hellwig <hch@xxxxxx>
>> Cc: Martin K. Petersen <martin.petersen@xxxxxxxxxx>
>> Cc: Shin'ichiro Kawasaki <shinichiro.kawasaki@xxxxxxx>
>> Cc: Ming Lei <ming.lei@xxxxxxxxxx>
>> Cc: Hannes Reinecke <hare@xxxxxxx>
>> Cc: Johannes Thumshirn <johannes.thumshirn@xxxxxxx>
>> Cc: John Garry <john.garry@xxxxxxxxxx>
>> Cc: Khazhy Kumykov <khazhy@xxxxxxxxxx>
>> Signed-off-by: Bart Van Assche <bvanassche@xxxxxxx>
>> ---
>>  block/blk-core.c   | 34 ++++++++++++++++++++++++++++++-
>>  block/blk-mq-tag.c | 51 ++++++++++++++++++++++++++++++++++++++++------
>>  block/blk-mq-tag.h |  4 +++-
>>  block/blk-mq.c     | 21 +++++++++++++++----
>>  block/blk-mq.h     |  1 +
>>  block/blk.h        |  2 ++
>>  block/elevator.c   |  1 +
>>  7 files changed, 102 insertions(+), 12 deletions(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 9bcdae93f6d4..ca7f833e25a8 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -279,6 +279,36 @@ void blk_dump_rq_flags(struct request *rq, char *msg)
>>  }
>>  EXPORT_SYMBOL(blk_dump_rq_flags);
>>
>> +/**
>> + * blk_mq_wait_for_tag_iter - wait for preexisting tag iteration functions to finish
>> + * @set: Tag set to wait on.
>> + *
>> + * Waits for preexisting calls of blk_mq_all_tag_iter(),
>> + * blk_mq_tagset_busy_iter() and also for their atomic variants to finish
>> + * accessing hctx->tags->rqs[]. New readers may start while this function is
>> + * in progress or after this function has returned. Use this function to make
>> + * sure that hctx->tags->rqs[] changes have become globally visible.
>> + *
>> + * Waits for preexisting blk_mq_queue_tag_busy_iter(q, fn, priv) calls to
>> + * finish accessing requests associated with other request queues than 'q'.
>> + */
>> +void blk_mq_wait_for_tag_iter(struct blk_mq_tag_set *set)
>> +{
>> +        struct blk_mq_tags *tags;
>> +        int i;
>> +
>> +        if (set->tags) {
>> +                for (i = 0; i < set->nr_hw_queues; i++) {
>> +                        tags = set->tags[i];
>> +                        if (!tags)
>> +                                continue;
>> +                        down_write(&tags->iter_rwsem);
>> +                        up_write(&tags->iter_rwsem);
>> +                }
>> +        }
>> +        synchronize_rcu();
>> +}
>> +
>>  /**
>>   * blk_sync_queue - cancel any pending callbacks on a queue
>>   * @q: the queue
>> @@ -412,8 +442,10 @@ void blk_cleanup_queue(struct request_queue *q)
>>           * it is safe to free requests now.
>>           */
>>          mutex_lock(&q->sysfs_lock);
>> -        if (q->elevator)
>> +        if (q->elevator) {
>> +                blk_mq_wait_for_tag_iter(q->tag_set);
>>                  blk_mq_sched_free_requests(q);
>> +        }
>>          mutex_unlock(&q->sysfs_lock);
>>
>>          percpu_ref_exit(&q->q_usage_counter);
>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>> index d8eaa38a1bd1..39d5c9190a6b 100644
>> --- a/block/blk-mq-tag.c
>> +++ b/block/blk-mq-tag.c
>> @@ -209,14 +209,24 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>>
>>          if (!reserved)
>>                  bitnr += tags->nr_reserved_tags;
>> -        rq = tags->rqs[bitnr];
>> -
>> +        rcu_read_lock();
>> +        /*
>> +         * The request 'rq' points at is protected by an RCU read lock until
>> +         * its queue pointer has been verified and by q_usage_count while the
>> +         * callback function is being invoked. See also the
>> +         * percpu_ref_tryget() and blk_queue_exit() calls in
>> +         * blk_mq_queue_tag_busy_iter().
>> +         */
>> +        rq = rcu_dereference(tags->rqs[bitnr]);
>>          /*
>>           * We can hit rq == NULL here, because the tagging functions
>>           * test and set the bit before assigning ->rqs[].
>>           */
>> -        if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx)
>> +        if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx) {
>> +                rcu_read_unlock();
>>                  return iter_data->fn(hctx, rq, iter_data->data, reserved);
>> +        }
>> +        rcu_read_unlock();
>>          return true;
>>  }
>>
>> @@ -254,11 +264,17 @@ struct bt_tags_iter_data {
>>          unsigned int flags;
>>  };
>>
>> +/* Include reserved tags. */
>>  #define BT_TAG_ITER_RESERVED        (1 << 0)
>> +/* Only include started requests. */
>>  #define BT_TAG_ITER_STARTED         (1 << 1)
>> +/* Iterate over tags->static_rqs[] instead of tags->rqs[]. */
>>  #define BT_TAG_ITER_STATIC_RQS      (1 << 2)
>> +/* The callback function may sleep. */
>> +#define BT_TAG_ITER_MAY_SLEEP       (1 << 3)
>>
>> -static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>> +static bool __bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr,
>> +                           void *data)
>>  {
>>          struct bt_tags_iter_data *iter_data = data;
>>          struct blk_mq_tags *tags = iter_data->tags;
>> @@ -275,7 +291,8 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>>          if (iter_data->flags & BT_TAG_ITER_STATIC_RQS)
>>                  rq = tags->static_rqs[bitnr];
>>          else
>> -                rq = tags->rqs[bitnr];
>> +                rq = rcu_dereference_check(tags->rqs[bitnr],
>> +                                           lockdep_is_held(&tags->iter_rwsem));
>>          if (!rq)
>>                  return true;
>>          if ((iter_data->flags & BT_TAG_ITER_STARTED) &&
>> @@ -284,6 +301,25 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>>          return iter_data->fn(rq, iter_data->data, reserved);
>>  }
>>
>> +static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>> +{
>> +        struct bt_tags_iter_data *iter_data = data;
>> +        struct blk_mq_tags *tags = iter_data->tags;
>> +        bool res;
>> +
>> +        if (iter_data->flags & BT_TAG_ITER_MAY_SLEEP) {
>> +                down_read(&tags->iter_rwsem);
>> +                res = __bt_tags_iter(bitmap, bitnr, data);
>> +                up_read(&tags->iter_rwsem);
>> +        } else {
>> +                rcu_read_lock();
>> +                res = __bt_tags_iter(bitmap, bitnr, data);
>> +                rcu_read_unlock();
>> +        }
>> +
>> +        return res;
>> +}
>> +
>>  /**
>>   * bt_tags_for_each - iterate over the requests in a tag map
>>   * @tags:       Tag map to iterate over.
>> @@ -357,10 +393,12 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>>  {
>>          int i;
>>
>> +        might_sleep();
>> +
>>          for (i = 0; i < tagset->nr_hw_queues; i++) {
>>                  if (tagset->tags && tagset->tags[i])
>>                          __blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
>> -                                              BT_TAG_ITER_STARTED);
>> +                                              BT_TAG_ITER_STARTED | BT_TAG_ITER_MAY_SLEEP);
>>          }
>>  }
>>  EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
>> @@ -544,6 +582,7 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
>>
>>          tags->nr_tags = total_tags;
>>          tags->nr_reserved_tags = reserved_tags;
>> +        init_rwsem(&tags->iter_rwsem);
>>
>>          if (blk_mq_is_sbitmap_shared(flags))
>>                  return tags;
>> diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
>> index 0290c308ece9..d1d73d7cc7df 100644
>> --- a/block/blk-mq-tag.h
>> +++ b/block/blk-mq-tag.h
>> @@ -17,9 +17,11 @@ struct blk_mq_tags {
>>          struct sbitmap_queue __bitmap_tags;
>>          struct sbitmap_queue __breserved_tags;
>>
>> -        struct request **rqs;
>> +        struct request __rcu **rqs;
>>          struct request **static_rqs;
>>          struct list_head page_list;
>> +
>> +        struct rw_semaphore iter_rwsem;
>>  };
>>
>>  extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags,
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 79c01b1f885c..8b59f6b4ec8e 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -496,8 +496,10 @@ static void __blk_mq_free_request(struct request *rq)
>>          blk_crypto_free_request(rq);
>>          blk_pm_mark_last_busy(rq);
>>          rq->mq_hctx = NULL;
>> -        if (rq->tag != BLK_MQ_NO_TAG)
>> +        if (rq->tag != BLK_MQ_NO_TAG) {
>>                  blk_mq_put_tag(hctx->tags, ctx, rq->tag);
>> +                rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);
>
> After the tag is released, it can be re-allocated from another context
> immediately. And the above rcu_assign_pointer(NULL) could be run after
> the re-allocation and new ->rqs[rq->tag] assignment in submission path,
> is it one issue?

How about swapping the rcu_assign_pointer() and blk_mq_put_tag() calls?
That should be safe since the tag iteration functions check whether or
not hctx->tags->rqs[] is NULL. There are already barriers in
sbitmap_queue_clear() so I don't think that any new barriers would need
to be added.
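In __blk_mq_free_request() the swapped order would look roughly as
follows. Untested sketch, for illustration only:

        if (rq->tag != BLK_MQ_NO_TAG) {
                /*
                 * Clear the ->rqs[] slot before releasing the tag, so
                 * that a context that re-allocates the same tag right
                 * after blk_mq_put_tag() cannot have its new
                 * ->rqs[rq->tag] assignment overwritten by this NULL
                 * store. blk_mq_put_tag() ends up in
                 * sbitmap_queue_clear(), which provides the barriers
                 * mentioned above.
                 */
                rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);
                blk_mq_put_tag(hctx->tags, ctx, rq->tag);
        }

Thanks,

Bart.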