Re: [PATCH v7 3/5] blk-mq: Fix races between iterating over requests and freeing requests

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 4/21/21 7:25 PM, Ming Lei wrote:
> On Tue, Apr 20, 2021 at 05:02:33PM -0700, Bart Van Assche wrote:
>> When multiple request queues share a tag set and when switching the I/O
>> scheduler for one of the request queues associated with that tag set, the
>> following race can happen:
>> * blk_mq_tagset_busy_iter() calls bt_tags_iter() and bt_tags_iter() assigns
>>   a pointer to a scheduler request to the local variable 'rq'.
>> * blk_mq_sched_free_requests() is called to free hctx->sched_tags.
>> * blk_mq_tagset_busy_iter() dereferences 'rq' and triggers a use-after-free.
>>
>> Fix this race as follows:
>> * Use rcu_assign_pointer() and rcu_dereference() to access hctx->tags->rqs[].
>>   The performance impact of the assignments added to the hot path is minimal
>>   (about 1% according to one particular test).
>> * Protect hctx->tags->rqs[] reads with an RCU read-side lock or with a
>>   semaphore. Which mechanism is used depends on whether or not it is allowed
>>   to sleep and also on whether or not the callback function may sleep.
>> * Wait for all preexisting readers to finish before freeing scheduler
>>   requests.
>>
>> Another race is as follows:
>> * blk_mq_sched_free_requests() is called to free hctx->sched_tags.
>> * blk_mq_queue_tag_busy_iter() iterates over the same tag set but for another
>>   request queue than the queue for which scheduler tags are being freed.
>> * bt_iter() examines rq->q after *rq has been freed.
>>
>> Fix this race by protecting the rq->q read in bt_iter() with an RCU read
>> lock and by calling synchronize_rcu() before freeing scheduler tags.
>>
>> Multiple users have reported use-after-free complaints similar to the
>> following (from https://lore.kernel.org/linux-block/1545261885.185366.488.camel@xxxxxxx/ ):
>>
>> BUG: KASAN: use-after-free in bt_iter+0x86/0xf0
>> Read of size 8 at addr ffff88803b335240 by task fio/21412
>>
>> CPU: 0 PID: 21412 Comm: fio Tainted: G        W         4.20.0-rc6-dbg+ #3
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
>> Call Trace:
>>  dump_stack+0x86/0xca
>>  print_address_description+0x71/0x239
>>  kasan_report.cold.5+0x242/0x301
>>  __asan_load8+0x54/0x90
>>  bt_iter+0x86/0xf0
>>  blk_mq_queue_tag_busy_iter+0x373/0x5e0
>>  blk_mq_in_flight+0x96/0xb0
>>  part_in_flight+0x40/0x140
>>  part_round_stats+0x18e/0x370
>>  blk_account_io_start+0x3d7/0x670
>>  blk_mq_bio_to_request+0x19c/0x3a0
>>  blk_mq_make_request+0x7a9/0xcb0
>>  generic_make_request+0x41d/0x960
>>  submit_bio+0x9b/0x250
>>  do_blockdev_direct_IO+0x435c/0x4c70
>>  __blockdev_direct_IO+0x79/0x88
>>  ext4_direct_IO+0x46c/0xc00
>>  generic_file_direct_write+0x119/0x210
>>  __generic_file_write_iter+0x11c/0x280
>>  ext4_file_write_iter+0x1b8/0x6f0
>>  aio_write+0x204/0x310
>>  io_submit_one+0x9d3/0xe80
>>  __x64_sys_io_submit+0x115/0x340
>>  do_syscall_64+0x71/0x210
>>
>> Reviewed-by: Khazhismel Kumykov <khazhy@xxxxxxxxxx>
>> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@xxxxxxx>
>> Cc: Christoph Hellwig <hch@xxxxxx>
>> Cc: Martin K. Petersen <martin.petersen@xxxxxxxxxx>
>> Cc: Shin'ichiro Kawasaki <shinichiro.kawasaki@xxxxxxx>
>> Cc: Ming Lei <ming.lei@xxxxxxxxxx>
>> Cc: Hannes Reinecke <hare@xxxxxxx>
>> Cc: Johannes Thumshirn <johannes.thumshirn@xxxxxxx>
>> Cc: John Garry <john.garry@xxxxxxxxxx>
>> Cc: Khazhy Kumykov <khazhy@xxxxxxxxxx>
>> Signed-off-by: Bart Van Assche <bvanassche@xxxxxxx>
>> ---
>>  block/blk-core.c   | 34 ++++++++++++++++++++++++++++++-
>>  block/blk-mq-tag.c | 51 ++++++++++++++++++++++++++++++++++++++++------
>>  block/blk-mq-tag.h |  4 +++-
>>  block/blk-mq.c     | 21 +++++++++++++++----
>>  block/blk-mq.h     |  1 +
>>  block/blk.h        |  2 ++
>>  block/elevator.c   |  1 +
>>  7 files changed, 102 insertions(+), 12 deletions(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 9bcdae93f6d4..ca7f833e25a8 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -279,6 +279,36 @@ void blk_dump_rq_flags(struct request *rq, char *msg)
>>  }
>>  EXPORT_SYMBOL(blk_dump_rq_flags);
>>  
>> +/**
>> + * blk_mq_wait_for_tag_iter - wait for preexisting tag iteration functions to finish
>> + * @set: Tag set to wait on.
>> + *
>> + * Waits for preexisting calls of blk_mq_all_tag_iter(),
>> + * blk_mq_tagset_busy_iter() and also for their atomic variants to finish
>> + * accessing hctx->tags->rqs[]. New readers may start while this function is
>> + * in progress or after this function has returned. Use this function to make
>> + * sure that hctx->tags->rqs[] changes have become globally visible.
>> + *
>> + * Waits for preexisting blk_mq_queue_tag_busy_iter(q, fn, priv) calls to
>> + * finish accessing requests associated with other request queues than 'q'.
>> + */
>> +void blk_mq_wait_for_tag_iter(struct blk_mq_tag_set *set)
>> +{
>> +	struct blk_mq_tags *tags;
>> +	int i;
>> +
>> +	if (set->tags) {
>> +		for (i = 0; i < set->nr_hw_queues; i++) {
>> +			tags = set->tags[i];
>> +			if (!tags)
>> +				continue;
>> +			down_write(&tags->iter_rwsem);
>> +			up_write(&tags->iter_rwsem);
>> +		}
>> +	}
>> +	synchronize_rcu();
>> +}
>> +
>>  /**
>>   * blk_sync_queue - cancel any pending callbacks on a queue
>>   * @q: the queue
>> @@ -412,8 +442,10 @@ void blk_cleanup_queue(struct request_queue *q)
>>  	 * it is safe to free requests now.
>>  	 */
>>  	mutex_lock(&q->sysfs_lock);
>> -	if (q->elevator)
>> +	if (q->elevator) {
>> +		blk_mq_wait_for_tag_iter(q->tag_set);
>>  		blk_mq_sched_free_requests(q);
>> +	}
>>  	mutex_unlock(&q->sysfs_lock);
>>  
>>  	percpu_ref_exit(&q->q_usage_counter);
>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>> index d8eaa38a1bd1..39d5c9190a6b 100644
>> --- a/block/blk-mq-tag.c
>> +++ b/block/blk-mq-tag.c
>> @@ -209,14 +209,24 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>>  
>>  	if (!reserved)
>>  		bitnr += tags->nr_reserved_tags;
>> -	rq = tags->rqs[bitnr];
>> -
>> +	rcu_read_lock();
>> +	/*
>> +	 * The request 'rq' points at is protected by an RCU read lock until
>> +	 * its queue pointer has been verified and by q_usage_count while the
>> +	 * callback function is being invoked. See also the
>> +	 * percpu_ref_tryget() and blk_queue_exit() calls in
>> +	 * blk_mq_queue_tag_busy_iter().
>> +	 */
>> +	rq = rcu_dereference(tags->rqs[bitnr]);
>>  	/*
>>  	 * We can hit rq == NULL here, because the tagging functions
>>  	 * test and set the bit before assigning ->rqs[].
>>  	 */
>> -	if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx)
>> +	if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx) {
>> +		rcu_read_unlock();
>>  		return iter_data->fn(hctx, rq, iter_data->data, reserved);
>> +	}
>> +	rcu_read_unlock();
>>  	return true;
>>  }
>>  
>> @@ -254,11 +264,17 @@ struct bt_tags_iter_data {
>>  	unsigned int flags;
>>  };
>>  
>> +/* Include reserved tags. */
>>  #define BT_TAG_ITER_RESERVED		(1 << 0)
>> +/* Only include started requests. */
>>  #define BT_TAG_ITER_STARTED		(1 << 1)
>> +/* Iterate over tags->static_rqs[] instead of tags->rqs[]. */
>>  #define BT_TAG_ITER_STATIC_RQS		(1 << 2)
>> +/* The callback function may sleep. */
>> +#define BT_TAG_ITER_MAY_SLEEP		(1 << 3)
>>  
>> -static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>> +static bool __bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr,
>> +			   void *data)
>>  {
>>  	struct bt_tags_iter_data *iter_data = data;
>>  	struct blk_mq_tags *tags = iter_data->tags;
>> @@ -275,7 +291,8 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>>  	if (iter_data->flags & BT_TAG_ITER_STATIC_RQS)
>>  		rq = tags->static_rqs[bitnr];
>>  	else
>> -		rq = tags->rqs[bitnr];
>> +		rq = rcu_dereference_check(tags->rqs[bitnr],
>> +					   lockdep_is_held(&tags->iter_rwsem));
>>  	if (!rq)
>>  		return true;
>>  	if ((iter_data->flags & BT_TAG_ITER_STARTED) &&
>> @@ -284,6 +301,25 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>>  	return iter_data->fn(rq, iter_data->data, reserved);
>>  }
>>  
>> +static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>> +{
>> +	struct bt_tags_iter_data *iter_data = data;
>> +	struct blk_mq_tags *tags = iter_data->tags;
>> +	bool res;
>> +
>> +	if (iter_data->flags & BT_TAG_ITER_MAY_SLEEP) {
>> +		down_read(&tags->iter_rwsem);
>> +		res = __bt_tags_iter(bitmap, bitnr, data);
>> +		up_read(&tags->iter_rwsem);
>> +	} else {
>> +		rcu_read_lock();
>> +		res = __bt_tags_iter(bitmap, bitnr, data);
>> +		rcu_read_unlock();
>> +	}
>> +
>> +	return res;
>> +}
>> +
>>  /**
>>   * bt_tags_for_each - iterate over the requests in a tag map
>>   * @tags:	Tag map to iterate over.
>> @@ -357,10 +393,12 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>>  {
>>  	int i;
>>  
>> +	might_sleep();
>> +
>>  	for (i = 0; i < tagset->nr_hw_queues; i++) {
>>  		if (tagset->tags && tagset->tags[i])
>>  			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
>> -					      BT_TAG_ITER_STARTED);
>> +				BT_TAG_ITER_STARTED | BT_TAG_ITER_MAY_SLEEP);
>>  	}
>>  }
>>  EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
>> @@ -544,6 +582,7 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
>>  
>>  	tags->nr_tags = total_tags;
>>  	tags->nr_reserved_tags = reserved_tags;
>> +	init_rwsem(&tags->iter_rwsem);
>>  
>>  	if (blk_mq_is_sbitmap_shared(flags))
>>  		return tags;
>> diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
>> index 0290c308ece9..d1d73d7cc7df 100644
>> --- a/block/blk-mq-tag.h
>> +++ b/block/blk-mq-tag.h
>> @@ -17,9 +17,11 @@ struct blk_mq_tags {
>>  	struct sbitmap_queue __bitmap_tags;
>>  	struct sbitmap_queue __breserved_tags;
>>  
>> -	struct request **rqs;
>> +	struct request __rcu **rqs;
>>  	struct request **static_rqs;
>>  	struct list_head page_list;
>> +
>> +	struct rw_semaphore iter_rwsem;
>>  };
>>  
>>  extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags,
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 79c01b1f885c..8b59f6b4ec8e 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -496,8 +496,10 @@ static void __blk_mq_free_request(struct request *rq)
>>  	blk_crypto_free_request(rq);
>>  	blk_pm_mark_last_busy(rq);
>>  	rq->mq_hctx = NULL;
>> -	if (rq->tag != BLK_MQ_NO_TAG)
>> +	if (rq->tag != BLK_MQ_NO_TAG) {
>>  		blk_mq_put_tag(hctx->tags, ctx, rq->tag);
>> +		rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);
> 
> After the tag is released, it can be re-allocated from another context
> immediately. And the above rcu_assign_pointer(NULL) could be run after
> the re-allocation and new ->rqs[rq->tag] assignment in submission path,
> is it one issue?

How about swapping the rcu_assign_pointer() and blk_mq_put_tag() calls?
That should be safe since the tag iteration functions check whether or
not hctx->tags->rqs[] is NULL.

There are already barriers in sbitmap_queue_clear() so I don't think
that any new barriers would need to be added.

Thanks,

Bart.




[Index of Archives]     [Linux RAID]     [Linux SCSI]     [Linux ATA RAID]     [IDE]     [Linux Wireless]     [Linux Kernel]     [ATH6KL]     [Linux Bluetooth]     [Linux Netdev]     [Kernel Newbies]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Device Mapper]

  Powered by Linux