Re: [PATCH V6 6/9] blk-mq: always free hctx after request queue is freed

On 4/24/19 3:45 AM, Ming Lei wrote:
On Wed, Apr 24, 2019 at 09:12:42AM +0800, Ming Lei wrote:
On Tue, Apr 23, 2019 at 04:07:49PM +0200, Hannes Reinecke wrote:
On 4/23/19 3:30 PM, Ming Lei wrote:
Hi Hannes,

Thanks for your response.

On Tue, Apr 23, 2019 at 01:19:33PM +0200, Hannes Reinecke wrote:
On 4/22/19 5:30 AM, Ming Lei wrote:
[ .. ]

Hi Hannes,

Could you please let us know if you have a better idea for this issue?
Otherwise, I think we need to move on since it is a real issue, and users
want it fixed.

Okay. Having looked over the problem and possible alternatives, this does
indeed look like a viable solution.
I do agree that it's a sensible design to have an additional holding area
for hardware context elements, given that they might be reassigned during
blk_mq_realloc_hw_ctxs().
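(To spell out what I have in mind: blk_mq_realloc_hw_ctxs() would first try to
pick an hctx off that holding list before allocating a fresh one. A rough
sketch only, reusing the list/lock names and the set/node variables from this
patchset, so please don't read it as the final code:)

	struct blk_mq_hw_ctx *hctx = NULL, *tmp;

	/* reuse a previously exited hctx if one is parked on the list */
	spin_lock(&q->dead_hctx_lock);
	list_for_each_entry(tmp, &q->dead_hctx_list, hctx_list) {
		if (tmp->numa_node == node) {
			hctx = tmp;
			break;
		}
	}
	if (hctx)
		list_del_init(&hctx->hctx_list);
	spin_unlock(&q->dead_hctx_lock);

	/* otherwise fall back to a fresh allocation */
	if (!hctx)
		hctx = blk_mq_alloc_hctx(q, set, node);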

However, I would rename the 'dead' elements to 'unused' (i.e. unused_hctx_list
etc.).

OK, the name 'unused' does look better.


And I would deallocate q->queue_hw_ctx in blk_mq_exit_queue() to make things
more consistent.

No, that is wrong.

The request queue's refcount is often still held while blk_cleanup_queue() is
running, and blk_mq_exit_queue() is called directly from blk_cleanup_queue(). One
invariant is that we have to keep most APIs working as long as the request queue
is live from the kobject point of view. If q->queue_hw_ctx were freed in
blk_mq_exit_queue(), it would be quite easy to cause a use-after-free.
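(Just to illustrate: a pattern like the sketch below, where do_something() is
only a placeholder for any hctx access, is expected to keep working for a user
of the block layer as long as a queue reference is held. If q->queue_hw_ctx
were freed in blk_mq_exit_queue(), the loop would become a use-after-free once
blk_cleanup_queue() has run on another context:)

	struct blk_mq_hw_ctx *hctx;
	int i;

	/* hold a reference, so the queue kobject stays live */
	if (!blk_get_queue(q))
		return;

	/* blk_cleanup_queue() -> blk_mq_exit_queue() may run meanwhile */
	queue_for_each_hw_ctx(q, hctx, i)
		do_something(hctx);	/* placeholder for any hctx access */

	blk_put_queue(q);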

Ah. Thought as much.
But then in most cases the ->queue_hw_ctx pointer is immaterial as we're
accessing things via the hctx pointer, which remains valid.

The problem with the current patch is that in blk_mq_release() we iterate
the 'dead_hctx_list' and free up everything on it, but then blindly call
kfree(q->queue_hw_ctx) without checking whether any context pointers are
left.

If the request queue is dead, it is safe to assume that there isn't any
reference left to the request queue or to q->queue_hw_ctx. Otherwise, it must
be a bug somewhere.

Precisely.
What I'm trying to achieve with this is to protect against such issues,
which are quite easy to introduce given the complexity of the code ...

But releasing q->queue_hw_ctx in blk_cleanup_queue() can easily cause a
use-after-free even though the request queue's refcount is still held. We
can't simply do that.

If someone is still trying to use q->queue_hw_ctx[] after the request
queue is dead, the bug is in the caller of the block layer API, not in the
block layer.

What this patchset is trying to fix is the race inside the block layer, not
in users of the block layer and not in drivers. So far, I don't see such a
driver issue.

If you think of q->queue_hw_ctx as the request queue's own resource, you will
see that it is pretty reasonable to free q->queue_hw_ctx in the queue's
release handler.


When moving the kfree() call to blk_mq_exit_queue(), we could do a simple
WARN_ON(q->queue_hw_ctx) in blk_mq_release() to ensure that everything got
deallocated properly.

At that time, an hctx instance might still be active, but that is fine given
each hctx is covered by its own kobject. What we need to make sure is that
there are no remaining references to q->queue_hw_ctx or to the request queue.
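(That is, the hctx memory itself only goes away from the hctx kobject's
release handler once its own refcount drops; roughly along the lines of the
sketch below, with the exact set of per-hctx resources released there
following this series:)

static void blk_mq_hw_sysfs_release(struct kobject *kobj)
{
	struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx,
						   kobj);

	/* per-hctx resources are freed here, not in queue teardown */
	free_cpumask_var(hctx->cpumask);
	kfree(hctx->ctxs);
	kfree(hctx);
}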

My point being here:
void blk_mq_release(struct request_queue *q)
{
         struct blk_mq_hw_ctx *hctx, *next;

         cancel_delayed_work_sync(&q->requeue_work);

         /* all hctx are in .dead_hctx_list now */
         list_for_each_entry_safe(hctx, next, &q->dead_hctx_list, hctx_list) {
                 list_del_init(&hctx->hctx_list);
                 kobject_put(&hctx->kobj);
         }

         kfree(q->queue_hw_ctx);

         /*
          * release .mq_kobj and sw queue's kobject now because
          * both share lifetime with request queue.
          */
         blk_mq_sysfs_deinit(q);
}

This assumes that _all_ hctx pointers are removed from q->queue_hw_ctx
and moved to the 'dead' list.
If for some reason this is not the case, we'll be leaking hctx pointers here.

IMO, there is no such reason. When blk_mq_release() is called,
every hctx of this request queue has been "exited" via blk_mq_exit_hctx(),
either from blk_mq_update_nr_hw_queues() or from blk_cleanup_queue().

If there are hctxs not moved to the 'dead' (or 'unused') list here, it is
simply a bug in blk-mq. However, I don't see such a case now.
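(In other words, the tail of blk_mq_exit_hctx() in this series parks the
exited hctx on that list, roughly like the sketch below; the list/lock names
would of course follow the rename:)

	/* last step of blk_mq_exit_hctx(): park the hctx for later reuse/free */
	spin_lock(&q->dead_hctx_lock);
	list_add(&hctx->hctx_list, &q->dead_hctx_list);
	spin_unlock(&q->dead_hctx_lock);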

Or we can add the following check in blk_mq_release() to catch such an
(otherwise impossible) blk-mq bug:

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2ca4395f794d..f0d08087b8f6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2366,6 +2366,8 @@ blk_mq_alloc_hctx(struct request_queue *q,
  	hctx->queue = q;
  	hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;
 
+	INIT_LIST_HEAD(&hctx->hctx_list);
+
  	/*
  	 * Allocate space for all possible cpus to avoid allocation at
  	 * runtime
@@ -2680,9 +2682,14 @@ static int blk_mq_alloc_ctxs(struct request_queue *q)
  void blk_mq_release(struct request_queue *q)
  {
  	struct blk_mq_hw_ctx *hctx, *next;
+	int i;
 
 	cancel_delayed_work_sync(&q->requeue_work);
 
+	/* warn if live hctx is found, that shouldn't happen */
+	queue_for_each_hw_ctx(q, hctx, i)
+		WARN_ON_ONCE(hctx && list_empty(&hctx->hctx_list));
+
  	/* all hctx are in .dead_hctx_list now */
  	list_for_each_entry_safe(hctx, next, &q->dead_hctx_list, hctx_list) {
  		list_del_init(&hctx->hctx_list);

Precisely; that would've been my other choice for fixing it, but I didn't use it
as I tried to avoid the loop. But if you agree, then I'm fine with it, too.

Please add this hunk to the patch.

Cheers,

Hannes
--
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@xxxxxxx			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


