Re: [PATCH 5/9] blk-mq: don't set data->ctx and data->hctx in blk_mq_alloc_request_hctx

On Mon, May 18, 2020 at 06:56:19PM +0200, Christoph Hellwig wrote:
> On Mon, May 18, 2020 at 10:11:07PM +0800, Ming Lei wrote:
> > On Mon, May 18, 2020 at 03:16:34PM +0200, Christoph Hellwig wrote:
> > > On Mon, May 18, 2020 at 07:54:54PM +0800, Ming Lei wrote:
> > > > 
> > > > I guess I misunderstood your point, sorry for that.
> > > > 
> > > > The requirement is just that the request needs to be allocated on one online
> > > > CPU after INACTIVE is set, and we can use a workqueue to do that.
> > > 
> > > I've looked over the code again, and I'm really not sure why we need that.
> > > Presumable the CPU hotplug code ensures tasks don't get schedule on the
> > > CPU running the shutdown state machine, so if we just do a retry of the
> > 
> > percpu kthread still can be scheduled on the cpu to be online, see
> > is_cpu_allowed(). And bound wq has been used widely in fs code.
> 
> s/to be online/to be offlined/ I guess.
> 
> Shouldn't all the per-cpu kthreads also stop as part of the offlining?
> If they don't quiesce before the new blk-mq stop state I think we need
> to make sure they do.  It is rather pointless to quiesce the requests
> if a thread that can submit I/O is still live.

As Thomas clarified, the workqueue subsystem no longer has this issue; only
other per-CPU kthreads can keep running until the CPU clears its online bit.

So the question is whether I/O can be submitted from such kernel contexts.

> 
> > > 
> > >     http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/blk-mq-hotplug.2
> > > 
> > 
> > After preempt_disable() is removed, the INACTIVE bit is set in the CPU
> > to be offline, and the bit can be read on all other CPUs, so other CPUs
> > may not get synced with the INACTIVE bit.
> 
> INACTIVE is set to the hctx, and it is set by the last CPU to be
> offlined that is mapped to the hctx.  once the bit is set the barrier
> ensured it is seen everywhere before we start waiting for the requests
> to finish.  What is missing?:

Memory barriers should always be used in pairs, and it is worth stating
explicitly that the barrier implied by test_and_set_bit_lock() in sbitmap_get()
pairs with smp_mb__after_atomic() in blk_mq_hctx_notify_offline().

With that pairing, setting the tag bit and checking INACTIVE in blk_mq_get_tag()
are ordered, and likewise setting INACTIVE and checking the tag bit in
blk_mq_hctx_notify_offline().

This usage is essentially the same as the smp_mb() in blk_mq_get_driver_tag()
in the earlier version.

BTW, the smp_mb__before_atomic() in blk_mq_hctx_notify_offline() isn't needed.

Thanks,
Ming



