On Thu, Apr 30, 2020 at 12:04:29PM +0100, Will Deacon wrote:
> On Thu, Apr 30, 2020 at 08:39:45AM +0800, Ming Lei wrote:
> > On Wed, Apr 29, 2020 at 06:34:01PM +0100, Will Deacon wrote:
> > > On Wed, Apr 29, 2020 at 09:43:27PM +0800, Ming Lei wrote:
> > > > Please see the following two code paths:
> > > >
> > > > [1] code path1:
> > > >     blk_mq_hctx_notify_offline():
> > > >         set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
> > > >
> > > >         smp_mb() or smp_mb_after_atomic()
> > > >
> > > >         blk_mq_hctx_drain_inflight_rqs():
> > > >             blk_mq_tags_inflight_rqs()
> > > >                 rq = hctx->tags->rqs[index]
> > > >                 and
> > > >                 READ rq->tag
> > > >
> > > > [2] code path2:
> > > >     blk_mq_get_driver_tag():
> > > >
> > > >         process might be migrated to other CPU here and chance is small,
> > > >         then the follow code will be run on CPU different with code path1
> > > >
> > > >         rq->tag = rq->internal_tag;
> > > >         hctx->tags->rqs[rq->tag] = rq;
> > > I /think/ this can be distilled to the SB litmus test:
> > >
> > > // blk_mq_hctx_notify_offline()    blk_mq_get_driver_tag();
> > > Wstate = INACTIVE                  Wtag
> > > smp_mb()                           smp_mb()
> > > Rtag                               Rstate
> > >
> > > and you want to make sure that either blk_mq_get_driver_tag() sees the
> > > state as INACTIVE and does the cleanup, or it doesn't and
> > > blk_mq_hctx_notify_offline() sees the newly written tag and waits for the
> > > request to complete (I don't get how that happens, but hey).
> > >
> > > Is that right?
> > Yeah, exactly.
> >
> > > >     barrier() in case that code path2 is run on same CPU with code path1
> > > >     OR
> > > >     smp_mb() in case that code path2 is run on different CPU with code path1 because
> > > >     of process migration
> > > >
> > > >         test_bit(BLK_MQ_S_INACTIVE, &data.hctx->state)
> > > Couldn't you just check this at the start of blk_mq_get_driver_tag() as
> > > well, and then make the smp_mb() unconditional?
> > As I mentioned, the chance for the current process(calling
> > blk_mq_get_driver_tag()) migration is very small, we do want to
> > avoid the extra smp_mb() in the fast path.
> Hmm, but your suggestion of checking 'rq->mq_ctx->cpu' only works if that
> is the same CPU on which blk_mq_hctx_notify_offline() executes. What
> provides that guarantee?

BLK_MQ_S_INACTIVE is only set when the last CPU of this hctx is going
offline, and blk_mq_hctx_notify_offline() is called from the CPU hotplug
handler. So if any request of this hctx is submitted, it has to be
submitted from that last CPU; that is guaranteed by blk-mq's queue
mapping.

In the case of direct issue, blk_mq_get_driver_tag() basically runs right
after the request is allocated, which is why I mentioned that the chance
of migration is very small.

>
> If there's any chance of this thing being concurrent, then you need the

The only concurrent case is when the process running
blk_mq_get_driver_tag() is migrated to another CPU during direct issue,
and we do add smp_mb() for that case.

> barrier there just in case. So I'd say you either need to prevent the race,
> or live with the barrier. Do you have numbers to show how expensive it is?

Not yet, but we can easily avoid the barrier in this very hot path, so why
not do so? Especially since most of the time preemption won't happen at
all.

Also, the patch itself is correct, and disabling preemption via get_cpu(),
as suggested by Christoph, isn't needed either, because migration implies
smp_mb(). I will document this point in the next version.


Thanks,
Ming
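
P.S. For anyone who wants to play with the ordering argument, below is a
minimal litmus test in the Linux tools/memory-model format that encodes
the SB pattern sketched above. This is only my illustration of the
pattern, not code from the patch: the names 'state' and 'tag' simply
stand in for hctx->state and rq->tag, and the same-CPU barrier() case is
not modelled here since the two sides run on different CPUs. With
smp_mb() on both sides, herd7 reports the "exists" outcome as never
happening, i.e. at least one side must observe the other's store:

C SB+hctx-state+rq-tag

(*
 * Store-buffering pattern: P0 models blk_mq_hctx_notify_offline(),
 * P1 models blk_mq_get_driver_tag() after migration to another CPU.
 *)

{}

P0(int *state, int *tag)
{
	int r0;

	WRITE_ONCE(*state, 1);	/* set_bit(BLK_MQ_S_INACTIVE, ...) */
	smp_mb();		/* smp_mb() or smp_mb_after_atomic() */
	r0 = READ_ONCE(*tag);	/* blk_mq_tags_inflight_rqs() reads rq->tag */
}

P1(int *state, int *tag)
{
	int r1;

	WRITE_ONCE(*tag, 1);	/* rq->tag = rq->internal_tag */
	smp_mb();
	r1 = READ_ONCE(*state);	/* test_bit(BLK_MQ_S_INACTIVE, ...) */
}

exists (0:r0=0 /\ 1:r1=0)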