On Wed, May 06, 2020 at 08:28:03AM +0100, Will Deacon wrote:
> On Wed, May 06, 2020 at 09:24:25AM +0800, Ming Lei wrote:
> > On Tue, May 05, 2020 at 05:46:18PM +0200, Christoph Hellwig wrote:
> > > On Thu, Apr 30, 2020 at 10:02:54PM +0800, Ming Lei wrote:
> > > > BLK_MQ_S_INACTIVE is only set when the last cpu of this hctx is
> > > > becoming offline, and blk_mq_hctx_notify_offline() is called from the
> > > > cpu hotplug handler. So if there is any request of this hctx submitted
> > > > from somewhere, it has to be submitted from this last cpu. That is
> > > > done by blk-mq's queue mapping.
> > > >
> > > > In case of direct issue, basically blk_mq_get_driver_tag() is run
> > > > after the request is allocated, which is why I mentioned the chance
> > > > of migration is very small.
> > >
> > > "very small" does not cut it, it has to be zero.  And it seems the
> > > new version still has this hack.
> >
> > But smp_mb() is used for ordering the WRITE and READ, so it is correct.
> >
> > barrier() is enough when process migration doesn't happen.
>
> Without numbers I would just make the smp_mb() unconditional. Your
> questionable optimisation trades that for a load of the CPU ID and a
> conditional branch, which isn't obviously faster to me. It's also very

Reading the CPU ID is just a percpu READ, and unlikely() is used to
optimize the conditional branch. smp_mb(), on the other hand, can stall
the CPU (for example while the store buffer drains), so it should be
much slower than reading the CPU ID.

See the attached microbench [1]; the result shows that smp_mb() is 10+
times slower than smp_processor_id() plus one conditional branch:

[ 1.239951] test_foo: smp_mb 738701907 smp_id 62904315 result 0 overflow 5120

The microbench is run in a simple 8-core KVM guest whose CPU is
'Model name: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz'. The result is
pretty stable across my 5 runs of VM boot.

> difficult to explain to people and relies on a bunch of implicit behaviour
> (e.g. racing only with CPU-affine hotplug notifier).

It can be documented easily; see the pairing sketch appended as [2].

> If it turns out that the smp_mb() is worthwhile, then I'd suggest improving
> the comment, perhaps to include the litmus test I cooked previously.

I have added a big comment on this usage in V10 already; the
store-buffering shape of that litmus test is appended as [3].

[1] microbench

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 956106b01810..548eec11f922 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3836,8 +3836,47 @@ unsigned int blk_mq_rq_cpu(struct request *rq)
 }
 EXPORT_SYMBOL(blk_mq_rq_cpu);
 
+static unsigned long test_smp_mb(unsigned long cnt)
+{
+	unsigned long start = local_clock();
+
+	while (cnt--)
+		smp_mb();
+
+	return local_clock() - start;
+}
+
+static unsigned long test_smp_id(unsigned long cnt, short *result, int *overflow)
+{
+	unsigned long start = local_clock();
+
+	while (cnt--) {
+		short cpu = smp_processor_id();
+		*result += cpu;
+		if (unlikely(*result == 0))
+			(*overflow)++;
+	}
+	return local_clock() - start;
+}
+
+static void test_foo(void)
+{
+	const unsigned long cnt = 10 << 24;
+	short result = 0;
+	int overflow = 0;
+	unsigned long v1, v2;
+
+	v1 = test_smp_mb(cnt);
+	v2 = test_smp_id(cnt, &result, &overflow);
+
+	printk("%s: smp_mb %lu smp_id %lu result %d overflow %d\n",
+			__func__, v1, v2, (int)result, overflow);
+}
+
 static int __init blk_mq_init(void)
 {
+	test_foo();
+
 	cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
 				blk_mq_hctx_notify_dead);
 	cpuhp_setup_state_multi(CPUHP_AP_BLK_MQ_ONLINE, "block/mq:online",

Thanks,
Ming
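
[2] A minimal sketch of the barrier pairing under discussion, for
reference only. hctx_deactivate(), get_driver_tag_sketch(), alloc_tag()
and free_tag() are hypothetical placeholder names, not the actual V10
code; the sketch only illustrates how the tag-allocation WRITE is
ordered against the BLK_MQ_S_INACTIVE READ, and why the full barrier is
conditional on migration.

/* hotplug side: the notifier runs on the last online CPU of this hctx */
static void hctx_deactivate(struct blk_mq_hw_ctx *hctx)
{
	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
	/* order the WRITE above before reading the tag bitmap below */
	smp_mb__after_atomic();
	/* ... walk the tag bitmap and drain any allocated requests ... */
}

/* submission side: called after the request has been allocated */
static bool get_driver_tag_sketch(struct blk_mq_hw_ctx *hctx,
				  struct request *rq)
{
	rq->tag = alloc_tag(hctx);	/* WRITE: tag becomes allocated */

	/*
	 * Without migration this code runs on the same CPU as the
	 * hotplug notifier, which executes after us in program order,
	 * so a compiler barrier is enough; only a task migrated off
	 * that CPU needs the full smp_mb().
	 */
	if (unlikely(rq->mq_ctx->cpu != raw_smp_processor_id()))
		smp_mb();
	else
		barrier();

	/* READ: did the hotplug path mark this hctx inactive meanwhile? */
	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &hctx->state))) {
		free_tag(hctx, rq->tag);	/* back out and fail */
		rq->tag = -1;
		return false;
	}
	return true;
}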
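
[3] The litmus test Will mentions presumably has the classic
store-buffering shape; this is a reconstruction in Linux-kernel memory
model (herd7) style, not necessarily his exact test. With smp_mb() on
both sides the "exists" outcome is forbidden: either the submission
path observes BLK_MQ_S_INACTIVE, or the hotplug path observes the
allocated tag, so no request can slip through unnoticed.

C blk-mq-inactive-sb

{}

P0(int *tag, int *inactive)
{
	int r0;

	WRITE_ONCE(*tag, 1);		/* allocate the driver tag */
	smp_mb();
	r0 = READ_ONCE(*inactive);	/* check BLK_MQ_S_INACTIVE */
}

P1(int *tag, int *inactive)
{
	int r1;

	WRITE_ONCE(*inactive, 1);	/* mark the hctx inactive */
	smp_mb();
	r1 = READ_ONCE(*tag);		/* look for allocated tags */
}

exists (0:r0=0 /\ 1:r1=0)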