Re: [PATCH 1/1] block: Convert hd_struct in_flight from atomic to percpu

Ming Lei <tom.leiming@xxxxxxxxx> · Fri, 30 Jun 2017 00:25:36 +0800

On Thu, Jun 29, 2017 at 11:58 PM, Jens Axboe <axboe@xxxxxxxxx> wrote:
> On 06/29/2017 02:40 AM, Ming Lei wrote:
>> On Thu, Jun 29, 2017 at 5:49 AM, Jens Axboe <axboe@xxxxxxxxx> wrote:
>>> On 06/28/2017 03:12 PM, Brian King wrote:
>>>> This patch converts the in_flight counter in struct hd_struct from a
>>>> pair of atomics to a pair of percpu counters. This eliminates a couple
>>>> of atomics from the hot path. When running this on a Power system, to
>>>> a single null_blk device with 80 submission queues, irq mode 0, with
>>>> 80 fio jobs, I saw IOPs go from 1.5M IO/s to 11.4 IO/s.
>>>
>>> This has been done before, but I've never really liked it. The reason is
>>> that it means that reading the part stat inflight count now has to
>>> iterate over every possible CPU. Did you use partitions in your testing?
>>> How many CPUs were configured? When I last tested this a few years ago
>>> on even a quad core nehalem (which is notoriously shitty for cross-node
>>> latencies), it was a net loss.
>>
>> One year ago, I saw null_blk's IOPS can be decreased to 10%
>> of non-RQF_IO_STAT on a dual socket ARM64(each CPU has
>> 96 cores, and dual numa nodes) too, the performance can be
>> recovered basically if per numa-node counter is introduced and
>> used in this case, but the patch was never posted out.
>> If anyone is interested in that, I can rebase the patch on current
>> block tree and post out. I guess the performance issue might be
>> related with system cache coherency implementation more or less.
>> This issue on ARM64 can be observed with the following userspace
>> atomic counting test too:
>>
>>        http://kernel.ubuntu.com/~ming/test/cache/
>
> How well did the per-node thing work? Doesn't seem to me like it would

Last time, on ARM64, I remembered that the IOPS was basically recovered,
but now I don't have a such machine to test. Could Brian test the attached patch
to see if it works on big Power machine?

And the idea is simple, just make the atomic counter per-node.

> go far enough. And per CPU is too much. One potential improvement would
> be to change the part_stat_read() to just loop online CPUs, instead of
> all possible CPUs. When CPUs go on/offline, use that as the slow path to
> ensure the stats are sane. Often there's a huge difference between
> NR_CPUS configured and what the system has. As Brian states, RH ships
> with 2048, while I doubt a lot of customers actually run that...

One observation I saw on arm64 dual socket before is that atomic inc/dec on
counter stored in local numa node is much cheaper than cross-node, that is
why I tried the per-node counter. And wrt. in-flight atomic counter, both inc
and dec should happen on CPUs belonging to same numa node in case of
blk-mq.

>
> Outside of coming up with a more clever data structure that is fully
> CPU topology aware, one thing that could work is just having X cache
> line separated read/write inflight counters per node, where X is some
> suitable value (like 4). That prevents us from having cross node
> traffic, and it also keeps the cross cpu traffic fairly low. That should
> provide a nice balance between cost of incrementing the inflight
> counting, and the cost of looping for reading it.
>
> And that brings me to the next part...
>
>>> I do agree that we should do something about it, and it's one of those
>>> items I've highlighted in talks about blk-mq on pending issues to fix
>>> up. It's just not great as it currently stands, but I don't think per
>>> CPU counters is the right way to fix it, at least not for the inflight
>>> counter.
>>
>> Yeah, it won't be a issue for non-mq path, and for blk-mq path, maybe
>> we can use some blk-mq knowledge(tagset?) to figure out the
>> 'in_flight' counter. I thought about it before, but never got a
>> perfect solution, and looks it is a bit hard, :-)
>
> The tags are already a bit spread out, so it's worth a shot. That would
> remove the need to do anything in the inc/dec path, as the tags already
> do that. The inlight count could be easily retrieved with
> sbitmap_weight(). The only issue here is that we need separate read and
> write counters, and the weight would obviously only get us the total
> count. But we can have a slower path for that, just iterate the tags and
> count them. The fast path only cares about total count.
>
> Let me try that out real quick.
>
> --
> Jens Axboe
>



Thanks,
Ming Lei

diff --git a/block/partition-generic.c b/block/partition-generic.c
index c5ec8246e25e..bd6644bf9643 100644
--- a/block/partition-generic.c
+++ b/block/partition-generic.c
@@ -140,8 +140,9 @@ ssize_t part_inflight_show(struct device *dev,
 {
 	struct hd_struct *p = dev_to_part(dev);
 
-	return sprintf(buf, "%8u %8u\n", atomic_read(&p->in_flight[0]),
-		atomic_read(&p->in_flight[1]));
+	return sprintf(buf, "%8u %8u\n",
+		pnode_counter_read_all(&p->in_flight[0]),
+		pnode_counter_read_all(&p->in_flight[1]));
 }
 
 #ifdef CONFIG_FAIL_MAKE_REQUEST
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 96bd13e581cd..aac0b2235410 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -521,7 +521,7 @@ static void start_io_acct(struct dm_io *io)
 	cpu = part_stat_lock();
 	part_round_stats(cpu, &dm_disk(md)->part0);
 	part_stat_unlock();
-	atomic_set(&dm_disk(md)->part0.in_flight[rw],
+	pnode_counter_set(&dm_disk(md)->part0.in_flight[rw],
 		atomic_inc_return(&md->pending[rw]));
 
 	if (unlikely(dm_stats_used(&md->stats)))
@@ -550,7 +550,7 @@ static void end_io_acct(struct dm_io *io)
 	 * a flush.
 	 */
 	pending = atomic_dec_return(&md->pending[rw]);
-	atomic_set(&dm_disk(md)->part0.in_flight[rw], pending);
+	pnode_counter_set(&dm_disk(md)->part0.in_flight[rw], pending);
 	pending += atomic_read(&md->pending[rw^0x1]);
 
 	/* nudge anyone waiting on suspend queue */
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index e619fae2f037..40c9bc74a120 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -15,6 +15,7 @@
 #include <linux/slab.h>
 #include <linux/percpu-refcount.h>
 #include <linux/uuid.h>
+#include <linux/pernode_counter.h>
 
 #ifdef CONFIG_BLOCK
 
@@ -120,7 +121,7 @@ struct hd_struct {
 	int make_it_fail;
 #endif
 	unsigned long stamp;
-	atomic_t in_flight[2];
+	struct pnode_counter in_flight[2];
 #ifdef	CONFIG_SMP
 	struct disk_stats __percpu *dkstats;
 #else
@@ -364,21 +365,22 @@ static inline void free_part_stats(struct hd_struct *part)
 
 static inline void part_inc_in_flight(struct hd_struct *part, int rw)
 {
-	atomic_inc(&part->in_flight[rw]);
+	pnode_counter_inc(&part->in_flight[rw]);
 	if (part->partno)
-		atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
+		pnode_counter_inc(&part_to_disk(part)->part0.in_flight[rw]);
 }
 
 static inline void part_dec_in_flight(struct hd_struct *part, int rw)
 {
-	atomic_dec(&part->in_flight[rw]);
+	pnode_counter_dec(&part->in_flight[rw]);
 	if (part->partno)
-		atomic_dec(&part_to_disk(part)->part0.in_flight[rw]);
+		pnode_counter_dec(&part_to_disk(part)->part0.in_flight[rw]);
 }
 
 static inline int part_in_flight(struct hd_struct *part)
 {
-	return atomic_read(&part->in_flight[0]) + atomic_read(&part->in_flight[1]);
+	return pnode_counter_read_all(&part->in_flight[0]) + \
+		pnode_counter_read_all(&part->in_flight[1]);
 }
 
 static inline struct partition_meta_info *alloc_part_info(struct gendisk *disk)
@@ -627,11 +629,34 @@ extern ssize_t part_fail_store(struct device *dev,
 			       const char *buf, size_t count);
 #endif /* CONFIG_FAIL_MAKE_REQUEST */
 
+static inline int hd_counter_init(struct hd_struct *part)
+{
+	if (pnode_counter_init(&part->in_flight[0], GFP_KERNEL))
+		return -ENOMEM;
+	if (pnode_counter_init(&part->in_flight[1], GFP_KERNEL)) {
+		pnode_counter_deinit(&part->in_flight[0]);
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static inline void hd_counter_deinit(struct hd_struct *part)
+{
+	pnode_counter_deinit(&part->in_flight[0]);
+	pnode_counter_deinit(&part->in_flight[1]);
+}
+
 static inline int hd_ref_init(struct hd_struct *part)
 {
+	if (hd_counter_init(part))
+		return -ENOMEM;
+
 	if (percpu_ref_init(&part->ref, __delete_partition, 0,
-				GFP_KERNEL))
+				GFP_KERNEL)) {
+		hd_counter_deinit(part);
 		return -ENOMEM;
+	}
 	return 0;
 }
 
@@ -659,6 +684,7 @@ static inline void hd_free_part(struct hd_struct *part)
 {
 	free_part_stats(part);
 	free_part_info(part);
+	hd_counter_deinit(part);
 	percpu_ref_exit(&part->ref);
 }
 
diff --git a/include/linux/pernode_counter.h b/include/linux/pernode_counter.h
new file mode 100644
index 000000000000..127639fbc25f
--- /dev/null
+++ b/include/linux/pernode_counter.h
@@ -0,0 +1,118 @@
+#ifndef _LINUX_PERNODE_COUNTER_H
+#define _LINUX_PERNODE_COUNTER_H
+/*
+ * A simple per node atomic counter for use in block io accounting.
+ */
+
+#include <linux/smp.h>
+#include <linux/percpu.h>
+#include <linux/types.h>
+#include <linux/gfp.h>
+
+struct node_counter {
+	atomic_t counter_in_node;
+};
+
+struct pnode_counter {
+	struct node_counter * __percpu  *counter;
+	struct node_counter **nodes;
+};
+
+static inline int pnode_counter_init(struct pnode_counter *pnc, gfp_t gfp)
+{
+	struct node_counter **nodes;
+	int i, j, cpu;
+
+	nodes = kzalloc(nr_node_ids * sizeof(struct node_counter *), gfp);
+	if (!nodes)
+		goto err_nodes;
+
+	for_each_node(i) {
+		nodes[i] = kzalloc_node(sizeof(struct node_counter), gfp, i);
+		if (!nodes[i])
+			goto err_node_counter;
+	}
+
+	pnc->counter = alloc_percpu_gfp(struct node_counter *, gfp);
+	if (!pnc->counter)
+		goto err_node_counter;
+
+	for_each_possible_cpu(cpu)
+		*per_cpu_ptr(pnc->counter, cpu) = nodes[cpu_to_node(cpu)];
+
+	pnc->nodes = nodes;
+
+	return 0;
+
+ err_node_counter:
+	for (j = 0; j < i; j++)
+		kfree(nodes[j]);
+	kfree(nodes);
+ err_nodes:
+	return -ENOMEM;
+}
+
+static inline void pnode_counter_deinit(struct pnode_counter *pnc)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct node_counter *node = *per_cpu_ptr(pnc->counter, cpu);
+
+		kfree(node);
+		*per_cpu_ptr(pnc->counter, cpu) = NULL;
+	}
+	free_percpu(pnc->counter);
+	kfree(pnc->nodes);
+}
+
+static inline void pnode_counter_inc(struct pnode_counter *pnc)
+{
+	struct node_counter *node = this_cpu_read(*pnc->counter);
+
+	atomic_inc(&node->counter_in_node);
+}
+
+static inline void pnode_counter_inc_cpu(struct pnode_counter *pnc, int cpu)
+{
+	struct node_counter *node = *per_cpu_ptr(pnc->counter, cpu);
+
+	atomic_inc(&node->counter_in_node);
+}
+
+static inline void pnode_counter_dec(struct pnode_counter *pnc)
+{
+	struct node_counter *node = this_cpu_read(*pnc->counter);
+
+	atomic_dec(&node->counter_in_node);
+}
+
+static inline void pnode_counter_dec_cpu(struct pnode_counter *pnc, int cpu)
+{
+	struct node_counter *node = *per_cpu_ptr(pnc->counter, cpu);
+
+	atomic_dec(&node->counter_in_node);
+}
+
+static inline void pnode_counter_set(struct pnode_counter *pnc, int val)
+{
+	int i;
+	struct node_counter *node = this_cpu_read(*pnc->counter);
+
+	for_each_node(i)
+		atomic_set(&pnc->nodes[i]->counter_in_node, 0);
+	atomic_set(&node->counter_in_node, val);
+}
+
+static inline long pnode_counter_read_all(struct pnode_counter *pnc)
+{
+	int i;
+	long val = 0;
+
+	for_each_node(i)
+		val += atomic_read(&pnc->nodes[i]->counter_in_node);
+
+	return val;
+}
+
+#endif /* _LINUX_PERNODE_COUNTER_H */