On Thu, Jun 29, 2017 at 11:58 PM, Jens Axboe <axboe@xxxxxxxxx> wrote: > On 06/29/2017 02:40 AM, Ming Lei wrote: >> On Thu, Jun 29, 2017 at 5:49 AM, Jens Axboe <axboe@xxxxxxxxx> wrote: >>> On 06/28/2017 03:12 PM, Brian King wrote: >>>> This patch converts the in_flight counter in struct hd_struct from a >>>> pair of atomics to a pair of percpu counters. This eliminates a couple >>>> of atomics from the hot path. When running this on a Power system, to >>>> a single null_blk device with 80 submission queues, irq mode 0, with >>>> 80 fio jobs, I saw IOPs go from 1.5M IO/s to 11.4 IO/s. >>> >>> This has been done before, but I've never really liked it. The reason is >>> that it means that reading the part stat inflight count now has to >>> iterate over every possible CPU. Did you use partitions in your testing? >>> How many CPUs were configured? When I last tested this a few years ago >>> on even a quad core nehalem (which is notoriously shitty for cross-node >>> latencies), it was a net loss. >> >> One year ago, I saw null_blk's IOPS can be decreased to 10% >> of non-RQF_IO_STAT on a dual socket ARM64(each CPU has >> 96 cores, and dual numa nodes) too, the performance can be >> recovered basically if per numa-node counter is introduced and >> used in this case, but the patch was never posted out. >> If anyone is interested in that, I can rebase the patch on current >> block tree and post out. I guess the performance issue might be >> related with system cache coherency implementation more or less. >> This issue on ARM64 can be observed with the following userspace >> atomic counting test too: >> >> http://kernel.ubuntu.com/~ming/test/cache/ > > How well did the per-node thing work? Doesn't seem to me like it would Last time, on ARM64, I remembered that the IOPS was basically recovered, but now I don't have a such machine to test. Could Brian test the attached patch to see if it works on big Power machine? And the idea is simple, just make the atomic counter per-node. > go far enough. And per CPU is too much. One potential improvement would > be to change the part_stat_read() to just loop online CPUs, instead of > all possible CPUs. When CPUs go on/offline, use that as the slow path to > ensure the stats are sane. Often there's a huge difference between > NR_CPUS configured and what the system has. As Brian states, RH ships > with 2048, while I doubt a lot of customers actually run that... One observation I saw on arm64 dual socket before is that atomic inc/dec on counter stored in local numa node is much cheaper than cross-node, that is why I tried the per-node counter. And wrt. in-flight atomic counter, both inc and dec should happen on CPUs belonging to same numa node in case of blk-mq. > > Outside of coming up with a more clever data structure that is fully > CPU topology aware, one thing that could work is just having X cache > line separated read/write inflight counters per node, where X is some > suitable value (like 4). That prevents us from having cross node > traffic, and it also keeps the cross cpu traffic fairly low. That should > provide a nice balance between cost of incrementing the inflight > counting, and the cost of looping for reading it. > > And that brings me to the next part... > >>> I do agree that we should do something about it, and it's one of those >>> items I've highlighted in talks about blk-mq on pending issues to fix >>> up. It's just not great as it currently stands, but I don't think per >>> CPU counters is the right way to fix it, at least not for the inflight >>> counter. >> >> Yeah, it won't be a issue for non-mq path, and for blk-mq path, maybe >> we can use some blk-mq knowledge(tagset?) to figure out the >> 'in_flight' counter. I thought about it before, but never got a >> perfect solution, and looks it is a bit hard, :-) > > The tags are already a bit spread out, so it's worth a shot. That would > remove the need to do anything in the inc/dec path, as the tags already > do that. The inlight count could be easily retrieved with > sbitmap_weight(). The only issue here is that we need separate read and > write counters, and the weight would obviously only get us the total > count. But we can have a slower path for that, just iterate the tags and > count them. The fast path only cares about total count. > > Let me try that out real quick. > > -- > Jens Axboe > Thanks, Ming Lei
diff --git a/block/partition-generic.c b/block/partition-generic.c index c5ec8246e25e..bd6644bf9643 100644 --- a/block/partition-generic.c +++ b/block/partition-generic.c @@ -140,8 +140,9 @@ ssize_t part_inflight_show(struct device *dev, { struct hd_struct *p = dev_to_part(dev); - return sprintf(buf, "%8u %8u\n", atomic_read(&p->in_flight[0]), - atomic_read(&p->in_flight[1])); + return sprintf(buf, "%8u %8u\n", + pnode_counter_read_all(&p->in_flight[0]), + pnode_counter_read_all(&p->in_flight[1])); } #ifdef CONFIG_FAIL_MAKE_REQUEST diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 96bd13e581cd..aac0b2235410 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -521,7 +521,7 @@ static void start_io_acct(struct dm_io *io) cpu = part_stat_lock(); part_round_stats(cpu, &dm_disk(md)->part0); part_stat_unlock(); - atomic_set(&dm_disk(md)->part0.in_flight[rw], + pnode_counter_set(&dm_disk(md)->part0.in_flight[rw], atomic_inc_return(&md->pending[rw])); if (unlikely(dm_stats_used(&md->stats))) @@ -550,7 +550,7 @@ static void end_io_acct(struct dm_io *io) * a flush. */ pending = atomic_dec_return(&md->pending[rw]); - atomic_set(&dm_disk(md)->part0.in_flight[rw], pending); + pnode_counter_set(&dm_disk(md)->part0.in_flight[rw], pending); pending += atomic_read(&md->pending[rw^0x1]); /* nudge anyone waiting on suspend queue */ diff --git a/include/linux/genhd.h b/include/linux/genhd.h index e619fae2f037..40c9bc74a120 100644 --- a/include/linux/genhd.h +++ b/include/linux/genhd.h @@ -15,6 +15,7 @@ #include <linux/slab.h> #include <linux/percpu-refcount.h> #include <linux/uuid.h> +#include <linux/pernode_counter.h> #ifdef CONFIG_BLOCK @@ -120,7 +121,7 @@ struct hd_struct { int make_it_fail; #endif unsigned long stamp; - atomic_t in_flight[2]; + struct pnode_counter in_flight[2]; #ifdef CONFIG_SMP struct disk_stats __percpu *dkstats; #else @@ -364,21 +365,22 @@ static inline void free_part_stats(struct hd_struct *part) static inline void part_inc_in_flight(struct hd_struct *part, int rw) { - atomic_inc(&part->in_flight[rw]); + pnode_counter_inc(&part->in_flight[rw]); if (part->partno) - atomic_inc(&part_to_disk(part)->part0.in_flight[rw]); + pnode_counter_inc(&part_to_disk(part)->part0.in_flight[rw]); } static inline void part_dec_in_flight(struct hd_struct *part, int rw) { - atomic_dec(&part->in_flight[rw]); + pnode_counter_dec(&part->in_flight[rw]); if (part->partno) - atomic_dec(&part_to_disk(part)->part0.in_flight[rw]); + pnode_counter_dec(&part_to_disk(part)->part0.in_flight[rw]); } static inline int part_in_flight(struct hd_struct *part) { - return atomic_read(&part->in_flight[0]) + atomic_read(&part->in_flight[1]); + return pnode_counter_read_all(&part->in_flight[0]) + \ + pnode_counter_read_all(&part->in_flight[1]); } static inline struct partition_meta_info *alloc_part_info(struct gendisk *disk) @@ -627,11 +629,34 @@ extern ssize_t part_fail_store(struct device *dev, const char *buf, size_t count); #endif /* CONFIG_FAIL_MAKE_REQUEST */ +static inline int hd_counter_init(struct hd_struct *part) +{ + if (pnode_counter_init(&part->in_flight[0], GFP_KERNEL)) + return -ENOMEM; + if (pnode_counter_init(&part->in_flight[1], GFP_KERNEL)) { + pnode_counter_deinit(&part->in_flight[0]); + return -ENOMEM; + } + + return 0; +} + +static inline void hd_counter_deinit(struct hd_struct *part) +{ + pnode_counter_deinit(&part->in_flight[0]); + pnode_counter_deinit(&part->in_flight[1]); +} + static inline int hd_ref_init(struct hd_struct *part) { + if (hd_counter_init(part)) + return -ENOMEM; + if (percpu_ref_init(&part->ref, __delete_partition, 0, - GFP_KERNEL)) + GFP_KERNEL)) { + hd_counter_deinit(part); return -ENOMEM; + } return 0; } @@ -659,6 +684,7 @@ static inline void hd_free_part(struct hd_struct *part) { free_part_stats(part); free_part_info(part); + hd_counter_deinit(part); percpu_ref_exit(&part->ref); } diff --git a/include/linux/pernode_counter.h b/include/linux/pernode_counter.h new file mode 100644 index 000000000000..127639fbc25f --- /dev/null +++ b/include/linux/pernode_counter.h @@ -0,0 +1,118 @@ +#ifndef _LINUX_PERNODE_COUNTER_H +#define _LINUX_PERNODE_COUNTER_H +/* + * A simple per node atomic counter for use in block io accounting. + */ + +#include <linux/smp.h> +#include <linux/percpu.h> +#include <linux/types.h> +#include <linux/gfp.h> + +struct node_counter { + atomic_t counter_in_node; +}; + +struct pnode_counter { + struct node_counter * __percpu *counter; + struct node_counter **nodes; +}; + +static inline int pnode_counter_init(struct pnode_counter *pnc, gfp_t gfp) +{ + struct node_counter **nodes; + int i, j, cpu; + + nodes = kzalloc(nr_node_ids * sizeof(struct node_counter *), gfp); + if (!nodes) + goto err_nodes; + + for_each_node(i) { + nodes[i] = kzalloc_node(sizeof(struct node_counter), gfp, i); + if (!nodes[i]) + goto err_node_counter; + } + + pnc->counter = alloc_percpu_gfp(struct node_counter *, gfp); + if (!pnc->counter) + goto err_node_counter; + + for_each_possible_cpu(cpu) + *per_cpu_ptr(pnc->counter, cpu) = nodes[cpu_to_node(cpu)]; + + pnc->nodes = nodes; + + return 0; + + err_node_counter: + for (j = 0; j < i; j++) + kfree(nodes[j]); + kfree(nodes); + err_nodes: + return -ENOMEM; +} + +static inline void pnode_counter_deinit(struct pnode_counter *pnc) +{ + int cpu; + + for_each_possible_cpu(cpu) { + struct node_counter *node = *per_cpu_ptr(pnc->counter, cpu); + + kfree(node); + *per_cpu_ptr(pnc->counter, cpu) = NULL; + } + free_percpu(pnc->counter); + kfree(pnc->nodes); +} + +static inline void pnode_counter_inc(struct pnode_counter *pnc) +{ + struct node_counter *node = this_cpu_read(*pnc->counter); + + atomic_inc(&node->counter_in_node); +} + +static inline void pnode_counter_inc_cpu(struct pnode_counter *pnc, int cpu) +{ + struct node_counter *node = *per_cpu_ptr(pnc->counter, cpu); + + atomic_inc(&node->counter_in_node); +} + +static inline void pnode_counter_dec(struct pnode_counter *pnc) +{ + struct node_counter *node = this_cpu_read(*pnc->counter); + + atomic_dec(&node->counter_in_node); +} + +static inline void pnode_counter_dec_cpu(struct pnode_counter *pnc, int cpu) +{ + struct node_counter *node = *per_cpu_ptr(pnc->counter, cpu); + + atomic_dec(&node->counter_in_node); +} + +static inline void pnode_counter_set(struct pnode_counter *pnc, int val) +{ + int i; + struct node_counter *node = this_cpu_read(*pnc->counter); + + for_each_node(i) + atomic_set(&pnc->nodes[i]->counter_in_node, 0); + atomic_set(&node->counter_in_node, val); +} + +static inline long pnode_counter_read_all(struct pnode_counter *pnc) +{ + int i; + long val = 0; + + for_each_node(i) + val += atomic_read(&pnc->nodes[i]->counter_in_node); + + return val; +} + +#endif /* _LINUX_PERNODE_COUNTER_H */