On Thu, Jul 01, 2021 at 11:06:57AM +0100, John Garry wrote:
> On 30/06/2021 04:51, Ming Lei wrote:
> > The default queue mapping builder of blk_mq_map_queues doesn't take
> > NUMA topology into account, so the built mapping is pretty bad, since
> > CPUs belonging to different NUMA nodes are assigned to the same queue.
> > It is observed that IOPS drops by ~30% when running two jobs on the
> > same hctx of null_blk from two CPUs belonging to two NUMA nodes,
> > compared with running them from the same NUMA node.
> >
> > Address the issue by reusing irq_create_affinity_masks() for building
> > the default queue mapping, so that we can re-use the mapping created
> > for managed irq.
> >
> > Lots of drivers may benefit from the change, such as nvme pci poll,
> > nvme tcp, ...
> >
> > Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
> > ---
> >  block/blk-mq-cpumap.c | 60 +++++++++----------------------------------
> >  1 file changed, 12 insertions(+), 48 deletions(-)
> >
> > diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
> > index 3db84d3197f1..946e373296a3 100644
> > --- a/block/blk-mq-cpumap.c
> > +++ b/block/blk-mq-cpumap.c
> > @@ -10,67 +10,31 @@
> >  #include <linux/mm.h>
> >  #include <linux/smp.h>
> >  #include <linux/cpu.h>
> > +#include <linux/interrupt.h>
>
> Similar to what Christoph mentioned, seems strange to be including
> interrupt.h

The ideal way is to abstract the affinity-building code and move it into
lib/, but that needs a bit of refactoring of kernel/irq/affinity.c. Also,
each queue here means one blk-mq hw queue, so it is not too strange to
associate it with an interrupt and re-use the interrupt affinity building
code.

Let's see what Thomas thinks about this usage.

>
> >  #include <linux/blk-mq.h>
> >  #include "blk.h"
> >  #include "blk-mq.h"
> > -static int queue_index(struct blk_mq_queue_map *qmap,
> > -                       unsigned int nr_queues, const int q)
> > -{
> > -        return qmap->queue_offset + (q % nr_queues);
> > -}
> > -
> > -static int get_first_sibling(unsigned int cpu)
> > -{
> > -        unsigned int ret;
> > -
> > -        ret = cpumask_first(topology_sibling_cpumask(cpu));
> > -        if (ret < nr_cpu_ids)
> > -                return ret;
> > -
> > -        return cpu;
> > -}
> > -
> >  int blk_mq_map_queues(struct blk_mq_queue_map *qmap)
> >  {
> > +        struct irq_affinity_desc *masks = NULL;
> > +        struct irq_affinity affd = {0};
>
> should this be simply {}? I forget...

I think both are fine, and both usages can be found in the kernel code.

>
> >          unsigned int *map = qmap->mq_map;
> >          unsigned int nr_queues = qmap->nr_queues;
> > -        unsigned int cpu, first_sibling, q = 0;
> > +        unsigned int q;
> > -        for_each_possible_cpu(cpu)
> > -                map[cpu] = -1;
> > +        masks = irq_create_affinity_masks(nr_queues, &affd);
> > +        if (!masks)
> > +                return -ENOMEM;
>
> should we fall back on something else here? Seems that this function does
> not fail just because out of memory.

The default case is nr_sets == 1, so the only failure is running out of
memory: irq_create_affinity_masks() basically creates a cpumask for each
vector/queue and spreads the possible CPUs among those vectors/queues.

Thanks,
Ming
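
P.S. A tiny illustration of the {0} vs {} point, just a sketch showing why
both spellings behave the same; the affd_a/affd_b names are only for this
example and are not from the patch:

        #include <linux/interrupt.h>

        /*
         * {0} zero-initializes the first member explicitly and the rest
         * implicitly; {} is the GNU (and C23) empty initializer that
         * zeroes every member. Either way affd starts out all-zero, so
         * nr_sets == 0 and irq_create_affinity_masks() falls back to the
         * default single set.
         */
        struct irq_affinity affd_a = {0};
        struct irq_affinity affd_b = {};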
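
Also, in case it helps with the fallback question: once
irq_create_affinity_masks() has spread the possible CPUs over nr_queues
masks, the only work left in blk_mq_map_queues() is to walk each mask and
record the queue index. Below is an untested sketch of that part; the
quoted diff above is trimmed before this point, so treat the loop as my
rough assumption of the idea rather than the exact hunk:

        /*
         * masks[q].mask is the cpumask that irq_create_affinity_masks()
         * computed for hw queue q; recording q (plus the queue offset)
         * for every CPU in that mask fills qmap->mq_map.
         */
        for (q = 0; q < nr_queues; q++) {
                unsigned int cpu;

                for_each_cpu(cpu, &masks[q].mask)
                        map[cpu] = qmap->queue_offset + q;
        }
        kfree(masks);

        return 0;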