On Tue, 6 Sep 2016, Christoph Hellwig wrote:
> [adding Thomas as it's about the affinity_mask he (we) added to the
> IRQ core]
>
> On Tue, Sep 06, 2016 at 10:39:28AM -0400, Keith Busch wrote:
> > > Always the previous one. Below is a patch to get us back to the
> > > previous behavior:
> >
> > No, that's not right.
> >
> > Here's my topology info:
> >
> > # numactl --hardware
> > available: 2 nodes (0-1)
> > node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
> > node 0 size: 15745 MB
> > node 0 free: 15319 MB
> > node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
> > node 1 size: 16150 MB
> > node 1 free: 15758 MB
> > node distances:
> > node   0   1
> >   0:  10  21
> >   1:  21  10
>
> How do you get that mapping? Does this CPU use Hyperthreading and
> thus expose siblings using topology_sibling_cpumask? As that's the
> only thing the old code used for any sort of special casing.

That's a normal Intel mapping with two sockets and HT enabled. The cpu
enumeration is

   Socket0 - physical cores
   Socket1 - physical cores
   Socket0 - HT siblings
   Socket1 - HT siblings

> I'll need to see if I can find a system with such a mapping to reproduce.

Any 2 socket Intel with HT enabled will do. If you need access to one
let me know.

> > If I have 16 vectors, the affinity_mask generated by what you're doing
> > looks like 0000ffff, CPUs 0-15. So the first 16 bits are set, since each
> > of those is the first unique CPU, getting a unique vector just like you
> > wanted. If an unset bit just means share with the previous, then all of
> > my thread siblings (CPUs 16-31) get to share with CPU 15. That's awful!
> >
> > What we want for my CPU topology is the 16th CPU to pair with CPU 0,
> > 17 pairs with 1, 18 with 2, and so on. You can't convey that information
> > with this scheme. We need affinity_masks per vector.
>
> We actually have per-vector masks, but they are hidden inside the IRQ
> core and awkward to use. We could do the get_first_sibling magic
> in the block-mq queue mapping (and in fact with the current code I guess
> we need to). Or take a step back from trying to emulate the old code
> and instead look at NUMA nodes rather than siblings, which some folks
> suggested a while ago.

I think you want both. NUMA nodes are certainly the first decision
factor. You split the number of vectors across the nodes:

   vecs_per_node = num_vector / num_nodes;

Then you spread the vectors of each node across that node's cpus:

   cpus_per_vec = cpus_on(node) / vecs_per_node;

If the number of cpus per vector is <= 1, you just use a round robin
scheme. If not, you need to look at siblings.

Looking at the whole thing, I think we need to be more clever when
setting up the msi descriptor affinity masks. I'll send an RFC series
soon.

Thanks,

	tglx
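
[Editor's note: the following userspace sketch is not part of the original
mail. It only illustrates the node-first spreading Thomas describes above,
hard-coding the two-node, HT-enabled topology from Keith's numactl output;
the helpers node_of() and sibling_of() are hypothetical stand-ins for the
kernel's topology lookups, not existing APIs.]

   /*
    * Sketch of node-first vector spreading for the topology in this
    * thread: 2 nodes, 32 CPUs, the HT sibling of CPU c is c + 16.
    */
   #include <stdio.h>

   #define NUM_CPUS	32
   #define NUM_NODES	2
   #define NUM_VECS	16

   /* Node 0: CPUs 0-7 and 16-23, node 1: CPUs 8-15 and 24-31 */
   static int node_of(int cpu)
   {
   	return (cpu % 16) / 8;
   }

   /* In this enumeration core c shows up a second time as CPU c + 16 */
   static int sibling_of(int cpu)
   {
   	return (cpu + 16) % NUM_CPUS;
   }

   int main(void)
   {
   	int vec_of[NUM_CPUS];
   	int vecs_per_node = NUM_VECS / NUM_NODES;	/* 8 */
   	int vec_base = 0;

   	for (int node = 0; node < NUM_NODES; node++) {
   		int cpus_on_node = 0;

   		for (int cpu = 0; cpu < NUM_CPUS; cpu++)
   			if (node_of(cpu) == node)
   				cpus_on_node++;

   		int cpus_per_vec = cpus_on_node / vecs_per_node;	/* 2 */
   		int idx = 0;

   		for (int cpu = 0; cpu < NUM_CPUS; cpu++) {
   			if (node_of(cpu) != node)
   				continue;
   			if (cpus_per_vec <= 1) {
   				/* More vectors than cpus: plain round robin */
   				vec_of[cpu] = vec_base + idx % vecs_per_node;
   				idx++;
   			} else if (cpu < sibling_of(cpu)) {
   				/*
   				 * First thread of a core gets its own vector,
   				 * the HT sibling shares it (0/16, 1/17, ...).
   				 */
   				vec_of[cpu] = vec_base + idx++;
   				vec_of[sibling_of(cpu)] = vec_of[cpu];
   			}
   		}
   		vec_base += vecs_per_node;
   	}

   	for (int cpu = 0; cpu < NUM_CPUS; cpu++)
   		printf("CPU %2d -> vector %2d\n", cpu, vec_of[cpu]);
   	return 0;
   }

With 16 vectors this puts CPUs 0-7 plus their siblings 16-23 on vectors
0-7 (node 0) and CPUs 8-15 plus 24-31 on vectors 8-15 (node 1), i.e. CPU 16
pairs with CPU 0, 17 with 1, and so on, which is the layout Keith asked for.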